Hello everyone, I'm Ilya, and I'll be speaking about visualizing mobility data. The previous talk was about preparing and collecting data, but we also need to be able to analyze and understand it, and that takes good visualization tools as well.

I'm a software engineer. I've been working mostly on geospatial data visualization, I've participated in many projects working with mobility data and representing it in various ways, and my PhD was focused on that topic as well. I also worked for a company called Teralytics, where we visualized mobility in cities, how people move, for different transportation providers, to help them understand demand and improve their services.

We built a dashboard which allowed people to look at mobility data in cities, and the company I worked for was generous enough to allow us to open-source the flow mapping layer, which we did. So there is the flowmap.gl library, which is a custom deck.gl layer implementation, if you're familiar with deck.gl, a framework for efficient geo data visualization.

So you can use that, but then I realized that not everybody is a programmer. That was a couple of years ago; now everybody is, of course. So I developed a tool which people can use without having to know how to program in JavaScript: you just put some data, which must be in a particular format, into a Google spreadsheet, then you pass the URL to the tool, and it will magically visualize it as an interactive map. People started using it and publishing things, which was pretty cool. Then some people started to contribute back; for instance, there is an R integration developed by somebody from the community.

So, briefly, what is a flow map? Who's familiar with the concept of a flow map? Not that many people, okay. It's about visualizing the numbers of movements of people, or goods, or whatever entities, between pairs of geographic locations. You're not really interested in the exact routes people take; it's more about how many people or entities move from A to B, and then you can have additional attributes like time or mode of transport. A flow map is a way to represent that. Often the thickness of the arrows is used to represent the number of people moving, or the color, or sometimes you see animated particles, where the number of particles represents the number of people moving. The direction is important as well, and an arrow is a naturally understood representation of that.
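As an illustration of the deck.gl side of this (a minimal sketch, not flowmap.gl's own layer): deck.gl's generic ArcLayer can render flows between pairs of locations from a plain list of records. The field names (from, to, count) and the sample values are made up for illustration; flowmap.gl's custom layer adds the encodings discussed below.

```ts
// Minimal sketch: flows rendered with deck.gl's generic ArcLayer.
// Not flowmap.gl's layer; data shape and values are illustrative only.
import {Deck} from '@deck.gl/core';
import {ArcLayer} from '@deck.gl/layers';

type Flow = {from: [number, number]; to: [number, number]; count: number};

const flows: Flow[] = [
  {from: [8.54, 47.37], to: [7.45, 46.95], count: 1200}, // example values
  {from: [8.54, 47.37], to: [6.63, 46.52], count: 300},
];

new Deck({
  initialViewState: {longitude: 8.2, latitude: 46.8, zoom: 7},
  controller: true,
  layers: [
    new ArcLayer<Flow>({
      id: 'flows',
      data: flows,
      getSourcePosition: (d) => d.from,
      getTargetPosition: (d) => d.to,
      // Encode the flow magnitude in the arc width.
      getWidth: (d) => Math.sqrt(d.count) / 5,
      getSourceColor: [0, 128, 255],
      getTargetColor: [255, 0, 128],
    }),
  ],
});
```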
The kind of data which can be represented this way is often called origin-destination data, or OD data. You basically have a table with origin, destination and count, and you can have additional columns for attributes like time or mode of transport.

In developing this tool, I put a lot of effort into making the visual representation very readable and understandable. In flowmap.gl there is a double encoding for the number of people moving: the thickness and the color. There is also the sorting of the arrows, so that the most important ones are on top. The arrows have outlines, so that when a lot of them overlap they are still recognizable as individual arrows. Then there is fading: you can configure how much the less important flows are downplayed, how much darker they get.

This plays well with the blending used in flowmap.gl. Basically, it's a way to make sure that the underlying base map is still readable even when you have thousands of arrows overlapping. It wouldn't work with just opacity, because with opacity, when you have overlapping arrows, the colors still add up, so you wouldn't be able to read the underlying map. With blending, you can just use CSS blending, because the arrows are rendered in a separate WebGL context (this is a technicality), and you can make it so that despite thousands of overlapping arrows you can still read the underlying base map.

Then the location totals are represented as circle sizes. It's a bit more complex than that, but I won't go into the details. There is also an alternative way of representing the directionality of the arrows: you can use this fancy animation. Sometimes it's actually more readable, but often it's just more appealing.

Still, with all these techniques, you can have too many flows, which produces a noisy picture, and rendering all of them also hurts performance. But many of them are actually noise: here you see flows which start and end outside of the viewport. What's the point of rendering them? They only obscure the picture. So what flowmap.gl does is adaptive filtering: we filter the flows so that only those which start or end within the viewport are rendered. Also, we don't show more than a certain number of flows, which you can adjust. So when you zoom in, you see the detail for that particular region. We also adjust the scales.
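To make the adaptive filtering idea concrete, here is an illustrative sketch of that step in isolation (this is not flowmap.gl's actual code; the types and the cap of 5000 flows are assumptions for illustration):

```ts
// Illustrative sketch of adaptive filtering: keep only flows that start or end
// inside the current viewport, and cap how many are shown, keeping the largest.
type Flow = {origin: [number, number]; dest: [number, number]; count: number};
type Viewport = {west: number; south: number; east: number; north: number};

const inViewport = ([lon, lat]: [number, number], v: Viewport) =>
  lon >= v.west && lon <= v.east && lat >= v.south && lat <= v.north;

function filterFlowsForViewport(flows: Flow[], v: Viewport, maxFlows = 5000): Flow[] {
  return flows
    .filter((f) => inViewport(f.origin, v) || inViewport(f.dest, v))
    .sort((a, b) => b.count - a.count) // largest flows first
    .slice(0, maxFlows);
}
```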
So when you zoom in, the flows which are the largest for that particular viewport pop up. This way you can explore the regions in more detail as you zoom in, and it improves performance.

Still, you can often get messy datasets like this one, which is bus travel in São Paulo. Here the arrows are relatively long in relation to the distribution of the locations, so you get lots of overplotting, and it's not very readable. Or here, the opposite extreme: migration in Australia, where most people migrate close by, not across the whole country. So here you basically don't see the flows at all, because they are too short.

So how can we address that? Maybe we can create a useful summary which works at any zoom level, independently of the exact distribution of the locations and flows. For that, flowmap.gl does hierarchical clustering. Who's familiar with hierarchical clustering? Some people, okay, that's fine. Basically, we have the locations, and we calculate the total flows for each location: total outgoing, incoming, it can be different metrics, but basically how important the locations are. We start with the largest, so that when we group locations together, the largest one defines where the cluster will be located; we don't want to move Brussels to some smaller location.

Then, for every zoom level, we have a certain radius within which we group together the locations which are close to the ones we start with. So we group them together and get the clusters. We move them to their centers of mass, weighted by the total flows, and we get the cluster locations at this level. Now we have to recalculate the flows, so we aggregate all the flows between the constituents of the clusters, and we get the new flows at this level. This way we get an aggregated summary for this particular zoom level. What makes it hierarchical is that we can repeat this process and go up, building a hierarchy of clusters for the zoom levels we are interested in.

With this approach, from these messy datasets we get something like this, which is much more readable. But if we only look at one zoom level, we lose the details. The good thing is that this is fast enough to do interactively, so we can zoom in and out and the clusters will expand and collapse depending on the zoom level.
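Here is an illustrative sketch of a single level of that clustering, under the assumptions just described (greedy grouping around the largest locations within a zoom-dependent radius, flow-weighted centroids, re-aggregated flows). It is not the actual flowmap.gl implementation, just the same idea spelled out:

```ts
// One level of the hierarchy: group locations around the biggest ones, place
// each cluster at the flow-weighted center of mass, and re-aggregate the flows.
type Location = {id: string; x: number; y: number; total: number};
type Flow = {origin: string; dest: string; count: number};

function clusterOneLevel(locations: Location[], flows: Flow[], radius: number) {
  const assignment = new Map<string, string>(); // location id -> cluster id
  const clusters: Location[] = [];

  // Start with the most important locations so they anchor the clusters.
  const sorted = [...locations].sort((a, b) => b.total - a.total);
  for (const loc of sorted) {
    if (assignment.has(loc.id)) continue;
    // Every unassigned location within `radius` joins this cluster.
    const members = sorted.filter(
      (m) => !assignment.has(m.id) && Math.hypot(m.x - loc.x, m.y - loc.y) <= radius
    );
    const weight = members.reduce((s, m) => s + m.total, 0) || 1;
    clusters.push({
      id: `cluster:${loc.id}`,
      // Center of mass weighted by the location totals.
      x: members.reduce((s, m) => s + m.x * m.total, 0) / weight,
      y: members.reduce((s, m) => s + m.y * m.total, 0) / weight,
      total: weight,
    });
    for (const m of members) assignment.set(m.id, `cluster:${loc.id}`);
  }

  // Re-aggregate the flows between the clusters of their endpoints.
  const agg = new Map<string, Flow>();
  for (const f of flows) {
    const o = assignment.get(f.origin)!;
    const d = assignment.get(f.dest)!;
    const key = `${o}->${d}`;
    const existing = agg.get(key);
    if (existing) existing.count += f.count;
    else agg.set(key, {origin: o, dest: d, count: f.count});
  }
  return {clusters, flows: [...agg.values()]};
}
```

Repeating this with a larger radius on the resulting clusters gives the next level up, which is what makes the whole thing hierarchical.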
This is public transport in Brisbane, and you can zoom in and see more detail about a particular region and how people move there. This is road traffic; with this approach it's not origin-destination data, it's how many people move within each segment, which can also be, in a way, converted into OD. This is public transport in Zurich, and here there is also a temporal dimension: you can filter by tram line or over time.

By the way, this app is not open source, but much of the technology behind it is, and I'm going to open-source parts of it as part of a new project I will mention. This one is using DuckDB. Who knows DuckDB? A few people, okay. This is a pretty cool database which has some advantages for this kind of use case. First, it's made for analytics: it uses a columnar data representation, which means that queries, aggregations for instance, are pretty efficient. It's also embeddable basically anywhere. In the example I showed you before, DuckDB was running directly in the browser, so there was no backend serving these queries, and every time I zoomed in and out or moved the viewport, there were dozens of SQL queries preparing the data for that particular viewport. DuckDB, since it can run in the browser via WebAssembly, can do that, and it requires very little infrastructure setup. Basically, you just need to download the data, for example a Parquet file from S3, add it to the DuckDB running in the browser, and you are good to go: you can run queries directly in the browser.

So in this app, when you load the dataset, it prepares (it's too small, I know, but don't worry) pre-aggregated tables, which make it easier to run the later queries for a particular viewport and get the data it needs. For instance, for the different zoom levels there is a mapping between the original locations and the clusters on each zoom level, so it can quickly do the mapping and prepare the data. So this is the demo; if you want to try it out, you can quickly scan the QR code. I won't be showing it live for the sake of time.

So now I'm working on this new open-source project. It's called SQLRooms, and it's basically a framework which helps you build data analysis apps, backed by DuckDB. It's very modular: there are different kinds of modules, and you decide which functionality you want to add to your application. So, a basic dataset browser, a database browser where you can see the tables you have; you can have an AI assistant which will generate SQL queries for you, and those can still run in the browser without sending your data to any external providers. Many things. It's been used by several companies, specifically for data-intensive data visualization applications. And I'm now working on an example showing a flow map which serves the data directly from DuckDB in the browser.
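For readers who haven't used DuckDB in the browser, here is a minimal sketch of that pattern with the @duckdb/duckdb-wasm package (booting DuckDB via WebAssembly and querying a Parquet file over HTTP). The Parquet URL, table and column names are hypothetical, and this is not the code of the app shown:

```ts
import * as duckdb from '@duckdb/duckdb-wasm';

// Boot DuckDB-Wasm in the browser (standard pattern from the duckdb-wasm docs).
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], {type: 'text/javascript'})
);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
const conn = await db.connect();

// Hypothetical flows table in a Parquet file on S3; names are made up.
await conn.query(`
  CREATE VIEW flows AS
  SELECT * FROM read_parquet('https://my-bucket.s3.amazonaws.com/flows.parquet');
`);

// Aggregate flows directly in the browser, no backend involved.
const result = await conn.query(`
  SELECT origin, dest, SUM(count) AS count
  FROM flows
  GROUP BY origin, dest
  ORDER BY count DESC
  LIMIT 5000;
`);
console.log(result.toArray());
```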
As part of this work, I wanted to have a demo for this conference, but I haven't managed to get it done in time. Sorry about that; it's work in progress, and I'm explaining what I'm working on.

For even larger datasets: DuckDB is already pretty powerful, you can have millions of rows in your flows table and it will be good enough. But if you have many attributes, or small temporal buckets, the number of rows in your flows table will multiply very quickly, and at some point it will be too large to load the entire thing into the browser. You want to avoid that. What you can do is prepare your dataset in such a way that you only fetch the parts which you actually need for the current viewport.

DuckDB supports HTTP range requests. So basically, if you have a Parquet file somewhere on S3, and it has a column by which you can filter, you can say: I want this range of values of this column. And if the rows are sorted in the right way, then it will only need to read those parts of the table which satisfy your query condition. If you do it this way, it will fetch data on demand as you change your viewport, without having to load the entire table.

To do this, who is familiar with space-filling curves? Just a few people, okay. This is a mathematical concept, but it's actually easy to explain. The purpose is to compress two dimensions into a single dimension. So you have two columns, x and y (or lat/lon, but better use x and y already projected). These are two numbers, and you could store them separately, but our purpose is to have a single column by which we can sort, so that it's easy to query, and so that things which are close to each other on the map are also close to each other in the table and we need to fetch fewer chunks.

This is the way to do that: basically, you draw a curve which fills the entire plane, and at every junction you put a number, one, two, three, four, and so on. This number is the index, which you can then put in your column. This way you reduce the two dimensions x and y to a single index, and this is the number you want to sort by, so that you only need to read a few chunks from the table.

DuckDB has a function for this, ST_Hilbert, in the DuckDB spatial extension (you need to load the spatial extension) to calculate this index. And as part of the project I'm working on, there will be a Python script for preparing this: you give it a normal OD dataset, just a simple plain origin-destination-count table, and it will produce a Parquet file, sorted and with the index. So this is roughly the SQL query you need to prepare this.
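Here is a hedged sketch of such a preparation step, written with the `duckdb` Node package rather than the Python script mentioned. The input column names (origin_lon, origin_lat, dest_lon, dest_lat, count) are hypothetical, and the exact ST_Hilbert signature and the construction of the bounds argument are assumptions to check against the DuckDB spatial extension docs. The nested ST_Hilbert call is the trick explained right after this block.

```ts
// Sketch only: prepare a Parquet file sorted by a combined Hilbert index.
// Not the author's actual script; column names and bounds are assumptions.
import duckdb from 'duckdb';

const db = new duckdb.Database(':memory:');

db.exec(
  `
  INSTALL spatial;
  LOAD spatial;

  -- Hypothetical plain OD input table.
  CREATE TABLE od AS SELECT * FROM read_csv_auto('od.csv');

  COPY (
    SELECT
      od.*,
      -- 2D -> 1D for the origin and for the destination, then 2D -> 1D again
      -- over the two resulting indexes (explained just below).
      ST_Hilbert(
        ST_Hilbert(origin_lon, origin_lat, ST_Extent(ST_MakeEnvelope(-180, -90, 180, 90))),
        ST_Hilbert(dest_lon,   dest_lat,   ST_Extent(ST_MakeEnvelope(-180, -90, 180, 90))),
        -- assumed: bounds of the index space produced by the first pass
        ST_Extent(ST_MakeEnvelope(0, 0, 4294967295, 4294967295))
      ) AS flow_h
    FROM od
    -- Sorting by flow_h is what makes HTTP range reads effective later.
    ORDER BY flow_h
  ) TO 'flows.parquet' (FORMAT parquet);
  `,
  (err) => {
    if (err) throw err;
    console.log('flows.parquet written, sorted by flow_h');
  }
);
```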
But the trick is that you need to do this Hilbert index calculation twice, because we have lat/lon, but for OD data it's not just one point: there is a start and an end. So you do it twice: first for the origin lat/lon and for the destination lat/lon, and you get two numbers. But then you can apply ST_Hilbert again to these two numbers you got for the origin and the destination, and it compresses them again, so you go from four dimensions down to a single dimension. You sort your table by this single column, flow_h in this case, and then the flows which have the same or close values in this column will also be close to each other on the map, so we don't need to read as many chunks to represent the data.

So yeah, that's it. To follow this work, the best place is probably SQLRooms, where the new stuff will be appearing; otherwise flowmap.gl, flowmap.blue, the flowmapblue R package, and FlowmapCity. And here is the example you can try. Thank you.

Thank you. Is there time for questions? No, okay, sorry about that. But reach out to me.
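To round off the range-request point (a hedged sketch reusing the hypothetical names from the preparation example above): because the Parquet file is sorted by flow_h, a viewport can be translated into one or a few flow_h ranges, and a query restricted to those ranges only needs the Parquet row groups whose min/max statistics overlap them, which DuckDB can fetch with HTTP range requests instead of downloading the whole file.

```ts
// Sketch only: build the viewport query against the sorted Parquet file.
// The flow_h ranges would come from mapping the viewport corners through the
// same double Hilbert indexing used when the file was prepared.
function viewportFlowsQuery(parquetUrl: string, hRanges: Array<[number, number]>): string {
  const rangeFilter = hRanges
    .map(([lo, hi]) => `flow_h BETWEEN ${lo} AND ${hi}`)
    .join(' OR ');
  return `
    SELECT origin_lon, origin_lat, dest_lon, dest_lat, count
    FROM read_parquet('${parquetUrl}')
    WHERE ${rangeFilter}
  `;
}

// Example (hypothetical URL and range):
// viewportFlowsQuery('https://my-bucket.s3.amazonaws.com/flows.parquet', [[120000, 180000]]);
```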