3

I want to store and find back driver location data in the standard design uber/lyft problem.

I was researching about possible system design. Several videos and tutorials usually describe storing location data in sql as not efficient and recommend to use quad trees as a better alternative.

What about using an sql design with 2 indexes for both latitude and longitude coordinates and use a square that defines the range of latitude/longitude to get the data ?

I understand quad trees can be better but why, and how much better?

4
  • 3
    This seems like an optimization question, and the answers are always "it depends on your use case, and you can't be sure unless you profile". It's usually going to depend on access patterns, and we don't know yours.
    – Useless
    CommentedSep 12, 2024 at 21:48
  • @Useless thanks for always answering to my questions. It's just the standard design uber/lyft question and this is for driver location data.
    – codefast
    CommentedSep 12, 2024 at 22:17
  • 3
    @codefast: "standard design uber/lyft problem" - are we supposed to know what that means?
    – Doc Brown
    CommentedSep 13, 2024 at 5:19
  • 1
    I made some minor edits to put the context first and clarify the envisaged sql approach. Don't hesitate to adjust or rollback if the edits aren't accurate.CommentedSep 13, 2024 at 6:26

4 Answers 4

8

For querying spatial data, even when you have two separate indexes on latitude and longitude, databases will only be able to utilize them one after another, but not in a truely combined fashion. After applying the first index on, lets say, the latitude, the remaining set of data points has to be processed in full somehow. The index on the longitude works only on the full data set and cannot be used to reduce the subset of the data returned from the first query part efficiently without reindexing. Obviously, trying things the other way round by starting with the longitude won't change this.

For smaller data sets and certain query requirements reducing the data in one dimension might be fast enough. But if you are going to query all points in a 10m x 10m square, within in a total 10km x 10km raster, and you assume a median point density of ~1 point per square meter then

  • there are 100 million points in that raster

  • there are still 100,000 points which have to be processed one-by-ony after reducing the points to a stripe of 10m by utilizing the first index.

But what about index intersection?

Index intersection is a feature for making use of two indexes for queries. It is explained here for PostgresQL. For a query like WHERE x BETWEEN (x1,x1+10) AND y BETWEEN (y1,y1+10) this works as follows: it first applies the index on x and the index on y individually, yielding the ~100.000 points from each 10m stripe in form of two memory bitmaps. Afterwards it intersects the two bitmaps. This might be fast enough in reality for many applications, but obviously cannot reduce the order of growth of the running time.

A quad tree, however, is the two-dimensional equivalent of an otherwise single dimensional index. It will allow you to find the ~100 points with a running time which is proportional to the logarithm of the total number of points. If each of the tree levels reduces the number of points by, lets say a factor of 4, this means this will require a number of roughly log4(100 millions) steps through the tree, which means ~14 steps.

8
  • 2
    Although I get it, I wonder if this completely answers the question. Indeed, indexes are implemented using trees ( not binary trees but b-tries, which are N-ary trees to optimize paging); when using two indexes (or a composite index) the db optimizer may joyfully combine the trees, which could the look close to quad trees as the logarithmic log N argument also applies to the b-trees. What is missing here imho is how the quads allow to effectively reduce the candidates by ensuring the joyful combination of the two dimensions.CommentedSep 13, 2024 at 6:42
  • 4
    @Christophe: is this ("What is missing here imho is how the quads allow to effectively reduce the candidates") a question of yours? I mean, how, a quadtree works is explained well in Wikipedia. And no, an optimizer cannot "magically" combine two single-dimensional indexes into one two-dimensional - one needs a special data structure - the quad tree - for this.
    – Doc Brown
    CommentedSep 13, 2024 at 11:23
  • 4
    OP asked "I understand quad trees can be better but why", so it's not my question and I'll let OP judge in the end. Regarding the combination of several indexes by the db optimizer, I have some sympathy for your arguments, but I think some engines are doing advanced index intersection, to combine 2 indexes on two columns of a complex selection. I've not experimented very much with it but it may be applicable here without being total magic ;-)CommentedSep 13, 2024 at 14:45
  • 4
    @DocBrown E.g. Oracle's query optimizer definitely can intersect two indexes on two different columns. IIRC they call this a "Hash Join" in the execution plan. I'm sure, other databases do similar things.CommentedSep 13, 2024 at 15:15
  • 1
    @RalfKleberhoff: I am sure this will not work for intersections of two ranges {xmin,xmax], [ymin,ymax]. I am am sure indexing by hashing does not work for comparison operator "greater", "smaller", only for equality operators.
    – Doc Brown
    CommentedSep 13, 2024 at 16:23
4

Many databases has support for spatial indices. I think most of these uses R-Trees for indexing, as R trees should handle uneven point distributions better than quadtrees.

Several videos and tutorials usually describe storing location data in sql as not efficient and recommend to use quad trees as a better alternative.

My guess is that a database should be sufficient for your use case. You may also consider specialized databases or plugins that are specifically designed for geographical data, see PostGis as an example.

But for some use cases you may need to provide a specialized implementation that is optimized for your specific use case. You would probably not build a raytracer around a sql database as an example.

What about using an sql design with 2 indexes for both latitude and longitude coordinates and use a square that defines the range of latitude/longitude to get the data ?

Most database indices is essentially an ordering of rows. So it can only use a single index for lookup. Creating a composite index will also not really help much since there will probably be only a few points with the same exact latitude or longitude, so it would only eliminate points in on dimension in an efficient way.

I understand quad trees can be better but why, and how much better?

Quad trees, (or kd-trees, R-trees etc) allow you you quickly eliminate far away points in both dimensions. The improvement compared to a 1D index should be O(log n) compared to O(n log n)

You may consider the simplest spatial index, a simple grid. Take your lat/long coordinate and truncate it to the nearest kilometer in each dimension. You can then combine this into a single number by just concatenating the bits and use it as an index. To find all points within one km you would just fetch the cell containing your search point, as well as all neighboring cells, and apply a second, more exact filtering step.

The downside with this simple method is if you want to find points at some other radius than 1km, or if you want to find the closest point, regardless of how far away it is. This is where spatial trees have their advantage. A sparse quadtree can subdivide cells until each contains about the same amount of points, removing the need to specify a cell size.

    3

    There are SQL DMBS with a spatial index support. Those may use quad trees or other structures and will most likely outperform a custom solution.

      1

      The database is not necessarily a bad idea

      Searching for items in a square in an SQL database, requires to search for longitude and latitude each in a range.

      If this query is executed sequentially, first looking for points in the latitude range using the first index, then filtering them on the longitude range (or vice-versa) could require to go through a large number of items. If you have however only a few thousands of vehicles, it would not be an issue at all.

      If you would have a very large number of points, the sequential approach would be very inefficient. Fortunately, modern DBMS implement advanced optimisations techniques. One of it is called "index intersection": it uses the fact that two columns are used each having an index. It then looks at the intersection of the row-ids relevant for the two columns.

      I am not (or no longer since a while) a DBMS optimisation specialist. My takeaway was that the DB optimising engine is most of the time better than what we think. I do not know however if the popular DBMS that offer this technique are able to always spot its best use. I've read for example that older versions of the engine would rarely use it.

      So a very pragmatic approach would be to start with your idea, and analyse the performance with a representative data-load (remember, that if there re only a few demo data rows, the dbms engine may use another execution plan based on the known volumes). If you find a bottleneck, you may then consider to refactor your already working but slow software, to use positional data types that leading DBs support.

      I was wondering if the fact that DB vendors propose positional data types could be linked to inefficiencies in the query on attitude and longitude. But this is not necessarily a given. Because computing distance with GPS coordinates is not just 2D computations, as latitude and longitude are on a sphere and the square distorts the reality (again, it depends on the scale- within a single city, the distortion should be minimal). Moreover GPS has also an altitude, which may or may not matter in your case.

      Quad-tree alternative

      Implementing your own quad-tree solution rather than using positional types of a DBMS could be fun. But it may take time.

      The trick is to have the leaves positioned compared to their parents. It's as if the parent would be in a center of the square, and the leaves each correspond to a subdivision of the parent's square in 4. The more points, the smaller the subdivision of the squares they are in.

      Finding the potential nearby candidates consist then of a tree traversal starting at the point and looking for the neighbours in the same parent, then in other adjacent boxes, using the knowledge on the relative positioning (top-left, top-right, bottom-left, bottom-right) of the nodes. It is faster to search in the quad tree, because the subdivision of the space in smaller squares depends on the topology of points: the more points in one region, the smaller the boxes that subdivide it. This allows to discard a lot of false positives. On the other side, you'll need to update the tree as the position of the points moves, this is also challenging and could also require some workload.

      See also this article for additional explanations, and there are a number of videos on youtube that explains it quite well and visually, for example this one on building the tree, and this one on range search.

        Start asking to get answers

        Find the answer to your question by asking.

        Ask question

        Explore related questions

        See similar questions with these tags.