Map Search Optimization

CityHiker uses a TON of data. To give some perspective, the city of San Francisco contains a little over 50k street intersections. If we assume they are on an even square grid (they’re not), that means roughly 225 x 225. According to Wolfram, a grid graph G(m,n) has 2mn - m - n edges. For a square grid, m = n, so there are 2(n^2 - n) edges. That means a minimum of roughly 100k line segments is required to model the city of San Francisco. That does not account for dead-end streets or grid edges composed of multiple segments in a chain.
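As a quick sanity check, here is that arithmetic in a few lines of TypeScript (assuming the idealized 225 x 225 grid described above):

// Edge count of an idealized n x n grid graph: 2mn - m - n, with m = n.
const n = 225;                    // ~sqrt of 50k intersections
const edges = 2 * n * n - n - n;  // 2(n^2 - n)
console.log(edges);               // 100800, i.e. roughly 100k segments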

Map Search

Searching for line segments in a rectangular space seems easy enough. We do a bounding box query for both start and end points to get everything inside the bounding region. If we provide our query in terms of precise floating point values, that means our database must check every line segment in the entire collection each time a request is received. That doesn’t scale at all. As the database grows, requests will be slower and slower. We could do some kind of shard-based strategy, where we introduce a set of discrete collections of line segments and restrict requests to regions. That requires us to maintain some overlap between regions, so we can avoid the edge cases (literally).

None of the bounding box approaches really enable caching at the web service layer. In order to support caching in a more meaningful way, we need to have a query that has a 1-1 deterministic relationship with the results. Let’s dive into the nuances of this.

Bounding Box Approach

find all segments 
  where (lat0, lng0) <= startLat/startLng <= (lat1, lng1) 
  or (lat0, lng0) <= endLat/endLng <= (lat1, lng1)

As mentioned above, this requires a comparison against every item in the segments table, which is very slow for large data sets. Databases can use indices to speed up this kind of lookup, but the primary input here is four arbitrary floating point numbers, representing the bottom-left and top-right corners of the lat/lng bounding box. If these numbers are conveyed as strings in the query, it looks something like this:

GET /segments?lat0=38.79912&lng0=-122.41456&lat1=38.79965&lng1=-122.41447

It’s fairly clear that subtle changes in the values result in different query strings, each with its own copy of the results in the cache. This is bad: minor differences in the query often have no effect on the results, yet they fill the cache with duplicate result sets. Additionally, the likelihood that multiple clients send the exact same query is very low.

Discrete Grid Approach

find all segments 
  where (trunc(startLat * 1000) = gridLat and trunc(startLng * 1000) = gridLng) 
  or (trunc(endLat * 1000) = gridLat and trunc(endLng * 1000) = gridLng)

This approach dramatically increases the likelihood of multiple identical requests, because truncating the coordinates to a much coarser grid collapses many distinct high-precision inputs into the same low-precision query. Here’s an example.

Consider this input:

(38.79912, -122.41456 -> 38.79965, -122.41447)

With the discrete grid approach, the above resolves like this:

(38799, -122414 -> 38799, -122414)

Note that both corners are now the same value, so we can consider the input as a single grid cell, indexed on the lat/lng pair.

Now, consider another input nearby:

(38.79936, -122.41498 -> 38.79912, -122.41466)

Both inputs resolve to the same discrete grid values, even though their coordinates differ by tens of meters. That’s the key: the discrete grid approach consolidates all of the results within one discrete rectangular cell into a single result set. By truncating coordinates to the nearest thousandth of a degree (roughly a hundred meters), all segments in that cell can be cached for repeat requests. We are no longer querying a range, so the data source can simply look up results by the lat/lng pair. The query might look something like this:

GET /segments/38799/-122414
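A minimal sketch of that mapping, assuming truncation toward zero (which matches the examples above); gridCell and segmentsPath are hypothetical helper names:

// Map a coordinate onto the discrete hundred-meter grid by truncating
// toward zero at a factor of 1000 (three decimal places).
function gridCell(lat: number, lng: number): { gridLat: number; gridLng: number } {
  return { gridLat: Math.trunc(lat * 1000), gridLng: Math.trunc(lng * 1000) };
}

// Build the request path for the endpoint shown above.
function segmentsPath(lat: number, lng: number): string {
  const { gridLat, gridLng } = gridCell(lat, lng);
  return `/segments/${gridLat}/${gridLng}`;
}

// Nearby inputs collapse to the same cacheable path.
console.log(segmentsPath(38.79912, -122.41456)); // "/segments/38799/-122414"
console.log(segmentsPath(38.79936, -122.41498)); // "/segments/38799/-122414"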

Side Effects

There are some beneficial side effects to using the discrete grid approach. Adjacent regions can be queried by incrementing or decrementing the input values, allowing easy precaching of neighboring data sets. Also, we can stratify the discrete grid into layers, representing collections of results at different grid sizes.
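For example, a client could prefetch the surrounding cells with something like this (a sketch; the path format follows the example above):

// Enumerate request paths for a grid cell and its eight neighbors,
// suitable for precaching adjacent regions.
function neighborPaths(gridLat: number, gridLng: number): string[] {
  const paths: string[] = [];
  for (let dLat = -1; dLat <= 1; dLat++) {
    for (let dLng = -1; dLng <= 1; dLng++) {
      paths.push(`/segments/${gridLat + dLat}/${gridLng + dLng}`);
    }
  }
  return paths;
}

console.log(neighborPaths(38799, -122414).length); // 9 (center cell included)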

Since we used a discretization factor of 1000, we have effectively defined a hundred-meter grid. This works well for wide zoom levels, although it produces large result sets. We can use a factor of 10000 to define a finer, roughly ten-meter grid, which is better suited to narrow zoom levels.

Important note: the results for a hundred-meter grid cell contain all of the results for every ten-meter cell inside it. These are identical data points, so it’s unnecessary to make a ten-meter request after receiving results from a hundred-meter request that covers that cell.
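To see why, note that with truncation a ten-meter cell index (factor 10000) maps onto its hundred-meter parent (factor 1000) by dropping the last digit; a brief sketch:

// With truncation toward zero, a factor-10000 (ten-meter) cell index maps
// onto its factor-1000 (hundred-meter) parent by integer division by 10,
// so the coarse response is a superset of the fine one.
const lat = 38.79912, lng = -122.41456;

const fineLat = Math.trunc(lat * 10000);   // 387991
const fineLng = Math.trunc(lng * 10000);   // -1224145

console.log(Math.trunc(fineLat / 10) === Math.trunc(lat * 1000)); // true (38799)
console.log(Math.trunc(fineLng / 10) === Math.trunc(lng * 1000)); // true (-122414)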

 

Data Optimization

Given the restrictions of a JSON web service, all data values are stringified. This is inconvenient for large result sets, as it means a large number of bytes per response. That has an impact on the infrastructure budget, since bandwidth is usually not free. Therefore, we have strong motivation to optimize the way we represent the line segment data.

Computing Discrete Keys

A line segment is represented by start and end coordinates (latitude, longitude, altitude), a globally unique id, and a value for percentGrade. Each segment also has a computed value representing the hundred-meter grid index. Let’s dive into computation of this index.

Both latitude and longitude are given with five decimal places of precision, like this:

(38.79912, -122.41456)

This roughly equates to a one-meter grid. Let’s focus on latitude.

38.79912 (32-bit float)

Multiplying that by 100,000 gives a maximum absolute value of 9,000,000 for latitude (18,000,000 for longitude). Either fits into a 26-bit signed integer, but we will pack it into a 30-bit field: a 25-bit unsigned magnitude in bits 0-24, plus a sign bit at bit 29.

3879912 (25-bit unsigned), 0 (sign bit)

Now, let’s look at longitude.

-122.41456 (32-bit float)
12241456 (25-bit unsigned), 1 (sign bit)

We use 30-bit values because they can be represented by five Base64 characters. Concatenating the Base64 values for latitude and longitude gives a ten-character Base64 string that serves as the grid cell index; at full five-decimal precision this is the one-meter key, and encoding coordinates truncated to three decimal places yields the hundred-meter index the same way.
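Here is a sketch of that encoding; encodeCoord is a hypothetical helper name, and the standard Base64 alphabet is assumed, so the exact characters produced may not match the example strings in this post:

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Pack a coordinate into a 30-bit field (25-bit magnitude in bits 0-24,
// sign bit at bit 29) and emit it as five Base64 characters, 6 bits each.
function encodeCoord(value: number): string {
  const scaled = Math.round(Math.abs(value) * 100000); // five decimal places
  const field = (value < 0 ? 1 << 29 : 0) | scaled;    // sign-magnitude
  let out = "";
  for (let shift = 24; shift >= 0; shift -= 6) {
    out += B64[(field >> shift) & 0x3f];
  }
  return out;
}

// Ten-character key: latitude followed by longitude.
const oneMeterKey = encodeCoord(38.79912) + encodeCoord(-122.41456);
console.log(oneMeterKey.length); // 10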

Using this approach, we can construct the set of hundred-meter indices for an arbitrary bounding box. Each index can then be looked up in a local hash map that maps the Base64 string index to an array of segments.
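A sketch of that lookup, reusing encodeCoord from the sketch above; it assumes the hundred-meter index is the same encoding applied to coordinates truncated to three decimal places, and the Segment shape here is only a placeholder:

// A segment as stored client-side (field names are placeholders).
interface Segment {
  id: number;
  startLat: number; startLng: number;
  endLat: number;   endLng: number;
  percentGrade: number;
}

// Enumerate the hundred-meter keys covering a bounding box and pull the
// corresponding segment arrays out of a local hash map.
function segmentsInBounds(
  lat0: number, lng0: number, lat1: number, lng1: number,
  cache: Map<string, Segment[]>
): Segment[] {
  const results: Segment[] = [];
  for (let la = Math.trunc(lat0 * 1000); la <= Math.trunc(lat1 * 1000); la++) {
    for (let ln = Math.trunc(lng0 * 1000); ln <= Math.trunc(lng1 * 1000); ln++) {
      const key = encodeCoord(la / 1000) + encodeCoord(ln / 1000); // hundred-meter index
      results.push(...(cache.get(key) ?? []));
    }
  }
  return results;
}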

Data Transport

The above approach is actually very useful for compressing JSON data for transport. Consider the following serialized segment:

{
  "id": 123, 
  "start_lat": 38.79912, 
  "start_lng": -122.41425, 
  "end_lat": 38.79923, 
  "end_lng": -122.41428, 
  "percent_grade": 8.52
}
// 115 bytes minimum

If we apply the above compression strategy, we can render this string with substantially fewer bytes, as follows:

{
  "id": 123, 
  "start": "aArzpJuImF", 
  "end": "aAryqJuIna", 
  "percent_grade": 8.52
}
// ~70 bytes minimum
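Sketching the serializer that produces this shape, reusing encodeCoord and the placeholder Segment interface from the key-computation section:

// Replace the four stringified floats with two ten-character Base64 keys.
function compactSegment(s: Segment): string {
  return JSON.stringify({
    id: s.id,
    start: encodeCoord(s.startLat) + encodeCoord(s.startLng),
    end: encodeCoord(s.endLat) + encodeCoord(s.endLng),
    percent_grade: s.percentGrade,
  });
}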

If we’re really squeezing the life out of it, we can render the percentGrade value as an 8-bit signed int in hexadecimal, but that really only saves a couple bytes per segment.

Overkill?

If we’re really keen on trimming everything down to the bare metal, we can represent a segment purely as a Base64 string, as long as we pack the data into 6-bit boundaries. That might look something like this:

0-29: start latitude (30-bit)
30-59: start longitude (30-bit)
60-89: end latitude (30-bit)
90-119: end longitude (30-bit)
120-179: guid (60-bit)
180-185: percent grade (6-bit)

This results in 31-character Base64 values, which can either be concatenated or provided as array elements, like this:

["aArzpJuImFaAryqJuInaaaaaaaaaGcN", ... ]
// 33 bytes fixed

Observant readers will note there is another beneficial side effect of this strategy. The ten-character prefix of this string is the one-meter key. Built-in index!
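Here is a sketch of that packing using BigInt so the 186-bit value stays exact. The Base64 alphabet, treating the id as fitting in 60 bits, and rounding the grade to a non-negative 6-bit integer are assumptions here; fields are placed so the emitted character order follows the layout above.

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// 30-bit sign-magnitude coordinate field, as in the key computation above.
function coordField(value: number): bigint {
  const scaled = BigInt(Math.round(Math.abs(value) * 100000));
  return (value < 0 ? 1n << 29n : 0n) | scaled;
}

// Pack one segment into 186 bits (31 Base64 characters): start lat/lng
// first, then end lat/lng, a 60-bit guid, and a 6-bit percent grade.
function packSegment(
  startLat: number, startLng: number,
  endLat: number, endLng: number,
  id: bigint,            // assumed to fit in 60 bits
  percentGrade: number   // assumed non-negative; rounded into 6 bits
): string {
  const bits =
    (coordField(startLat) << 156n) |
    (coordField(startLng) << 126n) |
    (coordField(endLat)   <<  96n) |
    (coordField(endLng)   <<  66n) |
    ((id & ((1n << 60n) - 1n)) << 6n) |
    (BigInt(Math.round(percentGrade)) & 0x3Fn);

  let out = "";
  for (let shift = 180n; shift >= 0n; shift -= 6n) {
    out += B64[Number((bits >> shift) & 0x3Fn)];
  }
  return out;
}

const packed = packSegment(38.79912, -122.41425, 38.79923, -122.41428, 123n, 8.52);
console.log(packed.length);       // 31
console.log(packed.slice(0, 10)); // the start point's one-meter key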

At this point, we barely save anything by abandoning JSON in favor of a pure string response. Also, the concatenated string approach is more susceptible to silent in-transit corruption, whereas a corrupted JSON response would simply fail to parse.

All told, for an average cell with 1000 segments, we save 70%. That’s the difference between 115KB and 35KB. Bear in mind, this is the response body size. To fetch all segments for the city of San Francisco, we need at least 100 requests (max 1000 per request). The total response size is 3.5MB, compared to the uncompressed 11.5MB. This has a measurable effect on mobile device bandwidth.

 

