This Month We Learned — February 2021

This Month We Learned — February 2021

This Month We Learned gives us an opportunity to share short stories about what we’ve learned and to highlight the ways we’ve grown and learned both in our jobs and outside of them. In February, we learned more about PostgreSQL indexing. Casey Cesari learned about speeding up fuzzy match searches with tri-gram indexes, and James Santucci learned how much he didn’t know about jsonb indexes.

Faster fuzzy searching in Postgresql with trigram indexes

Casey Cesari

A recent project we were working on required that we support fuzzy matching. Fuzzy matching in PostgreSQL using the LIKE/ILIKE operator, along with the wildcard character %, allows users to search for substrings within a string column. For example, in a database table of place names, the following query would return both “Copenhagen” and “The Hague”:

SELECT * FROM places WHERE name ILIKE ‘%hag%’;

While this query type is handy, Postgres has to search on the fly within the queried column to see if it can find that three character sequence. The database table in our case consisted of over 115 million records, which can lead this type of query to take several minutes to finish. Also, the default index type (B-Tree) doesn’t help here. Postgres can’t use it due to the leading % wildcard character because B-Tree indexes work by using the leftmost value in the indexed column to search the index. To confirm whether or not Postgres will use an index for a query, you can use EXPLAIN to see the query plan.

Thankfully, a trigram index can help speed up this type of query. A trigram is a sliding window of up to three characters. For example here is a trigram of “Copenhagen” generated by postgres:

psql=# select show_trgm('Copenhagen');

                      show_trgm                      
-----------------------------------------------------
 {"  c"," co",age,cop,"en ",enh,gen,hag,nha,ope,pen}

We can create a trigram index with the following statement:

CREATE INDEX trigram_index ON cities USING gin (city_name gin_trgm_ops);

(There are two types of trigram indexes. More on that here.)

Postgres can now compare the search query against the trigram entries in the index directly instead of having to do a substring search on the fly. With this index, doing a LIKE/ILIKE query in our database took a few seconds instead of a few minutes. For our use case, this was fast enough without any other optimizations.

Stubbing my toes repeatedly on JSONB indexes

James Santucci

Franklin is our implementation of the STAC API specification. STAC is a standard for organizing information stored in JSON about spatio-temporal data. It is open-ended, so anyone can create new extensions that change the set of fields that are available on any of the entity types.

Because STAC describes open-ended JSON data, we throw collections and items into nice jsonb columns in Postgres:

franklin=# \d collection_items
                                            Table "public.collection_items"
     Column     |           Type           | Collation | Nullable |                       Default
----------------+--------------------------+-----------+----------+-----------------------------------------------------
 id             | text                     |           | not null |
 geom           | geometry(Geometry,4326)  |           | not null |
 item           | jsonb                    |           | not null |
...
Indexes:
    ...
    "collection_items_properties_idx" gin ((item -> 'properties'::text))

And because we’re responsible and don’t want queries to take centuries, we have a nice GIN index to make our queries go fast. “Nice,” you might think, as I once did, “I bet all their problems were solved.”

This month I learned that all our problems were not solved.

The first problem was that different operators can use different kinds of indexes. In short, if you use the @> operator directly on the column that you’ve indexed, your index is great, but if you compare with a value retried from -> or ->> you can’t use your index. This query, for example, works great:

select item from collection_items where item -> 'properties' @> '{"cloud_cover": 10}';

Trying to compare with cloud_cover in particular can’t use the index though, so this query is slow:

select item from collection_items where (item -> 'properties' ->> 'cloud_cover') :: int = 10;

If you run almost the same query but from one level removed from the index, you’re also going to have a bad time. This query is applied to item instead of to item -> properties and can’t use the index, so for large collection_items tables it’s really slow.

select item from collection_items where item @> '{"properties": {"cloud_cover": 10}}';

You can run into similar problems with array values. For example, this query can use the index to find items with label:classes in their properties that happen to have a particular label class available:

select item from collection_items
where item -> 'properties' @> '{"label:classes": [{"name": "labels", "classes": ["example class"]}]}';

But if you instead check containment on the array in particular, you can’t use the index, it’s again very slow, and you’ll again be very sad:

select item where item -> 'properties' -> 'label:classes' @> '[{"name": "labels", "classes": ["example class"]}]';

Note that none of this handles anything other than equality. For example, if you want to search for items with cloud_cover < 10, you’re stuck unless you separately index cloud_cover.

This experience led us to reconsider how we should expose indexing in Franklin. We can’t know in advance which fields users will want to search, which they’ll want to search with non-equality comparison, or even which fields will exist. At the same time, “index everything” probably isn’t a great idea, since we don’t even know what “everything” means here and such an index would be huge.

For now, we’ve separately indexed space and time attributes. In a future month, I’ll learn how safely to expose an idea of “fields I care about” to end users.