Saturday, October 28, 2017

Drinking Games with PostgreSQL: GIN and RUM


Fulltext support is a very important feature of most modern relational databases. For one, they enable fast information retrieval and for another, they simply allow application designers and operators to remain inside the known relational world. No need for a second database-like system, no need for additional maintenance, no need for yet another library, no need for a fundamentally different query language. You get the point.

Starting with version 8.3, PostgreSQL supports the tsvector and tsquery constructs and with it indexing support for the matching operator. This post will cover more of those indexing technology which we can utilize for accelerating regular expressions, LIKE-queries, fulltext and even JSON-related queries.
Juniper-Flavored Spirit

GIN stands for Generic Inverted Index and is the de facto standard index since PostgreSQL 8.3 for indexing tsvectors thus accelerating subsequent tsqueries.

Here is how it works:
  1. a document (basically text) is split into normalized words
  2. GIN is, simply speaking, a BTree of words mapping to document IDs
With this strategy, it is possible to index a lot of documents and retrieve them efficiently via the matching operator @@. If you want to read about it in detail, have a look at here and here.

In recent years, fulltext has become more and more important. So GIN has seen e.g. performance improvements in PostgreSQL 9.4. Reducing the index size and speeding-up multi-key lookups (rare + frequent) did the trick to accelerate the @@ operator even more for larger amounts of documents.

I can recommend this resource and this one, if you need more good reads.

BUT there's still room for improvement, and that's where more beverages come into play.

Distilled Alcoholic Beverage

It seems like the folks over at Postgres Professional have a preference for alcoholic beverages (please mind the vodka at the end of the presentation referred above). Good beverages appear to help pushing boundaries even further and build a brand-new index infrastructure. They called it:

RUM

Why do we need yet another index implementation? The following list shows the items missing efficient/general indexing support:
  • phrase operator (<->, <n>)
  • ranking (ts_rank, ts_rank_cd)
  • inverse fulltext search
  • inverse regular expression
  • position in JSON arrays
I didn't find out whether or not RUM is a acronym, but the basic idea is to store additional data among the usual GIN data [1][2]. This allows associating ordering information for efficient ranking, distance determination or positioning in JSON arrays. This approach increases the size of the index marginally but speeds up real-world queries several orders of magnitude.

In order to leverage RUM best, they also gave birth to an new ranking function: ts_score which is a middleground between the well-known ranking functions ts_rank and ts_rank_cd.
The slides are pretty clear about the shortcomings of those functions [1][2]:
  • ts_rank doesn't supports logical operators
  • ts_rank_cd works poorly with OR queries

I don't know completely how they plan to integrate this into PostgreSQL trunk if at all, but personally, I would love to see these features in PostgreSQL 11. They solve many real-world use-cases. So, keep up the good work!

Btw. if you want to give them a hand or review, stop by at their GitHub repositoy.

Enjoy the drinks 😉,
Sven

[1] https://pgconf.ru/media/2017/04/03/20170316H3_Korotkov-rum.pdf
[2] http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum.pdf

No comments:

Post a Comment