Fulltext support is a very important feature of most modern relational databases. For one, they enable fast information retrieval and for another, they simply allow application designers and operators to remain inside the known relational world. No need for a second database-like system, no need for additional maintenance, no need for yet another library, no need for a fundamentally different query language. You get the point.
Starting with version 8.3, PostgreSQL supports the tsvector and tsquery constructs and with it indexing support for the matching operator. This post will cover more of those indexing technology which we can utilize for accelerating regular expressions, LIKE-queries, fulltext and even JSON-related queries.
Juniper-Flavored Spirit
GIN stands for Generic Inverted Index and is the de facto standard index since PostgreSQL 8.3 for indexing tsvectors thus accelerating subsequent tsqueries.
Here is how it works:
- a document (basically text) is split into normalized words
- GIN is, simply speaking, a BTree of words mapping to document IDs
In recent years, fulltext has become more and more important. So GIN has seen e.g. performance improvements in PostgreSQL 9.4. Reducing the index size and speeding-up multi-key lookups (rare + frequent) did the trick to accelerate the @@ operator even more for larger amounts of documents.
I can recommend this resource and this one, if you need more good reads.
BUT there's still room for improvement, and that's where more beverages come into play.
Distilled Alcoholic Beverage
It seems like the folks over at Postgres Professional have a preference for alcoholic beverages (please mind the vodka at the end of the presentation referred above). Good beverages appear to help pushing boundaries even further and build a brand-new index infrastructure. They called it:
RUM
Why do we need yet another index implementation? The following list shows the items missing efficient/general indexing support:
- phrase operator (<->, <n>)
- ranking (ts_rank, ts_rank_cd)
- inverse fulltext search
- inverse regular expression
- position in JSON arrays
In order to leverage RUM best, they also gave birth to an new ranking function: ts_score which is a middleground between the well-known ranking functions ts_rank and ts_rank_cd.
The slides are pretty clear about the shortcomings of those functions [1][2]:
- ts_rank doesn't supports logical operators
- ts_rank_cd works poorly with OR queries
I don't know completely how they plan to integrate this into PostgreSQL trunk if at all, but personally, I would love to see these features in PostgreSQL 11. They solve many real-world use-cases. So, keep up the good work!
Btw. if you want to give them a hand or review, stop by at their GitHub repositoy.
Enjoy the drinks 😉,
Sven
[1] https://pgconf.ru/media/2017/04/03/20170316H3_Korotkov-rum.pdf
[2] http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum.pdf
No comments:
Post a Comment