赞
踩
Did you know PostgreSQL supports indexing a subset of your table? This enables very fast reads from that subset with almost no index overhead. It’s often the best way to index your data if you want to repeatedly analyze rows that match a given WHERE clause. This makes PostgreSQL a great fit for workflows that involve pre-aggregation with additional ad hoc analysis. In this post, I’ll walk through an example query optimization in which partial indexes are well suited.
Consider a table of events with the following schema:
CREATE TABLE event (
user_id BIGINT,
event_id BIGINT,
time BIGINT NOT NULL,
data JSON NOT NULL,
PRIMARY KEY (user_id, event_id)
)
Each event is associated with a user and has an ID, a timestamp, and a JSON representation of the event. The JSON includes the page path, the event type (e.g. click, page view, form submission), and any other properties that describe the event.
We can use this table to store many different kinds of events, and let’s assume we have some automatic event tracking in place that logs every click, page view, and form submission so that we can analyze it later. We might want to have an internal dashboard that displays some high-value metrics, such as the number of weekly signups or the amount of revenue we are collecting per day. The events that are relevant to this dashboard make up a small fraction of this table — a very small percentage of clicks on your website are to finalize purchases! But they are mixed in with the rest of the table, so our signal-to-noise ratio is low.
We would like to index our data to make these dashboard queries fast.[1] Let’s start with a signup event, which we define to be a form submission on our /signup/ page. Getting the number of such signups in the first week of September translates to:
SELECT COUNT(*)
FROM event
WHERE
(data->>'type') = 'submit' AND
(data->>'path') = '/signup/' AND
time BETWEEN 1409554800000 AND 1410159600000
On a synthetic dataset with 10 million events, of which 3000 are signups, and without any indexes, this query takes 45 seconds.
A naive way to improve this performance is by creating single-column indexes for each of the relevant event features: (data->>‘type’), (data->>‘path’), and time. We can use a bitmap join between results from three indexed scans, which should be fast if the query is selective and the relevant index portions are in memory. Indeed, with these indexes in place, this query takes 200 ms initially, and 20 ms in subsequent runs on our synthetic dataset — a significant improvement over the 45 seconds required by a sequential scan.
But this indexing strategy has a few significant downsides:
We’re only analyzing the 0.03% of the table that constitutes signup events, but this strategy indexes all the rows. We want to be able to run efficient queries over a tiny fraction of a table. In a case like this, a partial index is going to give us the best results.
If we index an unrelated column and restrict our index to events that match our signup definition, PostgreSQL can easily determine where the signup event rows are, and we can query over them even more efficiently than with three complete indexes on the relevant fields. In particular, consider indexing the time field, but only on the rows that match the filter for signup events. That is:
CREATE INDEX event_signups ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'
Using this index, our test query runs in 200ms initially and 2ms on subsequent runs, so there’s a performance improvement to be had if we are running this query often. More importantly, the partial index addresses all three disadvantages of the triple-index approach described above.
Another approach I’ve seen is to attempt to index the boolean expression
(data->>'type') = 'submit' AND (data->>'path') = '/signup/'
directly, with a second term for time. I.e., something like:
CREATE INDEX event_signup_time ON event
(((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time)
This is worse than either of the approaches above, as it turns out PostgreSQL’s query planner won’t figure out that our example query restricts our result to rows for which the first term in the index is true. That is, the planner doesn’t know that the WHERE clause:
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'
is equivalent to saying that the
((data->>'type') = 'submit' AND (data->>'path') = '/signup/')
field of our index must be true. So it uses this index to restrict the time range of events in our scan, but it reads events for which the boolean expression is true as well as the events for which it is false, and then checks that condition for each row after loading it.[4]
So, we’re reading a lot more rows from disk than necessary, and also executing a nontrivial check for each one. On our synthetic dataset, this query takes 25 seconds for the first run and 8 seconds on successive executions. Under the hood, this is slightly worse than just indexing on the time field, and it performs comparably to doing just that.
Partial indexes are a powerful way to precompute the subset of a table that matches a predicate. Judging by traffic in the #postgresql IRC, they are considerably underused. Compared to full indexes, they allow a greater range of predicates. They can be dramatically more lightweight, requiring fewer writes and less disk space, especially for highly selective filters. Partial indexes should be your default strategy if you’re repeatedly querying a table for a small slice of rows.
Love PostgreSQL? Know of a feature the world should know more about? Ping me @danlovesproofs.
Interested in building systems that make powerful technology easy to use? Shoot us a note at jobs@heapanalytics.com.
[1] We might deal with this kind of a situation with table partitioning, in which high-value events and low-value events can live in separate child tables, but this approach is going to be unwieldy if there are many different kinds of high-value events, and we would need to repartition the table every time we wanted to add a new type of high-value event.
[2] We may get some of the updates for free via the heap-only tuples optimization, but, at least, every INSERT and DELETE will require writing to all three indexes.
[3] We could index all three fields in one multi-column index, e.g. on ((data->>‘type’), (data->>‘path’), time). That would take up 755 mb, a 26% savings, but this is still large, and the other drawbacks still apply. What’s more, this index would be less applicable to other queries we might want to run on this data, so, if we’re supporting a few different kinds of high-value events, this might not end up saving us any space.
[4] The relevant query plan:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Aggregate (cost=820866.05..820866.06 rows=1 width=0)
-> Index Scan using event_signup_time on event (cost=0.43..820865.99 rows=27 width=0)
Index Cond: (("time" >= 1409554800000::bigint) AND ("time" <= 1410159600000::bigint))
Filter: (((data ->> 'type'::text) = 'submit'::text) AND ((data ->> 'path'::text) = '/signup/'::text))
(4 rows)
原文地址:http://blog.heapanalytics.com/speeding-up-postgresql-queries-with-partial-indexes/
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。