-
Notifications
You must be signed in to change notification settings - Fork 58
int/bigint instead of timestamps #15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
On Mon, Apr 3, 2017 at 12:36 PM, Nikolay ***@***.***> wrote:
As we discussed at the recent Moscow meetup, for many cases it'd be very
good to be able to use RUM to keep int/bigint information and speed up
queries like
SELECT ... WHERE tsvector @@ ... ORDER BY price
Similar request from the -general: http://www.postgresql-archive.
org/RUM-index-and-support-for-storing-BIGINT-as-part-of-a-
tsvector-timestamp-tt5938725.html#a5942457
yes, it's possible. We probably will implement something like we did for
btree_gin - generate a bunch of opclasses for all built-in data types.
… —
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#15>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGFI4suxoz0xPaHOo1CZnLN4zarOwGyCks5rsSAJgaJpZM4MxyhH>
.
|
And if rum keeps int/bigint information would queries like |
I'm so glad to see that this will be addressed! Note that what I asked about in the referred email was not BIGINT instead of TIMESTAMP, but in addition to. Will storing TSVECTOR and both BIGINT and TIMESTAMP (for ordering) in the same RUM-index be possible, so we can have this:
and have it use one index only? Thanks. |
@andreak really good example, I also see lots of similar cases |
I'm not quite sure where to post questions about RUM so I've posted another question to -general: |
I think both ways are appropriate. Oleg Bartunov and me are monitoring -general. Also you can create new issues here. |
Ah, good to know, thanks. |
Hi!
I checked your example and what I want to say:
ANY clause is only supported by bitmap scan. Obviously, bitmap scan cannot
return data in order desired by ORDER BY clause - sort should be performed.
If you don't use ANY clause postgres will be able execute all conditions in index:
EXPLAIN (analyze on, COSTS OFF)
SELECT
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*') AND del.folder_id =
44965::bigint
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
;
PLAN
----------------------------------------------------------------------------------------------------
Index Scan using rum_idx on email_delivery del (actual time=0.919..0.921
rows=1 loops=1)
Index Cond: ((fts_all @@ '''andre'':* & ''jose'':*'::tsquery) AND (folder_id
= '44965'::bigint))
Order By: (received_timestamp <=> '3000-01-01 00:00:00'::timestamp without
time zone)
But notice, use explicite cast of numeric constant to type of folder_id. In
future it could be fixed by adding set of comparison operations with different
types on left and right side to RUM's opclasses.
Next example:
EXPLAIN (analyze on, COSTS OFF)
SELECT
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM email_delivery del
WHERE del.folder_id in ( 44965::bigint, 2470520)
;
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on email_delivery del (actual time=0.208..0.210 rows=1 loops=1)
Recheck Cond: (folder_id = ANY ('{44965,2470520}'::bigint[]))
Heap Blocks: exact=1
-> Bitmap Index Scan on rum_idx (actual time=0.186..0.186 rows=1 loops=1)
Index Cond: (folder_id = ANY ('{44965,2470520}'::bigint[]))
Work as expected, but it's easy to prove that actually it does two index scans
in one bitmap. So, it works, but:
set enable_indexscan=off;
EXPLAIN (analyze on, COSTS OFF)
SELECT
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*') AND del.folder_id
in ( 44965::bigint, 2470520)
;
WARNING: rumbeginscan 1:0
WARNING: rumrescan 1
QUERY PLAN
----------------------------------------------------------------------------------
Bitmap Heap Scan on email_delivery del (actual time=0.108..0.110 rows=1 loops=1)
Recheck Cond: (fts_all @@ '''andre'':* & ''jose'':*'::tsquery)
Filter: (folder_id = ANY ('{44965,2470520}'::bigint[]))
Heap Blocks: exact=1
-> Bitmap Index Scan on rum_idx (actual time=0.080..0.080 rows=1 loops=1)
Index Cond: (fts_all @@ '''andre'':* & ''jose'':*'::tsquery)
Here I'm not sure why postgres does not push ANY clause to index scan but I
suppose it thinks that it's too expensive to do two index scans with fts clause.
Andreas Joseph Krogh wrote:
… I'm not quite sure where to post questions about RUM so I've posted another
question to -general:
http://www.postgresql-archive.org/How-to-include-BIGINT-column-in-RUM-index-sorted-by-timestamp-td5961689.html
Is posting to -general the "correct" way to ask questions about RUM? (creating
new issues here on github seems strange)
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#15 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAz9vN0sMZQvmyOiSYzTOx_FMjblegE3ks5r6HGQgaJpZM4MxyhH>.
--
Teodor Sigaev E-mail: [email protected]
WWW: http://www.sigaev.ru/
|
Hi Feodor. Thanks for the hint about the extra casting, now it works at least with equals on folder_id, as you said: EXPLAIN ANALYZE
SELECT
del.entity_id,
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
Sadly, it doesn't seem to be faster than GIN with Sort-step. create index origo_email_delivery_fts_all_folder_idx ON origo_email_delivery using gin (fts_all, folder_id)
EXPLAIN ANALYZE
SELECT
del.entity_id,
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp DESC
Any comments on why GIN is faster? |
EXPLAIN ANALYZE
SELECT
del.entity_id,
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
Sadly, it doesn't seem to be faster than GIN with Sort-step.
I haven't time to deep check today, but, seems, it's because of prefix search in
tsquery. Pls, compare without it. When you use prefix search in tsquery, RUM
should combine results from several lists (andre, andres, andreas etc) and it
could not merge results in correct order. Prefix search also should remove
duplicates from result: document could contain more than one matching word. So,
RUM actually does sort internally and then removes duplicates. GIN could be
faster in this case because bitmap is for effective way for removing than sort
and unique.
Will look closer ASAP.
…--
Teodor Sigaev E-mail: [email protected]
WWW: http://www.sigaev.ru/
|
Hm, this query (without prefix search) runs forever (using pg-10-master as of 2df537e43fdc432cccbe64de166ac97363cbca3c and RUM as of 9ac931a): EXPLAIN ANALYZE
SELECT
del.entity_id,
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andreas&joseph')
AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10 Taking 100% CPU. How shuold I go about debugging this? |
Andreas Joseph Krogh wrote:
Hm, this query (without prefix search) runs forever (using pg-10-master as of
Huh, surprise...
How shuold I go about debugging this?
pls, provide several backtraces with gdb
…--
Teodor Sigaev E-mail: [email protected]
WWW: http://www.sigaev.ru/
|
Does this help?
...wait a bit...
...wait a bit more...
...wait a bit more...
...wait a bit more...
Hope this helps |
Thank you for the information! |
Works perfectly, thanks! |
To answer Teodor's question; EXPLAIN ANALYZE
SELECT
del.entity_id,
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andre:*&jose:*')
AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
without prefix: EXPLAIN ANALYZE
SELECT
del.entity_id,
del.received_timestamp,
del.received_timestamp <=> '3000-01-01' :: TIMESTAMP,
del.folder_id
FROM origo_email_delivery del
WHERE del.fts_all @@ to_tsquery('simple', 'andreas&joseph')
AND del.folder_id = 2470520::BIGINT
ORDER BY del.received_timestamp <=> '3000-01-01' :: TIMESTAMP
LIMIT 10
The question now of course is will it ever be possible, theoretically, for the prefix-variant to be faster, or is there some fundamental limit or logic-barrier preventing it from ever will be? |
Great! About your question. It is because of algorithm of scanning by prefix query. Maybe it could be improved. We need to investigate it. |
Thanks. I think this issue can be closed now, and any other issues/questions I might have can be asked in new issues? |
You are right. It is better to create new issues. |
Actually, the initial issue from NikolayS doesn't fixed. Reopen. |
So is it possible now to filter with @@ operator and ORDER BY some integer/int8 column and have only single index scan for that? |
Thanks:) |
Yes, it will be possible in soon :) |
@NikolayS , actually you can use the following query. create index on tab1 using rum (tsv rum_tsvector_addon_ops, price)
with (attach=price, to=tsv); you can execute the following query: select ... from tab1 where tsv @@ ...
order by price <=> 10000::money limit 10; It will use index scan. Some distance operator (for example, <=>) should be used to use index scan. Otherwise, bitmap scan will be used. |
Hi. Will this be re-worked in future releases so one can write |
Isn't this issue actually fixed now, using rum_tsvector_addon_ops? |
Yes, you can use rum_tsvector_addon_ops now. Thank you for noticing. |
ok, thank you, but could you please elaborate how it's supposed to work?
– I asked only for top 10 records, but rum idx scan takes all 50k records with word "cow" ( Without
Should it be so? Is it really different from what it was before? |
Oh. Understood :-)
-- looks good! |
Unfortunately, ordering within index can be performed only using some value to compare. Please try out the query: $ select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 1000000 limit 10;
$ select * from rum where tsvector @@ plainto_tsquery('simple', 'cow') order by id <=> 0 limit 10; Depending on the compared value results a different. |
Yep :) |
One more question. In my tests, 'asc' ordering ( Any ways to improve it somehow? (using different index definition or something) |
In your second query
Use maximum integer value instead. |
I suppose maximum value is 2147483647 |
Okay, used 10^5, but it didn't help to eliminate the difference between "asc" and "desc":
How can I get "asc" speed for "desc" (say, if I need desc in most cases)? |
oh, when I do |
Oh. This is my usual mistake – I shouldn't use
-- "asc" is almost 2x faster than "desc" (18ms vs 31ms). Is it possible to have the faster "desc"? |
I understood what's happening. The exponentiation operator returns result in double or numeric type:
So you need to cast the result $ explain (analyze) select * from rum where tsvector @@ 'cow'::tsquery order by id <=> (10^6)::int limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12.00..13.43 rows=10 width=56) (actual time=21.027..21.031 rows=10 loops=1)
-> Index Scan using rum_tsvector_id_idx on rum (cost=12.00..7171.81 rows=50125 width=56) (actual time=21.024..21.028 rows=10 loops=1)
Index Cond: (tsvector @@ '''cow'''::tsquery)
Order By: (id <=> 1000000)
Planning time: 0.181 ms
Execution time: 21.274 ms About |
Yes;
|
Yeah, yeah, I always do that mistake. For years. Никогда такого не было и вот опять. But what is more interesting/relevant here:
|
Good questions, but don't hijack this issue. Start new separate issues for those. |
No-no, this is my issue and those questions are both belong to it for sure :)) Repo owners can move them to other places if they think it's better. |
I think nobody likes never-ending issues due to new stuff creaping in. |
@andreak it's all offtopic and provocative. Having all information in one single place regarding on particular topic can be very helpful and might save time. It's up to the repo owner how to manage issues – it depends on concrete process inside the particular dev team. Don't pretend that your practices are the only possible and right choice, it's not a good citizen's behavior. Consider this matter as closed, leave these questions to repo owners and let's focus on tech stuff. |
My point was; let's keep this issue closed. Glad we agree on that:-) |
As we discussed at the recent Moscow meetup, for many cases it'd be very good to be able to use RUM to keep int/bigint information and speed up queries like
Similar request from the -general: http://www.postgresql-archive.org/RUM-index-and-support-for-storing-BIGINT-as-part-of-a-tsvector-timestamp-tt5938725.html#a5942457
The text was updated successfully, but these errors were encountered: