lucene-issues mailing list archives

From "Michael Sokolov (Jira)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
Date Thu, 14 Nov 2019 16:35:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974410#comment-16974410
] 

Michael Sokolov commented on LUCENE-8920:
-----------------------------------------

I tested with the previous version of this patch, and yes, I also believe it preserves
back-compat, since the old arc encoding is read as before, but there is no automated
testing to verify this. It would be wise to do some manual spot-checking. We could, e.g.,
build an "old" index with luceneutil and then run its tests against that index after
upgrading the code. Any test that runs on an existing index should do; is there a more
convenient one?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>             Fix For: 8.4
>
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas
> were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size
> increase we're seeing while building (or perhaps do a preliminary pass before building) in
> order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I noticed that
> arc metadata is pretty large in some cases (in the 10-20 byte range), which makes gaps very
> costly. Associating each label with a dense id and having an intermediate lookup, i.e. lookup
> label -> id and then id -> arc offset instead of doing label -> arc directly, could save a lot
> of space in some cases? Also it seems that we are repeating the label in the arc metadata
> when array-with-gaps is used, even though it shouldn't be necessary since the label is
> implicit from the address?
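
For readers following the second suggestion quoted above, here is a minimal sketch of the
label -> id -> arc offset indirection. It is not Lucene's FST code: the class name, the
2-byte id slots, the -1 "missing" marker and the size helpers are all assumptions made for
illustration, and labels are assumed to be non-negative, sorted and distinct.

```java
// Illustrative only: NOT Lucene's FST implementation. Every name here is hypothetical.
final class DenseIdArcTable {
    private static final short MISSING = -1;

    private final int firstLabel;     // smallest label on this node
    private final short[] labelToId;  // one small slot per label in [firstLabel, lastLabel]
    private final long[] arcOffsets;  // one entry per arc that actually exists

    /** Assumes sortedLabels is non-empty and strictly increasing; arcOffsets[i] pairs with sortedLabels[i]. */
    DenseIdArcTable(int[] sortedLabels, long[] arcOffsets) {
        if (sortedLabels.length == 0 || sortedLabels.length != arcOffsets.length) {
            throw new IllegalArgumentException("labels and offsets must be non-empty and of equal length");
        }
        if (sortedLabels.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException("too many arcs for 2-byte ids");
        }
        this.firstLabel = sortedLabels[0];
        int range = sortedLabels[sortedLabels.length - 1] - firstLabel + 1;
        this.labelToId = new short[range];
        java.util.Arrays.fill(labelToId, MISSING);
        for (int id = 0; id < sortedLabels.length; id++) {
            labelToId[sortedLabels[id] - firstLabel] = (short) id;
        }
        this.arcOffsets = arcOffsets.clone();
    }

    /** Two-step lookup: label -> dense id, then id -> arc offset. Returns -1 if there is no such arc. */
    long find(int label) {
        int slot = label - firstLabel;
        if (slot < 0 || slot >= labelToId.length) {
            return -1;
        }
        short id = labelToId[slot];
        return id == MISSING ? -1 : arcOffsets[id];
    }

    /** Bytes used when every label in the range gets a full arc-sized slot (array with gaps). */
    static long directBytes(int[] sortedLabels, int arcBytes) {
        long range = (long) sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
        return range * arcBytes;
    }

    /** Bytes used with a 2-byte id slot per label in the range plus one arc per existing label. */
    static long indirectBytes(int[] sortedLabels, int arcBytes) {
        long range = (long) sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
        return range * Short.BYTES + (long) sortedLabels.length * arcBytes;
    }
}
```

With three arcs labelled 'a', 'e' and 'z' (97, 101, 122) and a hypothetical 16 bytes of
metadata per arc, the gapped array needs 26 * 16 = 416 bytes, while the indirection needs
26 * 2 + 3 * 16 = 100 bytes. The saving shrinks as the label range fills up, so a per-FST
size check like the one in the first suggestion would still be needed to pick an encoding.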



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

