[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973993#comment-16973993 ]
Adrien Grand commented on LUCENE-8920:
--------------------------------------
This gives a nice bump on the PKLookup task http://people.apache.org/~mikemccand/lucenebench/PKLookup.html.
Almost on par with the previous direct addressing patch.
> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Sokolov
> Priority: Minor
> Attachments: TestTermsDictRamBytesUsed.java
>
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas
> were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size
> increase we're seeing while building (or perhaps do a preliminary pass before building) in
> order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I noticed that
> arc metadata is pretty large in some cases (in the 10-20 byte range), which makes gaps very
> costly. Associating each label with a dense id and having an intermediate lookup, i.e. lookup
> label -> id and then id -> arc offset instead of doing label -> arc directly, could save a lot
> of space in some cases? Also it seems that we are repeating the label in the arc metadata
> when array-with-gaps is used, even though it shouldn't be necessary since the label is
> implicit from the address?
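
The label -> id -> arc offset indirection suggested in the second quote can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene's FST code: the class DenseArcLookup, its methods, and the byte-size estimates are all hypothetical names introduced here, and the labels passed in are assumed to be non-empty and sorted in ascending order.

```java
import java.util.Arrays;

/** Hypothetical sketch of a two-level arc lookup; not part of Lucene. */
class DenseArcLookup {
    private final int firstLabel;     // smallest label leaving this node
    private final int[] labelToId;    // (label - firstLabel) -> dense id, or -1 for a gap
    private final long[] arcOffsets;  // dense id -> offset of the arc metadata

    /** Assumes sortedLabels is non-empty, ascending, and parallel to offsets. */
    DenseArcLookup(int[] sortedLabels, long[] offsets) {
        firstLabel = sortedLabels[0];
        int range = sortedLabels[sortedLabels.length - 1] - firstLabel + 1;
        labelToId = new int[range];
        Arrays.fill(labelToId, -1);
        for (int id = 0; id < sortedLabels.length; id++) {
            labelToId[sortedLabels[id] - firstLabel] = id;
        }
        arcOffsets = offsets.clone();
    }

    /** Returns the arc offset for the label, or -1 if the node has no such arc. */
    long arcOffset(int label) {
        int slot = label - firstLabel;
        if (slot < 0 || slot >= labelToId.length) {
            return -1;
        }
        int id = labelToId[slot];
        return id < 0 ? -1 : arcOffsets[id];
    }

    /** Rough cost of label -> arc directly: every slot in the range, gaps included, holds full arc metadata. */
    static long directBytes(int labelRange, int bytesPerArc) {
        return (long) labelRange * bytesPerArc;
    }

    /** Rough cost with the indirection: a small id per slot, full metadata only for real arcs. */
    static long indirectBytes(int labelRange, int numArcs, int bytesPerArc, int bytesPerId) {
        return (long) labelRange * bytesPerId + (long) numArcs * bytesPerArc;
    }
}
```

With made-up numbers (3 arcs spread over a 26-label range, 16 bytes of metadata per arc, 1 byte per id), directBytes(26, 16) gives 416 while indirectBytes(26, 3, 16, 1) gives 74, which shows why gaps become cheaper once they no longer carry full arc metadata. The same two estimates could also feed the per-FST decision from the first quote on whether to apply the encoding at all.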