lucene-issues mailing list archives

From "Bruno Roustant (Jira)" <>
Subject [jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
Date Thu, 14 Nov 2019 16:33:00 GMT


Bruno Roustant commented on LUCENE-8920:

{quote}There were a few conflicts when backporting to branch_8x, so you might want to take
a second look{quote}
I also verified branch_8x, and it looks good to me.

I created a follow-up issue for removing cachedRootArcs: LUCENE-9049.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>                 Key: LUCENE-8920
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>             Fix For: 8.4
>         Attachments:
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas
> were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size
> increase we're seeing while building (or perhaps do a preliminary pass before building) in
> order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I noticed that
> arc metadata is pretty large in some cases (10-20 bytes), which makes gaps very costly.
> Associating each label with a dense id and having an intermediate lookup, i.e. lookup label
> -> id and then id -> arc offset instead of doing label -> arc directly, could save a lot
> of space in some cases? Also it seems that we are repeating the label in the arc metadata
> when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit
> from the address?
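
To make the trade-off in the two quoted suggestions concrete, here is a rough sizing sketch in Java. It is not Lucene code: the class ArcLayoutEstimator, its method names, the fixed bytesPerArc slot size, the one-entry-per-label id table and the maxExpansion threshold are all assumptions made for illustration. It compares a packed arc list, the array-with-gaps (direct-addressing) layout and the suggested label -> id -> arc indirection, and shows one possible per-node check for deciding whether to apply the encoding.

```java
/** Hypothetical sizing helper for illustration only; not part of Lucene. */
final class ArcLayoutEstimator {

    private ArcLayoutEstimator() {}

    /** Number of label slots between the smallest and largest label, inclusive. */
    static int labelRange(int[] sortedLabels) {
        if (sortedLabels.length == 0) {
            throw new IllegalArgumentException("a node needs at least one arc");
        }
        return sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    }

    /** Packed layout: one fixed-size slot per arc that actually exists. */
    static long packedBytes(int numArcs, int bytesPerArc) {
        return (long) numArcs * bytesPerArc;
    }

    /** Array with gaps: one fixed-size slot per label in the range, present or not. */
    static long directAddressingBytes(int[] sortedLabels, int bytesPerArc) {
        return (long) labelRange(sortedLabels) * bytesPerArc;
    }

    /**
     * Dense-id layout (assumed): a small id entry per label in the range,
     * followed by one arc slot per existing arc, so gaps cost idBytes
     * instead of bytesPerArc.
     */
    static long denseIdBytes(int[] sortedLabels, int bytesPerArc, int idBytes) {
        long idTable = (long) labelRange(sortedLabels) * idBytes;
        return idTable + packedBytes(sortedLabels.length, bytesPerArc);
    }

    /**
     * Assumed heuristic: accept direct addressing only if it grows the node
     * by at most maxExpansion times the packed size.
     */
    static boolean shouldUseDirectAddressing(int[] sortedLabels, int bytesPerArc,
                                             double maxExpansion) {
        long direct = directAddressingBytes(sortedLabels, bytesPerArc);
        long packed = packedBytes(sortedLabels.length, bytesPerArc);
        return direct <= maxExpansion * packed;
    }
}
```

Under these assumptions, a node with labels {10, 20, 30, 40} and 16-byte arcs takes 64 bytes packed, 496 bytes with direct addressing and 95 bytes with a one-byte dense-id table. A per-FST variant of the check would accumulate the packed and direct totals across nodes while building, rather than deciding node by node.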

This message was sent by Atlassian Jira
