lucene-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bruno Roustant (Jira)" <>
Subject [jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
Date Fri, 08 Nov 2019 17:35:00 GMT


Bruno Roustant commented on LUCENE-8920:

It works. I removed the labels for direct-addressing, except the first label of the arc.
It makes direct-addressing encoding even more compact. Now we have 48% of the nodes with fixed
length arcs encoded with direct-encoding by using only +12% memory (instead of +23% with my
previous patch).

I also did several things:
The presence bits table is now a reused structure (does not create long[] anymore each time
we read an arc).
I touched more code, but I think the code is now a bit clearer with more comments.
I moved the bit twiddling methods to BitUtil.

Thanks reviewers!


I noticed the direct-addressing limitation: it is less performant than binary search for reverse
lookup (lookup by output).

Is there an periodic automated monitoring of the performance / memory of commits? Could you
give me the link? I would like to watch the change with this PR.


 I definitely would like to clean more FST. But that will be the subject of another Jira
issue later. Too much duplicated code that makes reasoning and modifications hard. Also missing
docs and method contracts (e.g. expected position in the byte input).

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>                 Key: LUCENE-8920
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments:
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas
were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, the size
increase we're seeing while building (or perhaps do a preliminary pass before building) in
order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I noticed that
arc metadata is pretty large in some cases (in the 10-20 bytes) which make gaps very costly.
Associating each label with a dense id and having an intermediate lookup, ie. lookup label
-> id and then id->arc offset instead of doing label->arc directly could save a lot
of space in some cases? Also it seems that we are repeating the label in the arc metadata
when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit
from the address?

This message was sent by Atlassian Jira

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message