lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Avoid automaton Memory Usage
Date Thu, 08 Aug 2013 17:30:22 GMT
On Thu, Aug 8, 2013 at 12:54 PM, Anna Björk Nikulásdóttir
<> wrote:
> On 8.8.2013 at 12:37, Michael McCandless <> wrote:
>> <snip>
>>> What would help in my case, as I use the same FST for both analyzers, is if the same
FST object could be shared among both analyzers. So what I am doing is to store the suggester
once and use the stored file for both AnalyzingSuggester.load() and FuzzySuggester.load().
>> That's interesting ... so you mean you sometimes want fuzzy
>> suggestions and sometimes non-fuzzy ones, off the same built
>> suggester?  I believe AnalyzingSuggester and FuzzySuggester in fact
>> use the same FST (not certain) ... are you able to do
>> FuzzySuggester.load from a previously stored suggester and it
>> works?  And that's still too much RAM?
> Yes it works like a charm.

That's good to know!
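To make the store-once/load-twice pattern concrete, here is a minimal sketch. It assumes the Lucene 4.x-era suggest API (AnalyzingSuggester.store(OutputStream) / load(InputStream), with FuzzySuggester extending AnalyzingSuggester); constructor signatures and the file name used here are illustrative, so check the javadoc of your version:

```java
import java.io.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;
import org.apache.lucene.search.suggest.analyzing.FuzzySuggester;
import org.apache.lucene.util.Version;

public class SharedSuggesterFile {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);

    // Build the suggester once and store it to a file ...
    AnalyzingSuggester exact = new AnalyzingSuggester(analyzer);
    // exact.build(...);  // build from your term dictionary
    File stored = new File("suggest.bin");  // hypothetical file name
    try (OutputStream out = new FileOutputStream(stored)) {
      exact.store(out);
    }

    // ... then load the same stored file into a FuzzySuggester as well,
    // instead of rebuilding (the formats are identical today).
    FuzzySuggester fuzzy = new FuzzySuggester(analyzer);
    try (InputStream in = new FileInputStream(stored)) {
      fuzzy.load(in);
    }
  }
}
```

The catch, as discussed below, is that this still loads two copies of the FST into RAM.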

> I use it for auto-completion of non-English language terms. Often the typed beginning
of a term can be used as is, and then AnalyzingSuggester gives the best results, whereas
FuzzySuggester would give too many results that need a lot of post-processing. If the user is
lazy, or mistypes some letters, or because the Android keyboard doesn't always provide easy
access to specific letters, e.g. 'æ', 'ä', 'ß', etc., I use FuzzySuggester as a fallback when
AnalyzingSuggester doesn't yield appropriate results. It's a bit of a kludge because
FuzzySuggester doesn't boost minimal Levenshtein-distance terms.
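The post-processing step described above (preferring suggestions with fewer edits) can be done outside the suggester. This is a plain-Java sketch, not Lucene code; the method names are made up for illustration:

```java
import java.util.*;

public class EditDistanceRerank {
    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Re-rank fuzzy suggestions so terms closer to the typed query come first
    // (stable sort, so ties keep the suggester's original order).
    static List<String> rerank(String query, List<String> suggestions) {
        List<String> sorted = new ArrayList<>(suggestions);
        sorted.sort(Comparator.comparingInt(s -> levenshtein(query, s)));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(rerank("strase",
                Arrays.asList("strategie", "straße", "strasse")));
    }
}
```

Ideally, of course, FuzzySuggester would fold the edit count into its score so this step is unnecessary.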

This (not giving a better score for lookups that require fewer edits)
was a concern on the original FuzzySuggester issue ... can you open a
separate issue to explore this?  Really it should score
"appropriately", in which case maybe you could have just used
FuzzySuggester?  I don't know if anyone has time right now to work out
a patch but we should at least open the issue ...

> Performance-wise this is absolutely no problem on Android, but memory-wise it means 2x
the FST memory. At the moment one FST needs ~20 MB. If, for example, I want to simultaneously
support multiple languages, it's not going to work this way.


> Ideally all this could be done on disk/flash only. But that would need changes along
the lines of your earlier DirectByteBuffer proposal. Do you think going this way would yield
acceptable performance? And does mapping a file into memory not fill the DRAM with the complete
contents of the file over time? Are "normal" Lucene indexes accessed this way?

Well, we'd need to test performance.  Unfortunately, access to the FST
is fairly random, so unless the OS has already pulled the pages into
RAM, i.e. if the seeks are "cold", performance will suffer.  But it
could be that it's fine in your case.  Still, this (accessing the FST
from disk) is a biggish change ...
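On the memory-mapping question above: mapping a file does not eagerly copy it into DRAM; the OS pulls pages in lazily on first access and can evict them under pressure, which is exactly why cold random seeks are slow. A minimal plain-NIO sketch of the mechanism (not actual Lucene FST code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class MmapSketch {
    // Map a file read-only. Pages are faulted in by the OS on first access,
    // so only the touched parts of the file ever occupy physical memory,
    // and none of it counts against the Java heap.
    static MappedByteBuffer map(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fst", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        MappedByteBuffer buf = map(tmp);
        // Absolute, random-access reads without a heap copy of the file:
        System.out.println(buf.get(2));
        Files.delete(tmp);
    }
}
```

This is also how Lucene's MMapDirectory accesses "normal" indexes on 64-bit JVMs.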

>>> Unfortunately there is no immutable FST class, but as I do not use it in a multithreaded
environment, that is probably not a problem, no? A quick fix could be to copy the analyzer
classes, change them to behave this way, and reuse the FST object. Does this make sense
functionally, or should I expect problems?
>> Sharing an FST across analyzing and fuzzy suggesters does seem
>> worthwhile; it may "just work" today…
> I will try that then. Do you have any reason to believe it could stop working at some
point in the future?

Can you also open a separate issue for this (allowing both fuzzy and
non-fuzzy access to one FST)?  Today the formats are in fact
identical, but unless we make an effort to support this (it could be
as easy as accepting maxEdits=0 ... hmm, is that allowed / does it
"just work" today?) they can easily diverge over time.  It's
crazy that you have to load the same FST twice today...
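For reference, the expanded FuzzySuggester constructor in Lucene 4.x-era code exposes maxEdits directly. This fragment is a sketch only; the argument names, order, and defaults shown are assumptions from that era's API, and whether maxEdits=0 is actually accepted is exactly the open question above:

```java
// Hypothetical use of the long-form constructor -- verify against your
// version's javadoc before relying on it.
FuzzySuggester fuzzy = new FuzzySuggester(
    indexAnalyzer, queryAnalyzer,
    AnalyzingSuggester.EXACT_FIRST | AnalyzingSuggester.PRESERVE_SEP,  // options
    256,     // maxSurfaceFormsPerAnalyzedForm
    -1,      // maxGraphExpansions (unlimited)
    true,    // preservePositionIncrements
    0,       // maxEdits: 0 would mimic AnalyzingSuggester -- if it's allowed
    true,    // transpositions
    1,       // nonFuzzyPrefix
    3,       // minFuzzyLength
    false);  // unicodeAware
```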

Maybe we just merge the two suggesters ... who knows :)  These classes
are all very new and experimental so we should feel free to make heavy
changes.

Mike McCandless
