lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3297) FST doesn't fully share common prefix across all outputs
Date Tue, 09 Aug 2011 10:43:27 GMT


Michael McCandless commented on LUCENE-3297:

If we can indeed make the code more generic without losing (too much)
perf, that would be awesome... I'm just having trouble seeing how
adding an explicit <eps> label would be more generic, since <eps> would
only (and always) be used in exactly one special-cased place (the
root arc), I think?

I must be missing something in your proposal...

Or, are you suggesting we actually make a "before start" symbol (hmm,
the mirror image of FST.END_LABEL) and always forcefully/explicitly
insert this in front of every byte[] passed to Builder?  This would in
fact fix this issue, since Builder should push a global output prefix
onto that first arc... and then that first arc would become the FST's
root arc.
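
To make the "before start" idea concrete, here is a minimal sketch in plain Java. It does not use the real Builder or FST classes: START_LABEL, withStartLabel and commonPrefix are hypothetical names invented for illustration, and the value -2 is an arbitrary placeholder. It only shows the two pieces involved, assuming the proposal works as described above: every input gets the extra leading label, and the prefix shared by all outputs is what could then live on that single first arc.

```java
import java.util.Arrays;

public class StartLabelSketch {
    // Hypothetical "before start" label, playing the mirror-image role of
    // FST.END_LABEL. This is NOT a real Lucene constant; -2 is a placeholder.
    static final int START_LABEL = -2;

    /** Prepends START_LABEL to an input, as the proposal would do for every byte[] given to Builder. */
    static int[] withStartLabel(byte[] input) {
        int[] labels = new int[input.length + 1];
        labels[0] = START_LABEL;
        for (int i = 0; i < input.length; i++) {
            labels[i + 1] = input[i] & 0xFF; // treat bytes as unsigned labels
        }
        return labels;
    }

    /** Longest prefix shared by all outputs: the part that could sit once on the START_LABEL arc. */
    static byte[] commonPrefix(byte[][] outputs) {
        if (outputs.length == 0) {
            return new byte[0];
        }
        int len = outputs[0].length;
        for (byte[] out : outputs) {
            int i = 0;
            while (i < len && i < out.length && out[i] == outputs[0][i]) {
                i++;
            }
            len = i; // shrink to what this output still shares
        }
        return Arrays.copyOf(outputs[0], len);
    }

    public static void main(String[] args) {
        byte[][] outputs = { {1, 1, 1, 7}, {1, 1, 1, 9}, {1, 1, 1} };
        System.out.println(Arrays.toString(withStartLabel(new byte[] {'a', 'b'}))); // [-2, 97, 98]
        System.out.println(Arrays.toString(commonPrefix(outputs)));                 // [1, 1, 1]
    }
}
```

Because every labelled input now starts with the same symbol, all inputs share one first arc, which is the only place a global output prefix can be stored exactly once.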

> FST doesn't fully share common prefix across all outputs
> --------------------------------------------------------
>                 Key: LUCENE-3297
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
> FST will try to share prefixes of outputs when possible, however in the [I think unusual
> in practice] case where all outputs share a common prefix, FST really ought to store this
> just once, on the root arc, but instead it's only able to push back to the N root arcs.  It's
> sort of an off-by-one on how far back the pushing goes...
> One [synthetic] example where this makes a big difference is the new Test2BPostings test,
> when it uses MemoryCodec, because this test has 26 terms (letters of alphabet) and each term
> has exactly the same long (~85 MB) all 1s byte[] as the postings.  If we fixed this issue,
> then the resulting FST would only be ~85 MB but now instead it needs to be ~85 * 26 MB.
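
The size estimate in the description can be checked with a rough back-of-the-envelope sketch. The class and method names below are hypothetical, and the model is deliberately crude: it counts only the shared output bytes and ignores arc overhead and any non-shared output suffixes. It assumes, as in the Test2BPostings example, that each of the 26 terms starts with a distinct letter and so gets its own root arc.

```java
public class OutputSizeSketch {
    /**
     * Approximate MB spent on the shared output prefix.
     * If the prefix can only be pushed back to the N root arcs, it is stored N times;
     * if it can be pushed onto a single root arc, it is stored once.
     */
    static long sharedPrefixMb(int rootArcs, long prefixMb, boolean singleRootArc) {
        return singleRootArc ? prefixMb : rootArcs * prefixMb;
    }

    public static void main(String[] args) {
        int terms = 26;     // letters of the alphabet, one root arc each
        long prefixMb = 85; // ~85 MB all-1s postings shared by every term
        System.out.println(sharedPrefixMb(terms, prefixMb, false)); // 2210
        System.out.println(sharedPrefixMb(terms, prefixMb, true));  // 85
    }
}
```

So under these assumptions the current behavior costs roughly 2210 MB for the shared bytes, versus roughly 85 MB if the prefix were stored once.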
