lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3297) FST doesn't fully share common prefix across all outputs
Date Sun, 07 Aug 2011 17:52:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080611#comment-13080611
] 

Dawid Weiss commented on LUCENE-3297:
-------------------------------------

Ok, I've found the issue I was looking for. Mike, this should be solved by changing the logical
organization of the fst. This would also solve the issue of storing an "empty" automaton and
its output (which is currently hardcoded in an unpleasant way). The solution is to introduce
an epsilon-labeled arc (epsilon meaning an out-of-alphabet symbol, possibly a flag). The "root"
of the fst would be an eps-labeled transition to the actual root node. Then, an empty automaton
is basically an eps-terminal transition and prefix sharing you talk about in this issue is
whatever can be pushed onto this eps-arc. The automaton remains deterministic, it's just a
small change of logic (and addition of the special eps symbol).

> FST doesn't fully share common prefix across all outputs
> --------------------------------------------------------
>
>                 Key: LUCENE-3297
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3297
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>
> FST will try to share prefixes of outputs when possible, however in the [I think unusual
in practice] case where all outputs share a common prefix, FST really ought to store this
just once, on the root arc, but instead it's only able to push back to the N root arcs.  It's
sort of an off-by-one on how far back the pushing goes...
> One [synthetic] example where this makes a big difference is the new Test2BPostings test,
when it uses MemoryCodec, because this test has 26 terms (letters of alphabet) and each term
has exactly the same long (~85 MB) all 1s byte[] as the postings.  If we fixed this issue,
then the resulting FST would only be ~85 MB but now instead it needs to be ~85 * 26 MB.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message