lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3297) FST doesn't fully share common prefix across all outputs
Date Mon, 08 Aug 2011 10:10:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080878#comment-13080878 ]

Michael McCandless commented on LUCENE-3297:
--------------------------------------------

We sort of have this today, in the root arc (FST.getFirstArc), but it "avoids" eps by not
setting the arc's label at all, i.e. you're only allowed/expected to use this arc's target
state and its output/nextFinalOutput.  This arc is how the consumer of the FST API "gets
started" in accessing the FST.

Adding an eps label would make me nervous :)  For our FST impl (limited because we only support
determinized FSTs) we'd never see an eps transition anywhere else, right?  I.e., it'd only be
for the root arc.

So the FST already today can represent a globally shared output prefix; the challenge in fixing
this issue is to fix the Builder impl to be able to push the output all the way back onto
this root arc; the phase where we push outputs as far back as possible doesn't push far enough...
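
To make the idea concrete, here is a minimal sketch of the "push one step further" step in
isolation.  It does not use the Lucene FST or Builder classes; RootPrefixSketch, commonPrefix
and stripPrefix are hypothetical names, and outputs are modeled as plain byte[] values.  It
only shows that a prefix shared by every output can be factored out once (what the root arc
would carry), leaving each per-term output with just its remainder.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Lucene Builder implementation.
class RootPrefixSketch {

    /** Returns the longest prefix shared by every output (empty if there are no outputs). */
    static byte[] commonPrefix(List<byte[]> outputs) {
        if (outputs.isEmpty()) {
            return new byte[0];
        }
        byte[] first = outputs.get(0);
        int len = first.length;
        for (byte[] out : outputs) {
            int max = Math.min(len, out.length);
            int i = 0;
            while (i < max && first[i] == out[i]) {
                i++;
            }
            len = i; // the shared prefix can only shrink as more outputs are seen
        }
        return Arrays.copyOf(first, len);
    }

    /** Returns what is left of an output once the shared prefix has moved to the root. */
    static byte[] stripPrefix(byte[] output, int prefixLen) {
        return Arrays.copyOfRange(output, prefixLen, output.length);
    }

    public static void main(String[] args) {
        List<byte[]> outputs = Arrays.asList(
            new byte[] {1, 1, 1, 7},
            new byte[] {1, 1, 1, 8},
            new byte[] {1, 1, 1});

        byte[] rootOutput = commonPrefix(outputs); // stored once, on the root arc
        System.out.println("root output: " + Arrays.toString(rootOutput));
        for (byte[] out : outputs) {
            System.out.println("remainder:   "
                + Arrays.toString(stripPrefix(out, rootOutput.length)));
        }
    }
}
```

In the real Builder the pushing happens incrementally as states are frozen, so this batch
computation is only a model of the end result, not of how the fix would be implemented.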

> FST doesn't fully share common prefix across all outputs
> --------------------------------------------------------
>
>                 Key: LUCENE-3297
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3297
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>
> FST will try to share prefixes of outputs when possible, however in the [I think unusual
> in practice] case where all outputs share a common prefix, FST really ought to store this
> just once, on the root arc, but instead it's only able to push back to the N root arcs.  It's
> sort of an off-by-one on how far back the pushing goes...
> One [synthetic] example where this makes a big difference is the new Test2BPostings test,
> when it uses MemoryCodec, because this test has 26 terms (letters of alphabet) and each term
> has exactly the same long (~85 MB) all 1s byte[] as the postings.  If we fixed this issue,
> then the resulting FST would only be ~85 MB but now instead it needs to be ~85 * 26 MB.
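
The size estimate in the description can be worked through as a rough calculation.  The
sketch below is an assumption-laden model, not a measurement: FstSizeEstimate and its methods
are hypothetical names, and it counts only the output bytes, ignoring labels, arc flags and
any other per-arc overhead.

```java
// Hypothetical back-of-the-envelope model; counts output bytes only.
class FstSizeEstimate {

    /** Total output size when each of the N root arcs stores its own copy. */
    static long perArcTotalMb(int terms, long outputMb) {
        return Math.multiplyExact((long) terms, outputMb);
    }

    /** Total output size when the shared output is stored once on the root arc. */
    static long rootSharedTotalMb(int terms, long outputMb) {
        return terms == 0 ? 0 : outputMb;
    }

    public static void main(String[] args) {
        int terms = 26;     // one term per letter of the alphabet
        long outputMb = 85; // approximate size of the identical postings byte[]

        System.out.println("copied per root arc: ~" + perArcTotalMb(terms, outputMb) + " MB");
        System.out.println("shared on root arc:  ~" + rootSharedTotalMb(terms, outputMb) + " MB");
    }
}
```

With the figures from the description this gives roughly 2210 MB versus 85 MB, i.e. a factor
equal to the number of terms.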

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
