lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilterâ„¢
Date Wed, 06 Jul 2011 10:54:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060471#comment-13060471
] 

Michael McCandless commented on LUCENE-3233:
--------------------------------------------

bq. java.lang.IllegalStateException: max arc size is too large (445)

Ahh -- to fix this we have to call Builder.setAllowArrayArcs(false), ie, disable the array
arcs in the FST (and this binary search lookup for finding arcs!).  I had to do this also
for MemoryCodec, since postings encoded as output per arc can be more than 256 bytes, in general.

This will hurt perf, ie, the arc lookup cannot use a binary search; it's because of a silly
limitation in the FST representation, that we use a single byte to hold the max size of all
arcs, so that if any arc is > 256 bytes we are unable to encode it as an array.  We could
fix this (eg, use vInt), however, arcs with such widely varying sizes (due to widely varying
outputs on each arc) will be very wasteful in space because all arcs will use up a fixed number
of bytes when represented as an array.

For now I think we should just call the above method, and then test the resulting perf.

> HuperDuperSynonymsFilterâ„¢
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch,
LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState
instead of cloneAttributes()

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message