lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3233) HuperDuperSynonymsFilter™
Date Wed, 06 Jul 2011 17:29:16 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060705#comment-13060705 ]

Robert Muir commented on LUCENE-3233:
-------------------------------------

{quote}
The difference in build time is surprising to me. Any theory why SynonymFilterFactory takes
so much more time to build?
{quote}

Yes, it's the n^2 portion. When you have a synonym entry like this: a, b, c, d
in reality this creates entries like this:
a -> a
a -> b
a -> c
a -> d
b -> a
b -> b
...
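
To make the blow-up concrete, here is a minimal sketch of that expansion in plain Java. The class and method names (SynonymExpansion, expand) are hypothetical and only illustrate the pairing; this is not the SynonymFilterFactory code.

```java
import java.util.ArrayList;
import java.util.List;

public class SynonymExpansion {

    /**
     * Expands one comma-separated synonym entry into every (input, output)
     * pair, including the identity pairs such as a -> a.
     * An entry with n terms yields n * n pairs.
     */
    static List<String[]> expand(String entry) {
        String[] terms = entry.trim().split("\\s*,\\s*");
        List<String[]> rules = new ArrayList<>();
        for (String input : terms) {
            for (String output : terms) {
                rules.add(new String[] {input, output});
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String[]> rules = expand("a, b, c, d");
        for (String[] rule : rules) {
            System.out.println(rule[0] + " -> " + rule[1]);
        }
        // 4 terms produce 4 * 4 = 16 pairs.
        System.out.println(rules.size() + " rules");
    }
}
```

An entry with 100 terms would produce 10,000 pairs this way, so the cost of handling each pair matters a lot at build time.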

In the current impl, this is done using some inefficient data structures (like nested CharArrayMaps
with Token), as well as calling merge().

In the FST impl, we don't use any nested structures (instead, input and output entries are
just phrases), and we explicitly deduplicate both inputs and outputs during construction.
The FST output is basically just a List<Integer> pointing to ords in the deduplicated BytesRefHash.

So during construction, when you add(), it's just a HashMap lookup on the input phrase, a BytesRefHash
get/put on the UTF16toUTF8WithHash result to get the output ord, and an append to an ArrayList.
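
The add() path described above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain java.util maps stand in for the BytesRefHash and the FST, and the names (DedupSynonymBuilder, uniqueOutputs, ordsFor) are hypothetical, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSynonymBuilder {

    // Stand-in for the deduplicated BytesRefHash: output phrase -> ord.
    private final Map<String, Integer> outputOrds = new HashMap<>();

    // Input phrase -> list of output ords (the role the FST output plays).
    private final Map<String, List<Integer>> rules = new HashMap<>();

    /** Adds one rule: a hash lookup per side plus a list append. */
    public void add(String input, String output) {
        Integer ord = outputOrds.get(output);
        if (ord == null) {
            // First time this output phrase is seen: assign the next ord.
            ord = outputOrds.size();
            outputOrds.put(output, ord);
        }
        rules.computeIfAbsent(input, k -> new ArrayList<>()).add(ord);
    }

    /** Number of distinct output phrases stored. */
    public int uniqueOutputs() {
        return outputOrds.size();
    }

    /** Ords of the outputs registered for an input phrase, in insertion order. */
    public List<Integer> ordsFor(String input) {
        return rules.getOrDefault(input, Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        DedupSynonymBuilder builder = new DedupSynonymBuilder();
        String[] terms = {"a", "b", "c", "d"};
        for (String in : terms) {
            for (String out : terms) {
                builder.add(in, out);
            }
        }
        // 16 rules were added, but only 4 distinct output phrases are stored.
        System.out.println(builder.uniqueOutputs() + " unique outputs");
        System.out.println("b -> ords " + builder.ordsFor("b"));
    }
}
```

The point of the sketch is that each of the n^2 rules costs only a small list entry, while every phrase is stored once no matter how many rules refer to it.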

This code isn't really optimized right now, and we can definitely speed it up even more in
the future. But the main thing right now is to ensure the filter performance is good.


> HuperDuperSynonymsFilter™
> -------------------------
>
>                 Key: LUCENE-3233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3233
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-3223.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch,
LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch, LUCENE-3233.patch,
synonyms.zip
>
>
> The current synonymsfilter uses a lot of ram and cpu, especially at build time.
> I think yesterday I heard about "huge synonyms files" three times.
> So, I think we should use an FST-based structure, sharing the inputs and outputs.
> And we should be more efficient with the tokenStream api, e.g. using save/restoreState
instead of cloneAttributes()
