lucene-dev mailing list archives

From "Agnieszka (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3245) Poor performance of Hunspell with Polish Dictionary
Date Wed, 14 Mar 2012 13:34:38 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229180#comment-13229180 ]

Agnieszka commented on SOLR-3245:
---------------------------------

I ran one more test of Hunspell in Solr 4.0, this time with the English dictionary (from OpenOffice.org).
The problem does not seem to occur with the English dictionary.

Solr 4.0, full import 489017 documents, Hunspell, English dictionary:

3146 seconds, 155 docs/sec


However, I am not sure this result is reliable, because I tested the English dictionary on
documents containing Polish text.
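
As a cross-check of the figures in this comment and in the quoted issue description, the throughput and slowdown can be recomputed from the raw document count and timings. This is only a small sketch: the run labels and the helper names (docs_per_sec, slowdown) are my own, and the only inputs are the numbers reported in this thread.

```python
# Figures copied from the reports in this thread: label -> (documents, seconds).
# The labels and helper names are illustrative, not part of Solr.
RUNS = {
    "Solr 3.4 Stempel": (489017, 2908),
    "Solr 3.4 Hunspell (Polish)": (489017, 3922),
    "Solr 4.0 Stempel": (489017, 3016),
    "Solr 4.0 Hunspell (Polish)": (489017, 44580),
    "Solr 4.0 Hunspell (English)": (489017, 3146),
}


def docs_per_sec(docs, seconds):
    """Average indexing throughput over a full import."""
    return docs / seconds


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


# Rounded throughput for each run, for comparison with the reported docs/sec.
THROUGHPUT = {label: round(docs_per_sec(d, s)) for label, (d, s) in RUNS.items()}

if __name__ == "__main__":
    for label, rate in THROUGHPUT.items():
        print(f"{label}: {rate} docs/sec")

    polish_40 = RUNS["Solr 4.0 Hunspell (Polish)"][1]
    polish_34 = RUNS["Solr 3.4 Hunspell (Polish)"][1]
    english_40 = RUNS["Solr 4.0 Hunspell (English)"][1]
    print(f"Polish Hunspell, 4.0 vs 3.4: {slowdown(polish_40, polish_34):.1f}x slower")
    print(f"Hunspell in 4.0, Polish vs English: {slowdown(polish_40, english_40):.1f}x slower")
```

The recomputed rates agree with the reported docs/sec values, and the Polish Hunspell import in 4.0 takes roughly 11 times as long as in 3.4.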
                
> Poor performance of Hunspell with Polish Dictionary
> ---------------------------------------------------
>
>                 Key: SOLR-3245
>                 URL: https://issues.apache.org/jira/browse/SOLR-3245
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>         Environment: CentOS 6.2, kernel 2.6.32, 2 physical Xeon 5606 CPUs (4 cores each), 32 GB RAM, 2 SSD disks in RAID 0, Java version 1.6.0_26, Java settings -server -Xms4096M -Xmx4096M
>            Reporter: Agnieszka
>              Labels: performance
>         Attachments: pl_PL.zip
>
>
> In Solr 4.0, the Hunspell stemmer with the Polish dictionary performs poorly, whereas Hunspell from http://code.google.com/p/lucene-hunspell/ in Solr 3.4 performs very well.
> Test results:
> Solr 3.4, full import 489017 documents:
> StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec 
> HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
> Solr 4.0, full import 489017 documents:
> StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec 
> HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
> My schema is quite simple. For Hunspell, I have one text field into which I copy 14 text fields:
> {code:xml}
> <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>
> <copyField source="field1" dest="text"/>  
> ....
> <copyField source="field14" dest="text"/>
> {code}
> The "text_pl_hunspell" configuration:
> {code:xml}
> <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>         <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>         <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
>       </analyzer>
>     </fieldType>
> {code}
> I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files stopwords_pl.txt, protwords_pl.txt, and synonyms_pl.txt are empty). These are the same files I used with version 3.4.
> For the Stempel Polish stemmer, the only difference is in the definition of the text field:
> {code}
> <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>
>     <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StempelPolishStemFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StempelPolishStemFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
>       </analyzer>
>     </fieldType>
> {code}
> One document has 23 fields:
> - 14 text fields copied into one text field (above) that is only indexed
> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
> The size of one document is 3-4 kB.
