lucene-dev mailing list archives

From "Robert Muir (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-2968) Hunspell very high memory use when loading dictionary
Date Thu, 15 Dec 2011 20:01:31 GMT

     [ https://issues.apache.org/jira/browse/SOLR-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2968:
------------------------------

    Attachment: patch.txt

here's a patch cutting this thing over to use less RAM once it's started. but it probably uses
more initially when parsing, mainly because we cannot guarantee the input is in sorted order.
I think we should fix that, so that jumping through hoops is the exception rather than the rule:
* we allow multiple dictionary files... is this really needed?
* if you use ignoreCase it means entries can be out of sorted order too.
* in some strange encodings the order in the original file could differ from binary order.

the building could just do the 2-phase thing it does now for the crazy cases and be efficient
for the 90% case if we clean up.
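
A rough sketch of that idea, not code from the patch: check whether the entries
are already in binary (unsigned byte) order and only pay for a sort when they
are not. The class and method names here (SortedOrFallback, isSorted, prepare)
are made up for illustration, and UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene/Solr or the attached patch.
final class SortedOrFallback {

    /** Compares two byte arrays as unsigned bytes, i.e. binary order. */
    static int compareBinary(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if the entries are in non-decreasing binary order of their UTF-8 bytes. */
    static boolean isSorted(List<String> entries) {
        byte[] prev = null;
        for (String entry : entries) {
            byte[] cur = entry.getBytes(StandardCharsets.UTF_8);
            if (prev != null && compareBinary(prev, cur) > 0) {
                return false;
            }
            prev = cur;
        }
        return true;
    }

    /**
     * Fast path: if the (optionally lowercased) entries are already sorted,
     * return them as-is. Fallback: sort them, which is the expensive case.
     */
    static List<String> prepare(List<String> entries, boolean ignoreCase) {
        List<String> out = new ArrayList<>(entries.size());
        for (String entry : entries) {
            out.add(ignoreCase ? entry.toLowerCase(Locale.ROOT) : entry);
        }
        if (!isSorted(out)) {
            out.sort((x, y) -> compareBinary(
                    x.getBytes(StandardCharsets.UTF_8),
                    y.getBytes(StandardCharsets.UTF_8)));
        }
        return out;
    }
}
```

Note how "Zebra", "apple" is sorted in binary order as written, but stops being
sorted once ignoreCase lowercases it, which is the second bullet above.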

The remaining problems:
* fix existing confusion in the dictionary api (like multiple input files) so that most of
the time we can rely upon sorted order.
* solr should never instantiate more than one of the same dictionary across different fields
(that's a factory issue, i'm not going to deal with it here, but it's just stupid if the factory
does this).
* anything in the patch with nocommit, TODO, or bogus should be fixed.
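
For the factory point, one possible shape is a small cache keyed by the
dictionary location, so that two fields asking for the same dictionary get the
same instance. This is only a sketch under assumptions: DictionaryCache and its
loader function are hypothetical names, not Solr's factory API, and the key is
assumed to capture everything that affects loading (files, ignoreCase, etc.).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache, not part of Solr. D stands in for the dictionary type.
final class DictionaryCache<D> {
    private final ConcurrentMap<String, D> cache = new ConcurrentHashMap<>();
    private final Function<String, D> loader;

    DictionaryCache(Function<String, D> loader) {
        this.loader = loader;
    }

    /** Loads the dictionary for this key at most once and shares it afterwards. */
    D get(String key) {
        return cache.computeIfAbsent(key, loader);
    }
}
```

With something like this, each field's factory would call get() with the same
key and the expensive parse would happen only for the first caller.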

                
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
>                 Key: SOLR-2968
>                 URL: https://issues.apache.org/jira/browse/SOLR-2968
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Maciej Lisiewski
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load dictionary/rules
> files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will cause the whole core
> to crash with various out of memory errors unless you set max heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with 1/8 of that
> (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

