lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
Date Thu, 01 Dec 2011 10:41:40 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160800#comment-13160800 ]

Robert Muir commented on SOLR-2888:
-----------------------------------

Looks good, a few nits:
* ByteSequencesReader is complementary to itself
* ExternalRefSorter.close() shouldn't mask exceptions, I don't think? The caller can do this in a try/catch
* Same with the new save()/read() methods added to FST
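
To illustrate the last two nits, here is a minimal sketch of the difference between masking an exception inside close() and letting the caller handle it. TempFileSorter and CloseDemo are hypothetical names invented for this example. They are not the ExternalRefSorter or FST code from the patch, and the failure is simulated with a flag.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for a sorter that owns an on-disk resource.
// This is an illustration only, not code from the SOLR-2888 patch.
final class TempFileSorter implements Closeable {
    private final boolean failOnClose; // simulates a cleanup failure
    private boolean closed;

    TempFileSorter(boolean failOnClose) {
        this.failOnClose = failOnClose;
    }

    // Masking variant (the pattern the nit argues against): the failure is
    // swallowed here, so the caller never learns that cleanup went wrong.
    void closeQuietly() {
        try {
            close();
        } catch (IOException e) {
            // deliberately ignored
        }
    }

    // Propagating variant: report the failure and let the caller decide.
    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        if (failOnClose) {
            throw new IOException("could not delete temporary file");
        }
    }
}

final class CloseDemo {
    // A caller that wants to tolerate the failure does so explicitly.
    static String closeAndReport(TempFileSorter sorter) {
        try {
            sorter.close();
            return "closed";
        } catch (IOException e) {
            return "close failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(closeAndReport(new TempFileSorter(false)));
        System.out.println(closeAndReport(new TempFileSorter(true)));
    }
}
```

With the propagating variant the decision to ignore a failed close stays with the caller, which is what the try/catch suggestion above amounts to.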

                
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch
>
>
> This issue addresses several problems:
> - UTF-16 was previously used to store and look up terms; UTF-8 is used now.
> - Construction would OOM with a large number of terms because of the need to sort entries. Sorter APIs have been added, along with an implementation of external (on-disk) sorting (Robert Muir).
> - The FSTLookup class has been split and refactored into FSTCompletion and FSTCompletionBuilder. FSTCompletionLookup remains a facade that connects all the pieces and implements the Lookup interface. For large inputs, use FSTCompletionBuilder directly (and pre-bucket your input weights).
> - Automatic bucketing in FSTCompletionLookup has been changed from linear min/max discretization to dividing the values into ranges after they have all been sorted. This empirically handles all potential distributions quite well. If somebody needs something very specific, use FSTCompletionBuilder directly (providing buckets), construct the automaton and then load it with FSTCompletionLookup.
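
The bucketing change described in the last item can be sketched as follows. This is an assumption-laden illustration, not the FSTCompletionLookup implementation: BucketingSketch, linearBuckets and rankBuckets are hypothetical names, the number of buckets is assumed to be at least 1, and the exact range boundaries used by the patch may differ.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration of the two bucketing strategies; not patch code.
final class BucketingSketch {

    // Linear min/max discretization: split [min, max] into equal-width
    // intervals and assign each weight to the interval it falls into.
    static int[] linearBuckets(long[] weights, int buckets) {
        long min = Arrays.stream(weights).min().orElse(0);
        long max = Arrays.stream(weights).max().orElse(0);
        double width = (double) (max - min + 1) / buckets;
        int[] out = new int[weights.length];
        for (int i = 0; i < weights.length; i++) {
            out[i] = Math.min(buckets - 1, (int) ((weights[i] - min) / width));
        }
        return out;
    }

    // Rank-based bucketing: sort the weights, then split the sorted order
    // into equal-sized ranges. Outliers no longer stretch the scale.
    // Note: equal weights may land in adjacent buckets in this simple sketch.
    static int[] rankBuckets(long[] weights, int buckets) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingLong(i -> weights[i]));
        int[] out = new int[weights.length];
        for (int rank = 0; rank < order.length; rank++) {
            out[order[rank]] = (int) ((long) rank * buckets / order.length);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] weights = {1, 2, 3, 1000};
        System.out.println(Arrays.toString(linearBuckets(weights, 4))); // [0, 0, 0, 3]
        System.out.println(Arrays.toString(rankBuckets(weights, 4)));   // [0, 1, 2, 3]
    }
}
```

With a single large outlier, the linear scheme collapses the three small weights into one bucket, while the rank-based scheme still separates them. That is one way to read the claim that sorting first handles skewed distributions better.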

        
