lucene-dev mailing list archives

From "Dawid Weiss (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2761) FSTLookup should use long-tail like discretization instead of proportional (linear)
Date Wed, 16 Nov 2011 12:20:51 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151160#comment-13151160 ]

Dawid Weiss commented on SOLR-2761:
-----------------------------------

This follows brainstorming discussions with Robert and Simon, who both have real use cases.
The outcome is that discretization into buckets will be hard to "get right" in the general
case. The distribution of weights may require custom tweaks and tuning, and these are best
done before weights are added to the FSTLookup. An explicit API of the form
add(term, int bucket) will be added, along with adapters over TermFreqIterator that do
min/max (value range) or long-tail (sorted input) bucketing. These adapters will be more
costly because they may require additional passes over the data or re-sorting of the input.
The add(term, int bucket) method will be cheap(er), with only a single sort required.
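
To make the difference between the two bucketing strategies concrete, here is a rough
sketch. It does not use any Lucene or Solr classes. The class name BucketingSketch and the
methods linearBuckets and rankBuckets are hypothetical names chosen for illustration, and
reading "long-tail (sorted input)" as rank-based bucketing into roughly equal-sized groups
is an assumption on my part, not the adapter that will actually be committed.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not part of Lucene/Solr.
public class BucketingSketch {

    /** Min/max (value range) bucketing: linear map from [min, max] onto 0..buckets-1. */
    public static int[] linearBuckets(long[] weights, int buckets) {
        long min = Arrays.stream(weights).min().orElse(0L);
        long max = Arrays.stream(weights).max().orElse(0L);
        int[] out = new int[weights.length];
        if (max == min) {
            return out; // all weights equal: everything lands in bucket 0
        }
        for (int i = 0; i < weights.length; i++) {
            double fraction = (double) (weights[i] - min) / (max - min);
            // Clamp so that the maximum weight falls into the last bucket.
            out[i] = (int) Math.min(buckets - 1, (long) (fraction * buckets));
        }
        return out;
    }

    /**
     * Rank-based bucketing (assumed reading of "long-tail, sorted input"):
     * sort by weight, then split the ranks into roughly equal-sized groups.
     * Note that equal weights may end up in different buckets.
     */
    public static int[] rankBuckets(long[] weights, int buckets) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingLong(i -> weights[i]));
        int[] out = new int[weights.length];
        for (int rank = 0; rank < order.length; rank++) {
            out[order[rank]] = (int) ((long) rank * buckets / order.length);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] weights = {1, 2, 3, 5, 1000}; // one dominant term, long tail of rare ones
        System.out.println(Arrays.toString(linearBuckets(weights, 5))); // [0, 0, 0, 0, 4]
        System.out.println(Arrays.toString(rankBuckets(weights, 5)));   // [0, 1, 2, 3, 4]
    }
}
```

With the linear mapping, one very frequent term pushes every other term into bucket 0, so
the buckets no longer distinguish between them. The rank-based variant keeps the ordering
visible but needs the extra sort, which is the additional cost mentioned above.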
                
> FSTLookup should use long-tail like discretization instead of proportional (linear)
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-2761
>                 URL: https://issues.apache.org/jira/browse/SOLR-2761
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 3.4
>            Reporter: David Smiley
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: 3.5, 3.6, 4.0
>
>
> The Suggester's FSTLookup implementation discretizes the term frequencies into a configurable
> number of buckets (configurable as "weightBuckets") in order to deal with FST limitations.
> The mapping of a source frequency into a bucket is a proportional (i.e. linear) mapping from
> the minimum and maximum value. I don't think this makes sense at all given the well-known
> long-tail like distribution of term frequencies. As a result of this problem, I've found it
> necessary to increase weightBuckets substantially, like >100, to get quality suggestions.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

