lucene-dev mailing list archives

From "Eks Dev (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2482) Index sorter
Date Thu, 27 May 2010 21:43:36 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872386#action_12872386
] 

Eks Dev commented on LUCENE-2482:
---------------------------------

Re: I'm not sure if I follow your use case though

Simple case: you have 100 million docs with 2 fields, CITY and TEXT.

Sorting on CITY makes the postings look like this:

    Orlando:   ---------------------------------
    New York:                                   -------------------------------------

Each CITY term then covers one contiguous run of doc IDs, which is perfectly compressible,
and this does not really affect the distribution (compressibility) of terms from the TEXT field.

If CITY remained in unsorted order (e.g. uniformly distributed), you would have to deal with
very large postings for all terms coming from this field.
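
To make the size argument concrete, here is a rough sketch, not Lucene's actual postings format. It assumes postings are stored as gaps between doc IDs in a 7-bits-per-byte variable-length encoding, and it uses made-up city names and a scaled-down corpus. All function names are hypothetical.

```python
import random

def vint_len(n):
    """Bytes needed for n >= 0 in a 7-bits-per-byte variable-length encoding."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

def build_postings(city_per_doc):
    """Map each CITY term to the ascending list of doc IDs that contain it."""
    postings = {}
    for doc_id, city in enumerate(city_per_doc):
        postings.setdefault(city, []).append(doc_id)
    return postings

def encoded_size(postings):
    """Total bytes when every postings list stores gaps between doc IDs."""
    total = 0
    for doc_ids in postings.values():
        prev = 0
        for doc_id in doc_ids:
            total += vint_len(doc_id - prev)
            prev = doc_id
    return total

random.seed(0)
# Assumption: scaled down from the 100 million docs in the example above.
NUM_DOCS, NUM_CITIES = 100_000, 1_000
unsorted_cities = [f"city{random.randrange(NUM_CITIES):04d}" for _ in range(NUM_DOCS)]
sorted_cities = sorted(unsorted_cities)  # stands in for sorting the index on CITY

unsorted_bytes = encoded_size(build_postings(unsorted_cities))
sorted_bytes = encoded_size(build_postings(sorted_cities))
print(unsorted_bytes, sorted_bytes)
```

After sorting, almost every gap is 1, so each posting fits in a single byte. A run-length style encoding could shrink each CITY list further to a (start, length) pair.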

Sorting on several fields often helps, e.g. if you have hierarchical compositions like one CITY
with many ZIP_CODES. Philosophically, sorting always increases compressibility and improves
locality of reference, but of course you need to know what you want.
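
The hierarchical case can be sketched the same way. The snippet below is only an illustration: documents are plain dicts, the ZIP codes are sample values, and `count_runs` / `max_runs_per_term` are hypothetical helpers. It assumes each ZIP_CODE belongs to exactly one CITY.

```python
import random

def count_runs(doc_ids):
    """Number of maximal runs of consecutive doc IDs in an ascending list."""
    runs = 0
    prev = None
    for doc_id in doc_ids:
        if prev is None or doc_id != prev + 1:
            runs += 1
        prev = doc_id
    return runs

def max_runs_per_term(docs, field):
    """Worst-case fragmentation: the largest run count over all terms of a field."""
    postings = {}
    for doc_id, doc in enumerate(docs):
        postings.setdefault(doc[field], []).append(doc_id)
    return max(count_runs(p) for p in postings.values())

# Illustrative hierarchy: one CITY with many ZIP_CODES.
ZIPS = {
    "Orlando": ["32801", "32803", "32804"],
    "New York": ["10001", "10002", "10003"],
}

random.seed(1)
docs = []
for _ in range(1000):
    city = random.choice(list(ZIPS))
    docs.append({"CITY": city, "ZIP_CODE": random.choice(ZIPS[city])})

# Composite sort key: CITY first, then ZIP_CODE within each city.
sorted_docs = sorted(docs, key=lambda d: (d["CITY"], d["ZIP_CODE"]))

print(max_runs_per_term(docs, "CITY"), max_runs_per_term(sorted_docs, "CITY"))
print(max_runs_per_term(docs, "ZIP_CODE"), max_runs_per_term(sorted_docs, "ZIP_CODE"))
```

With the composite key, every term of both fields ends up as a single contiguous run of doc IDs. This only holds because the fields nest; two independent fields cannot both be made contiguous by one sort order.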

> Index sorter
> ------------
>
>                 Key: LUCENE-2482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2482
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
>
>         Attachments: indexSorter.patch
>
>
> A tool to sort index according to a float document weight. Documents with high weight
> are given low document numbers, which means that they will be first evaluated. When using
> a strategy of "early termination" of queries (see TimeLimitedCollector) such sorting significantly
> improves the quality of partial results.
> (Originally this tool was created by Doug Cutting in Nutch, and used norms as document
> weights - thus the ordering was limited by the limited resolution of norms. This is a pure
> Lucene version of the tool, and it uses arbitrary floats from a specified stored field).
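
For readers unfamiliar with the idea in the issue description, the renumbering can be sketched roughly as follows. This is not the code from indexSorter.patch; the function names and the tiny example are hypothetical, and "early termination" is modelled simply as stopping after a fixed number of matching docs.

```python
def sort_by_weight(weights):
    """Return a list mapping old doc ID -> new doc ID, highest weight first."""
    # sorted() is stable, so docs with equal weight keep their relative order.
    order = sorted(range(len(weights)), key=lambda old_id: -weights[old_id])
    old_to_new = [0] * len(weights)
    for new_id, old_id in enumerate(order):
        old_to_new[old_id] = new_id
    return old_to_new

def remap_postings(postings, old_to_new):
    """Rewrite one term's postings list into the new doc ID space."""
    return sorted(old_to_new[doc_id] for doc_id in postings)

def collect_with_limit(postings, limit):
    """Stand-in for early termination: stop after `limit` matching docs."""
    return postings[:limit]

weights = [0.1, 0.9, 0.5, 0.7]       # one float weight per document
old_to_new = sort_by_weight(weights)  # doc 1 (weight 0.9) becomes doc 0

term_postings = [0, 1, 2]             # old doc IDs matching some term
new_postings = remap_postings(term_postings, old_to_new)
partial_hits = collect_with_limit(new_postings, 2)
print(old_to_new, new_postings, partial_hits)
```

Because postings are traversed in doc ID order, the two hits collected before stopping are the two highest-weight matches (old docs 1 and 2), which is the "quality of partial results" effect the description mentions.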

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

