lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2482) Index sorter
Date Sun, 16 Jan 2011 22:54:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982411#action_12982411 ]

Robert Muir commented on LUCENE-2482:
-------------------------------------

bq. I'm not sure if I follow your use case though ... please remember that this re-sorting
is applied exactly the same to all postings, so savings on one list may cause bloat on another
list.

Hi Andrzej, I came across the book chapter linked below the other day, and thought it would
be really interesting in the context of some of our newer codecs under development in trunk
and the bulkpostings branch.

I found the results presented there for index sorting with codecs like simple9 really
compelling: a significant reduction in bits/posting, especially for docids, because such
codecs can pack a lot of small deltas efficiently.

{noformat}
The first method reorders the documents in a text collection based on the number of
distinct terms contained in each document. The idea is that two documents that each
contain a large number of distinct terms are more likely to share terms than are a
document with many distinct terms and a document with few distinct terms. Therefore,
by assigning docids so that documents with many terms are close together, we may
expect a greater clustering effect than by assigning docids at random.

The second method assumes that the documents have been crawled from the Web (or
maybe a corporate Intranet). It reassigns docids in lexicographical order of URL. The
idea here is that two documents from the same Web server (or maybe even from the
same directory on that server) are more likely to share common terms than two random
documents from unrelated locations on the Internet.
{noformat}

http://www.ir.uwaterloo.ca/book/06-index-compression.pdf (see page 214: doc id reordering)
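
To make the second method in the excerpt concrete, here is a minimal sketch that reassigns docids in lexicographic order of URL and computes the resulting docid gaps for one posting list. It is only an illustration: the class name, method names, and sample URLs are made up, and none of this comes from the attached patch or from any Lucene API.

```java
import java.util.Arrays;

public class DocIdReorderSketch {

    /** Returns newId[oldId], with docids reassigned in lexicographic order of URL. */
    static int[] reorderByUrl(String[] urls) {
        Integer[] order = new Integer[urls.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort the old docids by the URL of the document they refer to.
        Arrays.sort(order, (a, b) -> urls[a].compareTo(urls[b]));
        int[] newId = new int[urls.length];
        for (int rank = 0; rank < order.length; rank++) {
            newId[order[rank]] = rank;
        }
        return newId;
    }

    /** Remaps a posting list through newId and returns its gaps (first gap = first docid). */
    static int[] gaps(int[] postings, int[] newId) {
        int[] remapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            remapped[i] = newId[postings[i]];
        }
        Arrays.sort(remapped); // postings must stay sorted by docid
        int[] result = new int[remapped.length];
        int prev = 0;
        for (int i = 0; i < remapped.length; i++) {
            result[i] = remapped[i] - prev;
            prev = remapped[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical crawl order that interleaves two hosts.
        String[] urls = {"b.org/1", "a.com/x", "b.org/2", "a.com/y", "b.org/3", "a.com/z"};
        int[] identity = {0, 1, 2, 3, 4, 5};
        int[] newId = reorderByUrl(urls);

        // A term that occurs only in the a.com documents (old docids 1, 3, 5).
        int[] postings = {1, 3, 5};
        System.out.println("gaps before: " + Arrays.toString(gaps(postings, identity)));
        System.out.println("gaps after:  " + Arrays.toString(gaps(postings, newId)));
    }
}
```

The same mapping is applied to every posting list, so a term confined to the b.org documents (old docids 0, 2, 4) ends up with a larger first gap than before. That is the trade-off raised in the quoted remark at the top of this comment.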


> Index sorter
> ------------
>
>                 Key: LUCENE-2482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2482
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1, 4.0
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 3.1, 4.0
>
>         Attachments: indexSorter.patch
>
>
> A tool to sort index according to a float document weight. Documents with high weight
> are given low document numbers, which means that they will be first evaluated. When using
> a strategy of "early termination" of queries (see TimeLimitedCollector) such sorting
> significantly improves the quality of partial results.
> (Originally this tool was created by Doug Cutting in Nutch, and used norms as document
> weights - thus the ordering was limited by the limited resolution of norms. This is a pure
> Lucene version of the tool, and it uses arbitrary floats from a specified stored field).
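
As a rough illustration of the renumbering described in the issue, the sketch below builds an old-to-new docid mapping from per-document float weights so that the highest weight receives docid 0. The class and method names are hypothetical and this is not the code in indexSorter.patch; it assumes the weights have already been read from the stored field into an array indexed by old docid.

```java
import java.util.Arrays;

public class WeightSortSketch {

    /**
     * Returns oldToNew[oldDocId]. Documents are ordered by descending weight;
     * ties keep their original relative order because Arrays.sort on objects is stable.
     */
    static int[] oldToNew(float[] weights) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Descending by weight: compare b's weight against a's.
        Arrays.sort(order, (a, b) -> Float.compare(weights[b], weights[a]));
        int[] mapping = new int[weights.length];
        for (int rank = 0; rank < order.length; rank++) {
            mapping[order[rank]] = rank;
        }
        return mapping;
    }

    public static void main(String[] args) {
        float[] weights = {0.2f, 0.9f, 0.5f, 0.9f}; // made-up weights for four documents
        System.out.println(Arrays.toString(oldToNew(weights)));
    }
}
```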

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

