lucene-dev mailing list archives

From "Eks Dev (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2482) Index sorter
Date Thu, 27 May 2010 20:45:38 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872357#action_12872357 ]

Eks Dev commented on LUCENE-2482:
---------------------------------

Nice!
There is another interesting use case for a sorted index: performance and index size.

We use a couple of low-cardinality fields (zip code, user group, and the like). Having the
index sorted on these makes RLE compression of postings really effective, making it possible
to load all values into a couple of megabytes of RAM.
At the moment we just sort the collection before indexing.
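
To illustrate why sorting helps, here is a minimal sketch of run-length encoding over a per-document value column. It is not Lucene code: the class RunLengthSketch, its Run type, and the encode method are hypothetical names used only for this example, and the values are held in a plain array indexed by document number.

```java
import java.util.ArrayList;
import java.util.List;

final class RunLengthSketch {
    /** One run: a value and the number of consecutive documents that carry it. */
    static final class Run {
        final String value;
        final int length;

        Run(String value, int length) {
            this.value = value;
            this.length = length;
        }
    }

    /** Collapses consecutive equal values (indexed by document number) into runs. */
    static List<Run> encode(String[] valuesByDocId) {
        List<Run> runs = new ArrayList<>();
        int start = 0;
        while (start < valuesByDocId.length) {
            int end = start + 1;
            // Extend the run while the next document has the same value.
            while (end < valuesByDocId.length
                    && valuesByDocId[end].equals(valuesByDocId[start])) {
                end++;
            }
            runs.add(new Run(valuesByDocId[start], end - start));
            start = end;
        }
        return runs;
    }
}
```

With documents grouped by value, the number of runs equals the number of distinct values; with the same values interleaved, it can approach the number of documents. A partially sorted index falls in between, which is where the extra memory comes from.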

Would it be possible to sort on a combination of stored fields and to specify a comparator?
Even comparing them as byte[] would do the trick for this use case, as it is only important
to keep the same values together; the order is irrelevant. Of course, a decoder that decodes
the byte[] before comparing would be useful (e.g. for composite fields), but it would work in
many cases without one.
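
A rough sketch of the byte[] idea, assuming the stored field values have already been read into an array indexed by document number. The class ByteKeyGroupingSketch and its members are hypothetical and not part of the attached patch; the comparator is a plain unsigned lexicographic comparison, which is enough to bring equal values together.

```java
import java.util.Arrays;
import java.util.Comparator;

final class ByteKeyGroupingSketch {
    /** Compares raw bytes as unsigned values, shorter array first on a shared prefix. */
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    /** Returns the old document numbers in their new order; equal keys end up adjacent. */
    static int[] sortedOrder(byte[][] keyByDocId) {
        Integer[] order = new Integer[keyByDocId.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on objects is stable, so ties keep their original relative order.
        Arrays.sort(order,
                Comparator.comparing((Integer doc) -> keyByDocId[doc], UNSIGNED_LEXICOGRAPHIC));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

A decoder for composite fields would slot in as a different key extractor or comparator; the grouping logic itself would not change.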

This works fine even with a moderate update rate, as you can re-sort periodically. The index
does not have to be totally sorted; everything still works, just slightly more memory is
needed for filters.

With flex (flexible indexing), postings that use RLE compression are quite possible... this
tool could become an "optimizeHard()" tool for some indexes :)

> Index sorter
> ------------
>
>                 Key: LUCENE-2482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2482
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
>
>         Attachments: indexSorter.patch
>
>
> A tool to sort index according to a float document weight. Documents with high weight
> are given low document numbers, which means that they will be first evaluated. When using
> a strategy of "early termination" of queries (see TimeLimitedCollector) such sorting significantly
> improves the quality of partial results.
> (Originally this tool was created by Doug Cutting in Nutch, and used norms as document
> weights - thus the ordering was limited by the limited resolution of norms. This is a pure
> Lucene version of the tool, and it uses arbitrary floats from a specified stored field).
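
The renumbering described in the issue can be sketched as follows. This is an illustration only and not the code in indexSorter.patch: the class WeightOrderSketch and the method newToOld are hypothetical, and the weights are assumed to have been parsed from the stored field into an array indexed by old document number.

```java
import java.util.Arrays;

final class WeightOrderSketch {
    /** newToOld[n] is the old document number that receives new document number n. */
    static int[] newToOld(float[] weightByOldDocId) {
        Integer[] order = new Integer[weightByOldDocId.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Descending weight, so the heaviest document gets number 0.
        // The sort is stable, so documents with equal weight keep their old relative order.
        Arrays.sort(order, (a, b) -> Float.compare(weightByOldDocId[b], weightByOldDocId[a]));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

Because documents are evaluated in document-number order, a query that stops early over such an index has already seen the highest-weight documents.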

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

