lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doron Cohen (JIRA)" <>
Subject [jira] Updated: (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
Date Thu, 12 Aug 2010 12:13:17 GMT


Doron Cohen updated LUCENE-1812:

    Attachment: pruning.patch

The pruning framework is pretty cool - it is quite easy to add a new pruning policy!

Initially I planned to focus on CarmelPruningPolicy plus add the more sophisticated algorithm
(tpoK) described in the paper, but eventually found myself doing more changes - Andrzej, I
hope you like the changes - like some methods I renamed - otherwise please feel free to rename
them back.

+Patch Details+

* Documentation changes - mainly moved things to where I thought they belong, like moving
from CarmelPruning to TFPruning the general discussion that applies to any Term pruning implementation.
* Renamed some of TermPruningPolicy methods for better readability - well, to me :\) -- hope
you agree with the new names, othewise please feel free to change back.
* Renamed CarmelTermPruningPolicy to CarmelTermPruningUniformPolicy. - quite a long name...
but descriptive, as this an enhanced form of the "uniform" case from the paper. Modified documentation
* Added CarmelTermPruningTopKPolicy - this is the more sophisticated/strong form of pruning
described in the paper. (Test case added.)
* Fixed some compiler warnings (1.5, Lucene.Version..)
* Bug (\?) in CarmelTermPruningPolicy in initTermPositions() - it sorts by docid before selecting
the top subset, but in fact this seems dead code? Added a "TODO deadcode" there, maybe I am
missing something.
* Enabled topK pruning through the PruningTool program (untested)
* Simplified the if statements in - hopefully not missing
something t here...

There's more to do though not sure when I'll have the cycles..:
* Quality/performance test for the topK pruning algorithm - using LAtimes or some other judged
  Or perhaps Robert can try it on that Persian test collection. 
* Add also the "Delta Pruning" policy as described in the paper
* Junit for CarmelTermPruningUniformPolicy


> Static index pruning by in-document term frequency (Carmel pruning)
> -------------------------------------------------------------------
>                 Key: LUCENE-1812
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.9, 3.1
>            Reporter: Andrzej Bialecki 
>         Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch
> This module provides tools to produce a subset of input indexes by removing postings
data for those terms where their in-document frequency is below a specified threshold. The
net effect of this processing is a much smaller index that for common types of queries returns
nearly identical top-N results as compared with the original index, but with increased performance.

> Optionally, stored values and term vectors can also be removed. This functionality is
largely independent, so it can be used without term pruning (when term freq. threshold is
set to 1).
> As the threshold value increases, the total size of the index decreases, search performance
increases, and recall decreases (i.e. search quality deteriorates). NOTE: especially phrase
recall deteriorates significantly at higher threshold values. 
> Primary purpose of this class is to produce small first-tier indexes that fit completely
in RAM, and store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance
of this class will not be sufficient to use the resulting index view for on-the-fly pruning
and searching. 
> NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index
produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document id-s so
that they are in sync with the original index. This means that all other auxiliary information
not necessary for first-tier processing, such as some stored fields, can also be removed,
to be quickly retrieved on-demand from the original index using the same internal document
> Threshold values can be specified globally (for terms in all fields) using defaultThreshold
parameter, and can be overriden using per-field or per-term values supplied in a thresholds
map. Keys in this map are either field names, or terms in field:text format. The precedence
of these values is the following: first a per-term threshold is used if present, then per-field
threshold if present, and finally the default threshold.
> A command-line tool (PruningTool) is provided for convenience. At this moment it doesn't
support all functionality available through API.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message