lucene-dev mailing list archives

From "Chuck Williams (JIRA)" <>
Subject [jira] Updated: (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage
Date Fri, 16 Jun 2006 04:50:31 GMT

Chuck Williams updated LUCENE-602:

    Attachment: TokenSelectorSoloAll.patch

TokenSelectorSoloAll.patch applies against today's svn head.  It only requires Java 1.4.

> [PATCH] Filtering tokens for position and term vector storage
> -------------------------------------------------------------
>          Key: LUCENE-602
>          URL:
>      Project: Lucene - Java
>         Type: New Feature

>   Components: Index
>     Versions: 2.1
>     Reporter: Chuck Williams
>  Attachments: TokenSelectorSoloAll.patch
> This patch provides a new TokenSelector mechanism to select tokens of interest and creates
two new IndexWriter configuration parameters:  termVectorTokenSelector and positionsTokenSelector.
> If termVectorTokenSelector is non-null, it selects which index tokens will be stored in term
vectors.  If positionsTokenSelector is non-null, then any tokens it rejects will have only
their first position stored in each document (one position must be stored so that the doc freq
is maintained properly, which keeps the token from being garbage collected in merges).
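
The selection rules described above can be sketched without any Lucene classes. This is
only an illustration of the stated behavior, not code from the patch: the TokenSelector
interface shown here, its accept(String) method, and the helper names are assumptions,
and the real signatures in TokenSelectorSoloAll.patch may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenSelectorSketch {

    /** Hypothetical stand-in for the patch's TokenSelector; the real interface may differ. */
    public interface TokenSelector {
        boolean accept(String termText);
    }

    /**
     * Positions that would be stored for each term of one document.
     * A null selector keeps every position; a rejected term keeps only its first position.
     */
    public static Map<String, List<Integer>> storedPositions(String[] tokens,
                                                             TokenSelector positionsSelector) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            String term = tokens[pos];
            List<Integer> positions = postings.computeIfAbsent(term, k -> new ArrayList<>());
            boolean keepAll = positionsSelector == null || positionsSelector.accept(term);
            if (keepAll || positions.isEmpty()) {
                positions.add(pos);
            }
        }
        return postings;
    }

    /** Distinct terms that would be stored in the term vector; a null selector keeps all. */
    public static List<String> termVectorTerms(String[] tokens,
                                               TokenSelector termVectorSelector) {
        List<String> terms = new ArrayList<>();
        for (String term : tokens) {
            boolean keep = termVectorSelector == null || termVectorSelector.accept(term);
            if (keep && !terms.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

With a selector that rejects marked reverse tokens, an ordinary term keeps all of its
positions while a reverse token keeps exactly one, so it still counts toward the doc freq.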
> This mechanism provides a simple solution to the problem of minimizing the index size overhead
caused by storing extra tokens that facilitate queries, in those cases where the mere existence
of the extra tokens is sufficient.  For example, in my test data using reverse tokens to speed
prefix wildcard matching, I obtained the following index overheads:
>   1.  With no TokenSelectors:  60% larger with reverse tokens than without
>   2.  With termVectorTokenSelector rejecting reverse tokens:  36% larger
>   3.  With both positionsTokenSelector and termVectorTokenSelector rejecting reverse
tokens:  25% larger
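
For readers unfamiliar with the reverse-token technique measured above, here is a minimal
sketch of how such extra tokens might be produced at analysis time. It is not taken from
the patch: the marker character and all method names are hypothetical, chosen only so
that a selector can tell reverse tokens apart from ordinary ones.

```java
import java.util.ArrayList;
import java.util.List;

public class ReverseTokenSketch {

    /** Hypothetical marker that distinguishes reverse tokens from ordinary tokens. */
    public static final char REVERSE_MARKER = '\u0001';

    /** Builds the reverse token for a term: the marker followed by the reversed text. */
    public static String reverseToken(String term) {
        return REVERSE_MARKER + new StringBuilder(term).reverse().toString();
    }

    /** True if the term carries the (assumed) reverse-token marker. */
    public static boolean isReverseToken(String term) {
        return !term.isEmpty() && term.charAt(0) == REVERSE_MARKER;
    }

    /** Emits each original token followed by its reverse token. */
    public static List<String> withReverseTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            out.add(reverseToken(token));
        }
        return out;
    }
}
```

A selector that rejects any term for which isReverseToken returns true would correspond
to cases 2 and 3 in the list above.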
> It is possible to obtain the same effect by using a separate field that has one occurrence
of each reverse token and no term vectors, but this can be hard or impossible to do, and it is
a performance problem because it requires either rereading the content or storing all the
tokens for subsequent processing.
> The solution with TokenSelectors is very easy to use and fast.
> Otis, thanks for leaving a comment in QueryParser.jj with the correct production to enable
prefix wildcards!  With this, it is a straightforward matter to override the wildcard query
factory method and use reverse tokens effectively.
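
The query-side half of the idea can be sketched as a plain string transformation: a
pattern with a single leading wildcard becomes a prefix to look up among the reverse
tokens. This is an assumption-laden illustration, not the override itself; the marker
character and the method name are hypothetical, and the result would still have to be
wrapped in a prefix query inside whatever factory method is overridden.

```java
public class ReversePrefixSketch {

    /** Hypothetical marker; it must match the one used when reverse tokens were indexed. */
    public static final char REVERSE_MARKER = '\u0001';

    /**
     * Converts a pattern such as "*ing" into the prefix to match against reverse tokens.
     * Returns null when the pattern is not a pure leading-wildcard pattern, in which
     * case the caller would fall back to the ordinary wildcard handling.
     */
    public static String toReversePrefix(String pattern) {
        if (pattern.length() < 2 || pattern.charAt(0) != '*') {
            return null;
        }
        String suffix = pattern.substring(1);
        if (suffix.indexOf('*') >= 0 || suffix.indexOf('?') >= 0) {
            return null; // other wildcards remain: not handled by this sketch
        }
        return REVERSE_MARKER + new StringBuilder(suffix).reverse().toString();
    }
}
```

Matching every term that ends in "ing" then reduces to a prefix scan over terms that
start with the marker followed by "gni", which only needs the reverse tokens to exist.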

