Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <1282275040.1235330461782.JavaMail.jira@brutus>
Date: Sun, 22 Feb 2009 11:21:01 -0800 (PST)
From: "Uwe Schindler (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Issue Comment Edited: (LUCENE-1541) Trie range - make trie
 range indexing more flexible
In-Reply-To: <717634197.1234898219722.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675695#action_12675695 ] 

thetaphi edited comment on LUCENE-1541 at 2/22/09 11:21 AM:
-----------------------------------------------------------------

After thinking about the whole issue for one more night, I doubt that non-equidistant precision steps are really needed:

Ning's comment about the number of terms is correct. But if you only index long values from, e.g., 0L to 10000L, you create a lot of distinct terms only for shift values 0 to 13 (because the values differ only in these bits; 10000 is smaller than 2^14). For shift values 14 to 63, the term is always the same constant term. The index's TermEnum therefore contains few additional entries (because it is only *one* term per shift for *all* documents); only additional TermDocs entries are created. It is the same as adding one "flag" term to all documents, which does not use much additional space in the index. When you query a range, these terms are never used, but they do not hurt.
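
As a rough illustration of this point (a simplified model, not the TrieRange code itself), the sketch below counts how many distinct prefixes `value >>> shift` occur for the values 0 to 10000. The class and method names are made up for this example, and the real trie encoding also makes the long sortable before shifting, which does not change the counts for non-negative values.

```java
import java.util.HashSet;
import java.util.Set;

public class DistinctTermsPerShift {
    /**
     * Counts the distinct prefixes (v >>> shift) over all v in [0, maxValue].
     * Each distinct prefix stands for one distinct trie term at that shift.
     */
    static int distinctPrefixes(long maxValue, int shift) {
        Set<Long> prefixes = new HashSet<>();
        for (long v = 0; v <= maxValue; v++) {
            prefixes.add(v >>> shift);
        }
        return prefixes.size();
    }

    public static void main(String[] args) {
        // Print the count for every fourth shift; any step would do here.
        for (int shift = 0; shift < 64; shift += 4) {
            System.out.println("shift " + shift + ": "
                    + distinctPrefixes(10000L, shift) + " distinct term(s)");
        }
    }
}
```

In this model, every shift from 14 upward yields exactly one prefix, because 10000 is smaller than 2^14.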

The additional space for the trie terms is generated by the higher-precision (lower-shift) values. If you index with a precision step of 4 or 2 instead of 8, you create a lot of *different* terms for the lower shift values. The constant terms in the higher shifts are still always the same and do not consume much space.
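
To put rough numbers on this, here is a small sketch under the same simplifying assumptions (contiguous values starting at 0, one term per shift that is a multiple of the precision step, hypothetical class and method names). It is not the TrieRange implementation, only a counting model.

```java
public class PrecisionStepComparison {
    /** Terms indexed per 64-bit value: one for each shift 0, step, 2*step, ... below 64. */
    static int termsPerValue(int precisionStep) {
        return (64 + precisionStep - 1) / precisionStep;
    }

    /**
     * Total number of distinct terms for all values in [0, maxValue].
     * Because the values are contiguous from 0, the prefixes at a given shift
     * are exactly 0..(maxValue >>> shift).
     */
    static long distinctTerms(long maxValue, int precisionStep) {
        long total = 0;
        for (int shift = 0; shift < 64; shift += precisionStep) {
            total += (maxValue >>> shift) + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int step : new int[] {8, 4, 2}) {
            System.out.println("precisionStep " + step
                    + ": terms per value = " + termsPerValue(step)
                    + ", distinct terms for 0..10000 = " + distinctTerms(10000L, step));
        }
    }
}
```

For 0 to 10000 this model gives 10047, 10682 and 13363 distinct terms for precision steps 8, 4 and 2. Almost all of the growth comes from the low shifts; the constant high-shift terms contribute only 6, 12 and 25 of those terms.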

I will create a small comparison of index size for long values without higher bits, but I doubt that omitting the lower-precision terms reduces the index size significantly. If this is confirmed, I do not think the additional complexity of the API is justified for such a low impact. If somebody really wants to optimize index size that much, they can create an optimized fork of TrieRange in their project that indexes with non-equidistant precision steps. On the other hand, I would suggest using ints/floats instead of longs/doubles if only lower precision is needed. In this case, fewer terms will be created. For floats in comparison to doubles, the effect will be bigger (because even small doubles use a lot of higher-precision bits); a non-equidistant precision step will not help very much here.
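
The remark about doubles can be checked with the standard `Double.doubleToLongBits` and `Float.floatToIntBits` methods. The sketch below (hypothetical class and helper names) only looks at the raw IEEE 754 bit patterns and ignores the additional sortable transformation that a trie encoding would apply.

```java
public class SignificantBits {
    /** Position of the highest set bit plus one, for a 64-bit pattern. */
    static int significantBits64(long raw) {
        return 64 - Long.numberOfLeadingZeros(raw);
    }

    /** Position of the highest set bit plus one, for a 32-bit pattern. */
    static int significantBits32(int raw) {
        return 32 - Integer.numberOfLeadingZeros(raw);
    }

    public static void main(String[] args) {
        // A small long only occupies the lowest bits.
        System.out.println("long 1L    : " + significantBits64(1L));
        // The same number as a double sets bits in the exponent, near the top.
        System.out.println("double 1.0 : " + significantBits64(Double.doubleToLongBits(1.0)));
        // As a float it also uses high bits, but there are only 32 bits in total.
        System.out.println("float 1.0f : " + significantBits32(Float.floatToIntBits(1.0f)));
    }
}
```

Because the exponent sits in the high bits, the terms at high shifts are not constant for doubles, so skipping them is not an option; switching to a 32-bit type halves the number of bits to index instead.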

      was (Author: thetaphi):
    After thinking about the whole issue for one more night, I suspect that non-equidistant precision steps are really needed:

Ning's comment about the number of terms is correct. But if you only index long values from, e.g., 0L to 10000L, you create a lot of distinct terms only for shift values 0 to 13 (because the values differ only in these bits; 10000 is smaller than 2^14). For shift values 14 to 63, the term is always the same constant term. The index's TermEnum therefore contains few additional entries (because it is only *one* term per shift for *all* documents); only additional TermDocs entries are created. It is the same as adding one "flag" term to all documents, which does not use much additional space in the index. When you query a range, these terms are never used, but they do not hurt.

The additional space for the trie terms is generated by the higher-precision (lower-shift) values. If you index with a precision step of 4 or 2 instead of 8, you create a lot of *different* terms for the lower shift values. The constant terms in the higher shifts are still always the same and do not consume much space.

I will create a small comparison of index size for long values without higher bits, but I suspect that omitting the lower-precision terms reduces the index size significantly. If this is the case, I do not think the additional complexity of the API is needed for this low impact. If somebody really wants to optimize index size that much, they can create an optimized fork of TrieRange in their project that indexes with non-equidistant precision steps. On the other hand, I would suggest using ints/floats instead of longs/doubles if only lower precision is needed. In this case, fewer terms will be created. For floats in comparison to doubles, the effect will be bigger (because even small doubles use a lot of higher-precision bits); a non-equidistant precision step will not help very much here.
  
> Trie range - make trie range indexing more flexible
> ---------------------------------------------------
>
>                 Key: LUCENE-1541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1541
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Ning Li
>            Assignee: Uwe Schindler
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1541.patch, LUCENE-1541.patch
>
>
> In the current trie range implementation, a single precision step is specified. With a large precision step (say 8), a value is indexed in fewer terms (8) but the number of terms for a range can be large. With a small precision step (say 2), the number of terms for a range is smaller but a value is indexed in more terms (32).
> We want to add an option that allows different precision steps to be set for different precisions. An expert can use this option to keep the number of terms for a range small and at the same time index a value in a small number of terms. See the discussion in LUCENE-1470 that resulted in this issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

