lucene-java-user mailing list archives

From Phil Whelan <phil...@gmail.com>
Subject Re: Enumerating NumericField using TermEnum?
Date Sun, 13 Sep 2009 17:20:26 GMT
Hi Uwe,

Thanks for the explanation! It really helps. It makes sense that for
a field with a small number of values, such as "hour", NumericField is
not going to help me. I'm experimenting with using an epoch NumericField
for sorting, which, funnily enough, is where I started with 2.4.1 before
going down the usual TooManyClauses path and breaking it down into
multiple fields. 2.9 seems a great improvement there. Downloading the
new 2.9 rc4...

Thanks,
Phil

On Sat, Sep 12, 2009 at 1:55 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi Phil,
>
> thanks for checking out NumericField. I have two comments about your
> problem:
>
>> I've used NumericField to store my "hour" field.
>>
>> Example...
>>
>>      doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12")));
>
> NumericField uses a special encoding of terms for fast NumericRangeQueries.
> It indexes more than one term per value. How many terms depends on the
> precisionStep ctor parameter. If you set it to infinity (or something >= the
> bit size of your value, 32 for ints), it indexes exactly one term per value.
> These extra terms are used for very fast numeric range queries. The overhead
> only pays off for fields with high cardinality (something > 500). For a
> simple hour field with 24 distinct values, the speed gain from
> NumericRangeQuery would be negligible; it may even be a little bit slower
> because of the additional overhead. I would suggest using NumericField only
> for real high-cardinality fields (like unix time stamps, prices,
> latitudes/longitudes (all types of float/doubles), day of year,...).
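
To make the relationship between precisionStep and the number of indexed terms concrete, here is a minimal sketch in plain Java. It does not call Lucene; the class and method names are made up for illustration, and it assumes one term is produced for each shift 0, precisionStep, 2*precisionStep, ... below the bit size of the value, which is how the description above reads.

```java
class TermsPerValue {
    // Hypothetical helper (not a Lucene API): counts the terms a trie-style
    // encoding would index for a single value, assuming one term per shift
    // 0, step, 2*step, ... while the shift stays below the value's bit size.
    static int termsPerValue(int valueBits, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        int count = 0;
        // long loop variable so that a huge precisionStep cannot overflow
        for (long shift = 0; shift < valueBits; shift += precisionStep) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(32, 4));                 // prints 8
        System.out.println(termsPerValue(32, 8));                 // prints 4
        System.out.println(termsPerValue(32, Integer.MAX_VALUE)); // prints 1
    }
}
```

Under this assumption, an int field with a precisionStep of 4 carries eight terms per value, while any step of 32 or more leaves exactly one, which is the case where plain term enumeration keeps working.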
>
> Maybe I will add this to the javadocs.
>
>> Before I was using plain string Field and enumerating them with
>> TermEnum, which worked fine.
>> Now that I'm using NumericFields, I'm not sure how to port this
>> enumeration code.
>
> As explained above, each numeric value is indexed by more than one term, so
> your TermEnum is of no use as it stands. There are some tricks to get the
> distinct values, but this needs deeper knowledge of the underlying term
> encoding (shift value,...); see the FieldCache parsers for numeric fields.
>
> As your field (hours) is of low cardinality, you can index with
> precisionStep=Integer.MAX_VALUE. Range queries will not be faster than with
> a normal TermRangeQuery, but your term enum will work. You only have to use
> NumericUtils.prefixCodedToInt() to decode the term into an int:
>
> hours.add( Integer.valueOf( NumericUtils.prefixCodedToInt( term.text() ) ) );
>
> This code would also work for other precision steps, but you would get some
> additional "lower precision terms" (values with some lower bits removed).
> You have to break out of the iteration when the first such term appears
> (see the FieldCache code).
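
The "break iteration" idea can be sketched without Lucene. The example below uses a toy term format invented for illustration (first character carries the shift, the rest is the shifted value as eight hex digits); it is not Lucene's real prefix coding, and all names are hypothetical. It relies on the assumption that full-precision terms (shift 0) sort before every lower-precision term, so enumeration can stop at the first term with a non-zero shift. The toy format only handles non-negative values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class DistinctValuesSketch {
    // Toy stand-in for a prefix-coded term, NOT Lucene's actual encoding:
    // one leading char for the shift, then the shifted value as 8 hex digits.
    static String encode(int value, int shift) {
        return (char) ('a' + shift) + String.format("%08x", value >>> shift);
    }

    static int shiftOf(String term) {
        return term.charAt(0) - 'a';
    }

    static int decode(String term) {
        return (int) (Long.parseLong(term.substring(1), 16) << shiftOf(term));
    }

    // Adds one term per precision level for the given value.
    static void index(SortedSet<String> terms, int value, int precisionStep) {
        for (long shift = 0; shift < 32; shift += precisionStep) {
            terms.add(encode(value, (int) shift));
        }
    }

    // Walks the sorted terms and stops at the first lower-precision term.
    static List<Integer> distinctValues(SortedSet<String> terms) {
        List<Integer> values = new ArrayList<Integer>();
        for (String term : terms) {
            if (shiftOf(term) > 0) {
                break; // everything from here on has lower bits removed
            }
            values.add(decode(term));
        }
        return values;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<String>();
        for (int hour : new int[] {12, 3, 23, 12}) {
            index(terms, hour, 4);
        }
        System.out.println(distinctValues(terms)); // prints [3, 12, 23]
    }
}
```

With a real index the same loop shape would apply to the TermEnum, with the shift check and decoding done through NumericUtils as the FieldCache parsers do.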
>
>> Any pointers?
>>
>> This is the code I was using previously for plain Fields.
>>
>>     ArrayList hours = new ArrayList();
>>     TermEnum termEnum = reader.terms( new Term( "hour", "" ) );
>>     Term term = null;
>>     while ( ( term = termEnum.term() ) != null ) {
>>
>>         if ( ! term.field().equals( "hour" ) )
>>             break;
>>
>>         hours.add( Integer.valueOf( term.text() ) );
>>         termEnum.next();
>>     }
>>
>> Thanks,
>> Phil
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org



-- 
Mobile: +1  778-233-4935
Website: http://philw.co.uk
Skype: philwhelan76
Twitter: philwhln
Email : phil123@gmail.com
iChat: philwhln@mac.com
