lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Enumerating NumericField using TermEnum?
Date Sat, 12 Sep 2009 08:55:12 GMT
Hi Phil,

thanks for checking out NumericField. I have two comments about your
problem:

> I've used NumericField to store my "hour" field.
> 
> Example...
> 
>      doc.add(new
> NumericField("hour").setIntValue(Integer.parseInt("12")));

NumericField uses a spezial encoding of terms for fast NumericRangeQueries.
It indexes more than one term per value. How many terms depends on the
precisionStep ctor parameter. If you set it to infinity (or something ge the
bit size of your value, 32 for ints, it indexes exactly one value). These
terms are used for very fast numeric queries. This extra overhead only has a
positive impact for field with high cardinality (something > 500). For a
simple hour field with 24 distinct values, the speed impact of
NumericRangeQuery would be neglectible, it may even be a little bit slower
because of additional overhead. I would suggest to use NumericField ony for
real high-cardinality fields (like unix time stamps, prices,
latitudes/longitudes (all types of float/doubles), day of year,...).

Maybe I add this t the javadocs.

> Before I was using plain string Field and enumerating them with
> TermEnum, which worked fine.
> Now I'm using NumericField's I'm not sure how to port this enumeration
> code.

As explained above, each numerfic value is indexed by more than one term, so
your termenum is of no use. There are some tricks to get the distict values,
but this needs deeper knowledge of the underlying term structure encoding of
terms, shift value,... - see the FieldCache parsers for numeric fields).

As your field (hours) is of low cardinality, you can index with
precisionStep=Integer.MAX_VALUE. Range queries will be not faster than with
normal TermRangeQuery and your term enum will work. You only have to use
NumericUtils.prefixCodedToInt() to decode the term into a int:

hours.add( Integer.valueOf(NumericUtils.prefixCodedToInt(term.text()) );

This code would also work for other precision steps, but you would get some
additional "lower precision terms" (values with some lower bits removed).
You have to break iteration in this case (see FieldCache code).

> Any pointers?
> 
> This is the code I was using previously for plain Fields.
> 
>     ArrayList hours = new ArrayList();
>     TermEnum termEnum = reader.terms( new Term( "hour", "" ) );
>     Term term = null;
>     while ( ( term = termEnum.term() ) != null ) {
> 
>         if ( ! term.field().equals( "hour" ) )
>             break;
> 
>         hours.add( (Integer)term.text() );
>         termEnum.next();
>     }
> 
> Thanks,
> Phil
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message