From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
References: <9cafbc680909111500j156eafe3u2cfcd21b4f57ce37@mail.gmail.com>
Subject: RE: 
Enumerating NumericField using TermEnum?
Date: Sat, 12 Sep 2009 10:55:12 +0200

Hi Phil,

thanks for checking out NumericField. I have two comments about your problem:

> I've used NumericField to store my "hour" field.
>
> Example...
>
> doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12")));

NumericField uses a special encoding of terms for fast NumericRangeQueries. It indexes more than one term per value. How many terms depends on the precisionStep constructor parameter. If you set it to infinity (or to anything greater than or equal to the bit size of your value, 32 for ints), it indexes exactly one term per value. The additional terms are what make numeric range queries very fast. This extra overhead only pays off for fields with high cardinality (roughly more than 500 distinct values). For a simple hour field with 24 distinct values, the speed gain from NumericRangeQuery would be negligible; it may even be a little slower because of the additional overhead. I would suggest using NumericField only for real high-cardinality fields (like unix time stamps, prices, latitudes/longitudes (all types of floats/doubles), day of year, ...). Maybe I will add this to the javadocs.

> Before I was using plain string Field and enumerating them with
> TermEnum, which worked fine.
> Now I'm using NumericField's I'm not sure how to port this enumeration
> code.

As explained above, each numeric value is indexed by more than one term, so your TermEnum code is of no use as it stands. There are some tricks to get the distinct values, but they need deeper knowledge of the underlying term structure (encoding of terms, shift value, ...; see the FieldCache parsers for numeric fields).

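To make the term structure a bit more concrete, here is a simplified model in plain Java. It is only a sketch and not Lucene code: the class PrecisionStepModel and its method termsFor are made up for illustration, and it ignores the sign-bit handling and the string encoding that the real NumericUtils applies. For one int value it lists the shift and the remaining bits of every term that would be indexed for a given precisionStep:

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the terms indexed for one int value (hypothetical, not Lucene code). */
final class PrecisionStepModel {

    /** One modelled term: the value with its lowest `shift` bits removed. */
    static final class ModelTerm {
        final int shift;
        final int prefix;

        ModelTerm(int shift, int prefix) {
            this.shift = shift;
            this.prefix = prefix;
        }

        @Override
        public String toString() {
            return "shift=" + shift + " prefix=" + prefix;
        }
    }

    /**
     * Lists the modelled terms for `value`. Shift 0 is the full-precision term;
     * every further term drops another `precisionStep` low bits.
     */
    static List<ModelTerm> termsFor(int value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<ModelTerm> terms = new ArrayList<>();
        // Use a long counter so a huge step such as Integer.MAX_VALUE cannot overflow.
        for (long shift = 0; shift < 32; shift += precisionStep) {
            terms.add(new ModelTerm((int) shift, value >>> (int) shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(termsFor(12, 4).size());          // prints 8
        System.out.println(termsFor(12, Integer.MAX_VALUE)); // prints [shift=0 prefix=12]
    }
}
```

In this model a precisionStep of 4 yields 8 terms per int (shifts 0, 4, ..., 28), while any step of 32 or more leaves only the full-precision term with shift 0. Only those shift-0 terms correspond to the distinct values of the field.
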
As your field (hours) is of low cardinality, you can index it with precisionStep=Integer.MAX_VALUE. Range queries will then be no faster than with a normal TermRangeQuery, but your term enum will work. You only have to use NumericUtils.prefixCodedToInt() to decode each term into an int:

    hours.add(Integer.valueOf(NumericUtils.prefixCodedToInt(term.text())));

This code would also work for other precision steps, but you would then get some additional "lower precision terms" (values with some of the lower bits removed). In that case you have to break the iteration when you reach them (see the FieldCache code).

> Any pointers?
>
> This is the code I was using previously for plain Fields.
>
> ArrayList hours = new ArrayList();
> TermEnum termEnum = reader.terms( new Term( "hour", "" ) );
> Term term = null;
> while ( ( term = termEnum.term() ) != null ) {
>
>     if ( ! term.field().equals( "hour" ) )
>         break;
>
>     hours.add( (Integer)term.text() );
>     termEnum.next();
> }
>
> Thanks,
> Phil
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org