lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Ordering of terms in TermsEnum
Date Wed, 22 May 2013 16:46:55 GMT
On Wed, May 22, 2013 at 11:28 AM, Brendan Grainger
<brendan.grainger@gmail.com> wrote:
> Hi All,
>
> Sorry if this is a stupid question, but I'm still catching up with some of
> the new APIs and I want to make sure my assumptions are correct.
>
> Anyway, I'm using the Solr PathHierarchyTokenizer to create a number of paths,
> e.g. for a book object, say, with a category field of /compsci/search/lucene
> the PathHierarchyTokenizer creates the following tokens and they are added
> to a multivalued field called 'categories':
>
> /compsci
> /compsci/search
> /compsci/search/lucene
>
> I then want to iterate over these categories using a TermsEnum. This is the
> relevant code:
>
>   Terms terms = fields.terms("categories");
>   if (terms == null) return null;
>   TermsEnum termsEnum = terms.iterator(null);
>
>   BytesRef text;
>   while ((text = termsEnum.next()) != null) {
>      System.out.println("field=categories; text=" + text.utf8ToString());
>   }
>
> My question is, is it guaranteed that the order of the terms as they're
> enumerated will be
>
> /compsci
> /compsci/search
> /compsci/search/lucene
>
> and if in another document I added /compsci/graphics/3d then the terms as
> I enumerate them would be:
>
> /compsci
> /compsci/graphics
> /compsci/graphics/3d
> /compsci/search
> /compsci/search/lucene

The short answer is "yes".

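Longer answer: terms are sorted according to the codec's
TermsConsumer.getComparator(), but all codecs I know of just use the
Unicode comparator (BytesRef.getUTF8SortedAsUnicodeComparator), which
compares the UTF-8 bytes of each term as unsigned values. That is the
same as Unicode code point order.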
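To make the ordering concrete, here is a minimal sketch in plain Java
that does not use Lucene at all. It sorts the example paths by unsigned
UTF-8 byte order, which is what that comparator does. The class and
method names (TermOrderSketch, sortedTerms) are made up for
illustration and are not part of any Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: mimics unsigned UTF-8 byte ordering
// without depending on Lucene.
public class TermOrderSketch {

    // Compare two strings by their UTF-8 bytes, treating each byte as unsigned.
    static final Comparator<String> UTF8_BYTE_ORDER = (a, b) -> {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // One term is a prefix of the other: the shorter one sorts first.
        return x.length - y.length;
    };

    // Return a sorted copy of the given terms, leaving the input untouched.
    static List<String> sortedTerms(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(UTF8_BYTE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "/compsci/search/lucene",
            "/compsci/graphics/3d",
            "/compsci",
            "/compsci/search",
            "/compsci/graphics");
        for (String term : sortedTerms(terms)) {
            System.out.println(term);
        }
    }
}
```

A parent path always sorts before its children because a term that is
a prefix of another term comes first. One thing to check against your
own data: a sibling such as /compsci-old (a made-up example) would land
between /compsci and /compsci/graphics, since '-' (0x2D) sorts before
'/' (0x2F).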

Mike McCandless

http://blog.mikemccandless.com
