lucene-java-user mailing list archives

From "Ryan McKinley" <ryan...@gmail.com>
Subject Re: distinct term values?
Date Thu, 05 Apr 2007 02:49:04 GMT
TermEnum works like a charm, no need to optimize (yet).

Enjoy the Merlot!


On 4/4/07, Erick Erickson <erickerickson@gmail.com> wrote:
> Sorry if this is a double post, but my last attempt failed......
>
>
> Not that I know of, but I think you'll be surprised how fast TermEnum
> will walk the list of terms.
>
> I think you misunderstand TermEnum. It will NOT enumerate a term
> twice, so there's no need for a hash, just a simple increment of a
> counter will do.
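A minimal sketch of the counting idea, not actual Lucene code: it assumes the terms arrive sorted by field and then by text, with each (field, text) pair appearing once, which is how a TermEnum presents them. The class `DistinctTermCount` and its `FieldTerm` pair are hypothetical stand-ins. In Lucene 2.x the equivalent loop would presumably start from `IndexReader.terms(new Term(field, ""))` and call `TermEnum.next()` until the field changes.

```java
import java.util.List;

final class DistinctTermCount {
    // Hypothetical stand-in for a Lucene term: a (field, text) pair.
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    // Counts the terms of one field in a list sorted by (field, text).
    // Each term appears only once, so a plain counter is enough; no hash needed.
    static int countDistinctTerms(List<FieldTerm> sortedTerms, String field) {
        int count = 0;
        boolean inField = false;
        for (FieldTerm term : sortedTerms) {
            if (term.field.equals(field)) {
                inField = true;
                count++;
            } else if (inField) {
                // Terms are grouped by field, so we can stop once we leave it.
                break;
            }
        }
        return count;
    }
}
```

With the terms name:A, name:B for the field "name", `countDistinctTerms(terms, "name")` gives 2, matching the `getDistinctTermValues( "name" ) -> 2` example in the original question.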
>
> And if you're really concerned about efficiency, you could always create
> this list for all fields at warm-up or even index time and compute it once.
> Assuming that you don't have many thousands of fields......
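The "compute it once at warm-up" idea could look roughly like this. Again a sketch with hypothetical names (`FieldTermCounts`, `countPerField`), where each term is modelled as a two-element array {field, text} rather than a real Lucene Term. One pass over the full term list fills a map from field name to distinct-term count, which can then be kept in memory for later lookups.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldTermCounts {
    // One pass over all unique terms; each entry is {field, text}.
    // The map is keyed by field name only, so it stays small as long as
    // the index does not have many thousands of fields.
    static Map<String, Integer> countPerField(List<String[]> uniqueTerms) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] term : uniqueTerms) {
            counts.merge(term[0], 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting map would need to be rebuilt whenever the index changes, so this only pays off if the counts are read far more often than the index is updated.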
>
> I'd also mention that nothing requires every document to have the
> same fields. So, at index creation time you could easily create one
> special document whose fields appear in no other document, and
> populate it with the summary information for term counts. Imagine the
> following....
>
> Almost all of your documents have fields
> a, b, c, d, e, f.
>
>
> Your special meta-data document has a field z which contains
> an XML-formatted value that looks something like...
> <meta>
>   <countfielda>32</countfielda>
>   <countfieldb>901</countfieldb>
>   <countfieldc>14052</countfieldc>
>      .
>      .
>   <countfieldf>403</countfieldf>
> </meta>
>
> and a field y that only contains "superspecialmetadocumentdata".
>
> Now, at startup, you perform a simple search on
> y:"superspecialmetadocumentdata", get the single document,
> read the "z" field and store away all the counts for later lookup.
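The start-up step of reading the "z" field and storing the counts might be sketched as follows, using only the JDK's XML parser. The class `MetaCounts` is hypothetical, and the code assumes the stored value has exactly the `<meta><countfieldX>n</countfieldX>...</meta>` shape shown above. Fetching the document from the index is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MetaCounts {
    private static final String PREFIX = "count";

    // Parses the XML stored in the meta document's "z" field into a
    // map such as {"fielda" -> 32, "fieldb" -> 901}.
    static Map<String, Integer> parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            Map<String, Integer> counts = new LinkedHashMap<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes between elements
                }
                String tag = child.getNodeName();
                if (tag.startsWith(PREFIX)) {
                    counts.put(tag.substring(PREFIX.length()),
                               Integer.parseInt(child.getTextContent().trim()));
                }
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed meta XML", e);
        }
    }
}
```

After this runs once at startup, a lookup is a single `counts.get("fielda")` on an in-memory map.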
>
> Now, your lookup time is very, very short.
>
> But I wouldn't do any of this until I was sure that the performance
> of the simple approach of walking the TermEnum was too
> expensive. Premature optimization and all that......
>
>
> If this is clear as mud, I'll add details, but I'm 1/2 the way through a
> bottle of Merlot <G>....
>
> Best
> Erick
>
>
> On 4/4/07, Ryan McKinley <ryantxu@gmail.com> wrote:
> >
> > Is there an efficient way to know how many distinct terms there are
> > for a given field name?
> >
> > I know I can walk through a TermEnum and put them into a hash, but it
> > would be useful to know beforehand if you are going to get 4 distinct
> > values or 40,000
> >
> > I don't need to know what the terms are, just how many.
> >
> > name:A
> > name:A
> > name:B
> >
> > I want a function like:
> >   getDistinctTermValues( "name" ) -> 2
> >
> > Thanks
> > ryan
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
