lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: distinct term values?
Date Thu, 05 Apr 2007 01:01:22 GMT
Sorry if this is a double post, but my last attempt failed......


Not that I know of, but I think you'll be surprised how fast TermEnum
will walk the list of terms.

I think you misunderstand TermEnum. It will NOT enumerate a term
twice: in your example, name:A is a single term no matter how many
documents contain it. So there's no need for a hash; simply
incrementing a counter will do.
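
Here is a rough sketch of that counting loop in plain Java, with no
Lucene on the classpath. The sorted list stands in for the term
dictionary, and the Term class and countDistinct method are names made
up for illustration, not Lucene's API. With a real index you would
position a TermEnum at the first term of the field and stop as soon as
the field name changes, since terms are ordered by field and then text.

```java
import java.util.Arrays;
import java.util.List;

final class DistinctTermCount {

    /** Illustrative stand-in for a (field, text) term; not Lucene's Term class. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    /**
     * Counts the terms of one field in a list that is sorted by field,
     * then text, and holds each (field, text) pair exactly once.
     */
    static int countDistinct(List<Term> sortedTerms, String field) {
        int count = 0;
        boolean inField = false;
        for (Term t : sortedTerms) {
            if (t.field.equals(field)) {
                inField = true;
                count++;   // every term appears once, so a counter is enough
            } else if (inField) {
                break;     // sorted order: we are past the field, so stop
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Documents with name:A, name:A, name:B yield only two "name" terms.
        List<Term> terms = Arrays.asList(
            new Term("name", "A"),
            new Term("name", "B"),
            new Term("title", "lucene"));
        System.out.println(countDistinct(terms, "name")); // prints 2
    }
}
```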

And if you're really concerned about efficiency, you could always create
this list for all fields at warm-up or even index time and compute it once.
Assuming that you don't have many thousands of fields......
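
To make the "compute it once" idea concrete, here is a minimal sketch
in plain Java. It assumes the terms arrive as unique {field, text}
pairs, which is how a term enumeration presents them; the
FieldTermCounts class and countAll method are hypothetical names, and
the array is a stand-in for walking the real enumeration at warm-up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldTermCounts {

    /**
     * One pass over unique {field, text} pairs, producing the number of
     * distinct terms per field. The result can be cached for later lookups.
     */
    static Map<String, Integer> countAll(String[][] terms) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] term : terms) {
            // term[0] is the field name; term[1] (the text) is not needed.
            counts.merge(term[0], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] terms = {
            {"name", "A"},
            {"name", "B"},
            {"title", "lucene"},
        };
        System.out.println(countAll(terms)); // prints {name=2, title=1}
    }
}
```

The map here is keyed by field name, not by term, so it stays small as
long as the number of fields does.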

I'd also mention that nothing requires every document to have the
same fields. So, at index creation time you could easily create one
special document whose fields are used by no other document, and
populate it with the summary information for term counts. Imagine the
following....

Almost all of your documents have fields
a, b, c, d, e, f.


Your special meta-data document has a field z that contains
an XML-formatted value that looks something like...
<meta>
  <countfielda>32</countfielda>
  <countfieldb>901</countfieldb>
  <countfieldc>14052</countfieldc>
     .
     .
  <countfieldf>403</countfieldf>
</meta>

and a field y that only contains "superspecialmetadocumentdata".

Now, at startup, you perform a simple search on
y:"superspecialmetadocumentdata", get the single document,
read the "z" field and store away all the counts for later lookup.
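
As a rough sketch of the "read the z field and store away the counts"
step, the following plain Java parses a string in the format above
into a map. The Lucene search that fetches the document is not shown;
the MetaCounts class and parse method are made-up names, and the
sample XML contains only the four counts that were written out above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class MetaCounts {

    private static final String PREFIX = "countfield";

    /** Turns <countfieldX>n</countfieldX> children of the root into X -> n. */
    static Map<String, Integer> parse(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse meta XML", e);
        }
        Map<String, Integer> counts = new HashMap<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text between elements
            }
            String tag = node.getNodeName();
            if (tag.startsWith(PREFIX)) {
                String field = tag.substring(PREFIX.length());
                counts.put(field, Integer.parseInt(node.getTextContent().trim()));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String z = "<meta>"
            + "<countfielda>32</countfielda>"
            + "<countfieldb>901</countfieldb>"
            + "<countfieldc>14052</countfieldc>"
            + "<countfieldf>403</countfieldf>"
            + "</meta>";
        Map<String, Integer> counts = parse(z);
        System.out.println(counts.get("c")); // prints 14052
    }
}
```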

After that, each lookup is just a read of a count you already hold in
memory, so your lookup time is very, very short.

But I wouldn't do any of this until I was sure that the simple
approach of walking the TermEnum was too slow. Premature
optimization and all that......


If this is clear as mud, I'll add details, but I'm 1/2 the way through a
bottle of Merlot <G>....

Best
Erick


On 4/4/07, Ryan McKinley <ryantxu@gmail.com> wrote:
>
> Is there an efficient way to know how many distinct terms there are
> for a given field name?
>
> I know I can walk through a TermEnum and put them into a hash, but it
> would be useful to know beforehand if you are going to get 4 distinct
> values or 40,000
>
> I don't need to know what the terms are, just how many.
>
> name:A
> name:A
> name:B
>
> I want a function like:
>   getDistinctTermValues( "name" ) -> 2
>
> Thanks
> ryan
