lucene-dev mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: TermCount per field
Date Tue, 22 Sep 2009 00:11:25 GMT
Thanks Michael!

Makes a lot of sense to me to wait for LUCENE-1458 then. Should I create an
issue with a dependency on 1458?

One application for this is within FieldCache construction of StringIndex:

If we know the number of terms is small, an orderArray that uses an int per
doc is wasteful. In the case where we have 10 terms but 100M docs for a given
field, the orderArray would take up 400MB, whereas half a byte per doc is
sufficient, which means 50MB is enough. (Keep in mind this is per field!)
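
To make the arithmetic concrete, here is a rough sketch of a packed order
array. PackedOrdinals and its methods are hypothetical names used only for
illustration, not existing Lucene classes, and the sketch assumes ordinal 0 is
reserved for docs that have no term in the field:

```java
/** Hypothetical helper (not a Lucene class): one small ordinal per doc, packed into a long[]. */
final class PackedOrdinals {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrdinals(int numDocs, int numTerms) {
        this.bitsPerValue = bitsRequired(numTerms);
        this.mask = (1L << bitsPerValue) - 1;
        this.blocks = new long[(int) (bytesNeeded(numDocs, numTerms) / 8)];
    }

    /** Smallest bit width that holds ordinals 0..numTerms (0 assumed to mean "no term"). */
    static int bitsRequired(int numTerms) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(numTerms));
    }

    /** Bytes used by the packed array, rounded up to whole 64-bit blocks. */
    static long bytesNeeded(int numDocs, int numTerms) {
        long totalBits = (long) numDocs * bitsRequired(numTerms);
        return ((totalBits + 63) >>> 6) * 8;
    }

    void set(int doc, int ord) {
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = ord & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) { // value straddles two longs: store its high bits in the next block
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | (value >>> (bitsPerValue - spill));
        }
    }

    int get(int doc) {
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) { // pull the high bits back from the next block
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (value & mask);
    }
}
```

With 10 terms, bitsRequired gives 4 bits, and bytesNeeded(100000000, 10) comes
out to 50,000,000 bytes, compared with 400,000,000 bytes for an int per doc.
The catch is that the bit width has to be known before the array is allocated.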

Doing such a memory optimization now requires iterating the term table twice
to get the number of terms, hence the motivation for this feature.

Thanks

-John

On Tue, Sep 22, 2009 at 2:17 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> MultiReaders can't quickly compute the exact term count.  Would they
> be allowed to throw UOE?  (Like IndexReader.getUniqueTermCount)
>
> TermsHashPerField.numPostings (not .numPostingsInt) tells you the #
> unique terms currently in IndexWriter's RAM buffer, so I think we
> could save that out with FieldInfo.  That seems reasonable?
>
> We could also compute it at search time, because the SegmentTermEnum
> knows its position.  Ie you could seek to first term of field X and
> then first term of field after X and subtract the positions.  But, the
> position is not exposed publicly now, and this'd be more costly to do
> (though we could cache & reuse the result).  It wouldn't involve
> changing the index format.
>
> With LUCENE-1458 this becomes simple (it already keeps track of each
> field's terms, separately, including total number of terms for that
> field).
>
> Mike
>
> On Mon, Sep 21, 2009 at 9:14 AM, John Wang <john.wang@gmail.com> wrote:
> > Hi guys:
> >      Not sure if this would be a better fit on the users or the dev list.
> >      It would be very useful to be able to get term count given a field,
> > e.g.
> >      int IndexReader.termCount(String field)
> >      Wanted to get your opinion on what is the best way to approach this.
> > After looking through the code, seems like we do have it stored
> > in TermsHashPerField.numPostingInt. (hopefully I am reading it correctly)
> >     Is it possible to add to the FieldInfo class and write it out?
> >
> > Thanks
> > -John
> >
> >
