lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Rutherglen" <jason.rutherg...@gmail.com>
Subject Re: yet again: getting the minimum and maximum value of a field
Date Wed, 25 Jun 2008 15:23:23 GMT
I looked heavily at this.  It requires a customization of TermInfosReader
whereby the tii (term dictionary) SegmentTermEnum is traversed looking for
the last term with a particular field.  Once found, from that position in
the tis SegmentTermEnum would need to be traversed again for the last term
with the desired field.  This would be the most efficient using the current
system.

In the Tag Index I am working on
https://issues.apache.org/jira/browse/LUCENE-1292, the max values would be
stored automatically at the end of the tis file.  They would be available
via a subclass of TermEnum that adds a method "public Term getMax(String
field)"

On Wed, Jun 25, 2008 at 5:43 AM, Christian Reuschling <
christian.reuschling@gmail.com> wrote:

> Hello people,
>
> yes, there were several threads about this topic, but I sadly have to
> respawn
> it, I'm sorry.
>
> The first I found was a discussion from May 2005:
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3cPine.LNX.4.58.0505302221330.28735@hal.rescomp.berkeley.edu%3e
>
> There the final solution suggestion from Hoss was to try it with a binary
> search
> on the TermEnum
>
> The second one was also from May 2005, it seems that it was a follow-up:
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3CPine.LNX.4.58.0505311145460.29003@hal.rescomp.berkeley.edu%3E
>
> http://markmail.org/message/rp4xfdclsha7h5uq#query:termenum%20lucene%20maximum%20value+page:1+mid:q3doh6tvyf6swl6h+state:results
>
> Here, the solution with the FieldCache was discussed, and also another
> direction
> with a RangeQuery which results in a TooMAnyClauses that can be avoided by
> setting the 'allowed clauses count' to a bigger number.
>
>
>
>
> I use the solution with the FieldCache, which worked fine for a long time,
> but
> when I use it with bigger fields with millions of entries, I have big peaks
> in
> memory consumption, which sometimes result in an OutOfMemory Error. I
> assume
> that with the range query, I will fall in the same Problem.
>
>
> I now want to try out the solution from Hoss with the binary search over
> the
> TermEnum, but it is not clear for me how to perform this.
>
> The only methods in TermEnum are
>
>  public abstract boolean next()
>  public abstract Term term();
>  public abstract int docFreq();
>  public abstract void close()
>  public boolean skipTo(Term target)
>
> Whereby skipTo "Skips terms to the first beyond the current whose value is
> greater or equal to 'target'. Returns true iff there is such an entry."
>
> How to avoid to perfom the 'big loop with next' until I am at the last
> entry,
> like the current implementation of skipTo:
>
>     do {
>        if (!next())
>                return false;
>     } while (target.compareTo(term()) > 0);
>     return true;
>
> Whereby target() would be the over biggest value we could think about, and
> I
> remember the term bevore the method returns false.
>
> Because of the tree-like architecture of the index, where the letters are
> some
> kind of nodes, e.g.
>
>     a               z
>   ab   ar         ze  zu
>  abi     ark     zer    zul
>
> I would assume that there is a fast possibility to determine that 'abi' is
> the
> minimum and 'zul' the maximum for that field - by simply walking through
> the
> tree 'left - or rightwise' (when I only get the left node, I will walk to
> the
> minimum, when I only get the right node through walking, I will get the
> maximum)
>
> But this is a theoretical view. Enables the Lucene implementation
> walkthroughs
> like this? At least the RangeQuery implementation I would assume walks
> throgh
> the tree.
>
>
> Thanks for all answers!
>
> kindly regards
>
> Chris
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message