lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christian Reuschling <christian.reuschl...@gmail.com>
Subject yet again: getting the minimum and maximum value of a field
Date Wed, 25 Jun 2008 15:05:08 GMT
Hello people,


I'm sorry if I have send this message twice - my gmail interface merges the
mails in the 'send' folder with incoming mails from my adress - strange, but
I can't say if the mail was sent - I only see it in the send-folder (with
only one label on it, which brings me to send it again :( ) Okay.



yes, there were several threads about this topic, but I sadly have to respawn
it, I'm sorry.

The first I found was a discussion from May 2005:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3cPine.LNX.4.58.0505302221330.28735@hal.rescomp.berkeley.edu%3e

There the final solution suggestion from Hoss was to try it with a binary search
on the TermEnum

The second one was also from May 2005, it seems that it was a follow-up:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3CPine.LNX.4.58.0505311145460.29003@hal.rescomp.berkeley.edu%3E
http://markmail.org/message/rp4xfdclsha7h5uq#query:termenum%20lucene%20maximum%20value+page:1+mid:q3doh6tvyf6swl6h+state:results

Here, the solution with the FieldCache was discussed, and also another direction
with a RangeQuery which results in a TooMAnyClauses that can be avoided by
setting the 'allowed clauses count' to a bigger number.




I use the solution with the FieldCache, which worked fine for a long time, but
when I use it with bigger fields with millions of entries, I have big peaks in
memory consumption, which sometimes result in an OutOfMemory Error. I assume
that with the range query, I will fall in the same Problem.


I now want to try out the solution from Hoss with the binary search over the
TermEnum, but it is not clear for me how to perform this.

The only methods in TermEnum are

  public abstract boolean next()
  public abstract Term term();
  public abstract int docFreq();
  public abstract void close()
  public boolean skipTo(Term target)

Whereby skipTo "Skips terms to the first beyond the current whose value is
greater or equal to 'target'. Returns true iff there is such an entry."

How to avoid to perfom the 'big loop with next' until I am at the last entry,
like the current implementation of skipTo:

      do {
         if (!next())
               return false;
      } while (target.compareTo(term()) > 0);
      return true;

Whereby target() would be the over biggest value we could think about, and I
remember the term bevore the method returns false.

Because of the tree-like architecture of the index, where the letters are some
kind of nodes, e.g.

      a               z
    ab   ar         ze  zu
  abi     ark     zer    zul

I would assume that there is a fast possibility to determine that 'abi' is the
minimum and 'zul' the maximum for that field - by simply walking through the
tree 'left - or rightwise' (when I only get the left node, I will walk to the
minimum, when I only get the right node through walking, I will get the maximum)

But this is a theoretical view. Enables the Lucene implementation walkthroughs
like this? At least the RangeQuery implementation I would assume walks throgh
the tree.


Thanks for all answers!

kindly regards

Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message