Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <403E39F4.20403@apache.org>
Date: Thu, 26 Feb 2004 10:24:52 -0800
From: Doug Cutting <cutting@apache.org>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116
MIME-Version: 1.0
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: Re: Iterating TermEnum backwards
References: <00fd01c3fc24$d194adb0$ff09050a@dscp02272>
 <403D9567.5010705@ctx.com.au>
In-Reply-To: <403D9567.5010705@ctx.com.au>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Matt Quail wrote:
> Is there any way to iterate through a TermEnum backwards? Okay, I know
> that there isn't a way to do this via the TermEnum class, but is it
> "implementable" on top of the underlying Lucene datastore?

Not really.  The best you can do is skip back to the previous "indexed" 
term in TermInfosReader.indexTerms, and scan forward from there.  You 
could try adding a method to that class like:

   final synchronized void seekBefore(Term term) throws IOException {
     int offset = getIndexOffset(term);
     seekEnum(offset > 0 ? offset - 1 : offset);
   }

Then you'd need to add stuff to MultiReader, SegmentReader and 
IndexReader, to take advantage of this.  It could get a little tricky, 
but it is possible.  I'm not convinced this is your best route.

> My particular problem is this:
> 
> I have an index of documents, each document has a "date" field (I'm
> using DateField). Most documents have a different date, so the number of
> unique dates is close to the number of documents.

Are you adding documents in date order?  If so, then you could look at 
the date of the document numbered maxDoc() - N and scan forward from 
there.  To be safe, you could start at maxDoc() - N*2 or something.

> I want to find the top N most recent dates, but I don't want to have to
> iterate through ALL of them first. NB: With DateField, the earlier dates
> are lexocographically smaller. (I also want to find the most recent N
> less than some date D).
> 
> I know I could "invert" my dates (something like MAX_LONG - date) to get
> the REVERSE order, but I want to be able to do "least recent" and "most
> recent".

Why not have two date fields, one inverted and one not?

> PS: my current solution is to do a binary search between MIN and MAX,
> halving my search space until I find close to N matching documents.

That doesn't sound like a bad solution.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org