lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Grouping on multiple shards possible in lucene?
Date Tue, 20 Nov 2012 20:29:31 GMT
Hi Ravi,

I've been dealing with reverse indexing lately, so let me share with you a
bit of my experience thus far.

First, you need to define what reverse indexing means for you. If it
means that docs indexed in the order d1, d2, d3 should be traversed during
search in the order d3, d2, d1, then that's one thing.
However, if it means that traversal needs to follow e.g. the documents'
timestamps, so that documents are processed from latest to oldest, then
that's a totally different thing, and way more complicated.

You will need to think about an IndexReader which reverses the order of the
segments that it reads, so that segments are processed from latest to
oldest. Also, you might need to merge the segments in reverse order too
(i.e. if segments s1, s4, s5 are merged, merge them as s5, s4, s1).
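
As a rough illustration of the segment-order idea, here is a minimal sketch
in plain Java. It does not use any Lucene API: each inner list is a
hypothetical stand-in for one segment holding its documents in indexing
order, and the class and method names are made up for this example. Under
that assumption it shows that reversing only the segment order gives a
coarse latest-to-oldest traversal, while a strict d3, d2, d1 order also
needs each segment to be walked backwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene API: every inner list stands for one
// segment and holds its documents in indexing (forward docID) order.
final class ReverseSegmentTraversal {

    // Visit segments newest-first; documents inside a segment keep
    // their forward order.
    static List<String> segmentsReversed(List<List<String>> segments) {
        List<String> out = new ArrayList<>();
        for (int s = segments.size() - 1; s >= 0; s--) {
            out.addAll(segments.get(s));
        }
        return out;
    }

    // Visit segments newest-first and also walk each segment backwards.
    static List<String> fullyReversed(List<List<String>> segments) {
        List<String> out = new ArrayList<>();
        for (int s = segments.size() - 1; s >= 0; s--) {
            List<String> segment = segments.get(s);
            for (int d = segment.size() - 1; d >= 0; d--) {
                out.add(segment.get(d));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two segments: the older one holds d1 and d2, the newer one d3.
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("d1", "d2"),
                Arrays.asList("d3"));
        System.out.println(segmentsReversed(segments)); // [d3, d1, d2]
        System.out.println(fullyReversed(segments));    // [d3, d2, d1]
    }
}
```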

If you are interested in timestamp based sorting, it gets complicated.
Documents flow in from multiple producers (e.g. a parallel crawler,
different processes which feed documents to the index, etc.) and are
usually processed by multiple consumers (indexing threads). That makes
sorting the index based on a timestamp difficult.

Lucene used to have IndexSorter (before 4.0) which could sort an index by a
field. That was an offline process and if that's what you're after -- you
should do just that and forget about the rest. If however you're interested
in an on-line process, where documents are fed in some order and searched
in the exact true order (latest to oldest), that's a more complicated
solution -- I'm still working on it :).
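
To make the offline variant concrete, here is a minimal sketch in plain
Java of the core step such a sort needs: computing a map from old docIDs to
new docIDs so that the newest document gets docID 0. This is not
IndexSorter's code or API; the class name, the method name and the
long-valued timestamp array are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Lucene: timestamps[i] is assumed to be
// the timestamp of the document whose current docID is i.
final class OfflineTimestampSort {

    // Returns oldToNew, where oldToNew[oldDocId] is the docID the document
    // would get once all documents are ordered newest-first.
    static int[] newestFirstDocMap(long[] timestamps) {
        Integer[] order = new Integer[timestamps.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sorting an object array is stable, so documents with equal
        // timestamps keep their original relative order.
        Arrays.sort(order,
                Comparator.comparingLong((Integer i) -> timestamps[i]).reversed());

        int[] oldToNew = new int[timestamps.length];
        for (int newId = 0; newId < order.length; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        long[] timestamps = {100L, 300L, 200L};
        // Old doc 1 is newest, then old doc 2, then old doc 0.
        System.out.println(Arrays.toString(newestFirstDocMap(timestamps))); // [2, 0, 1]
    }
}
```

The sketch needs every timestamp up front, which is why it only fits the
offline case; it says nothing about documents that arrive afterwards.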

HTH

Shai

On Tue, Nov 20, 2012 at 5:37 PM, Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> But, I think it should be possible with some fun codec & merge policy
> & MultiReader magic, to have docIDs assigned in "reverse chronological
> order"
>
> Can you explain it a bit more? I was thinking perhaps we store absolute
> doc-ids instead of delta to do reverse traversal. But this could waste a
> lot of storage
>
> The default merge policy will merge adjacent segments no? Is it going to
> disturb the ordering?
>
> --
> Ravi
>
> On Tue, Nov 20, 2012 at 5:19 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
> > On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan
> > <ravikumar.govindarajan@gmail.com> wrote:
> > > Thanks Mike. Actually, I think I can eliminate sort-by-time, if I am
> > > able to iterate postings in reverse doc-id order. Is this possible in
> > > lucene?
> >
> > Alas that is not easy to do in Lucene: the posting lists are encoded
> > in forward docID order.
> >
> > But, I think it should be possible with some fun codec & merge policy
> > & MultiReader magic, to have docIDs assigned in "reverse chronological
> > order" ...
> >
> > > Also, for a TopN query sorted by doc-id will the query terminate early?
> >
> > Actually, it won't!  But it really should ... you could make a
> > Collector that throws an exception once the N docs have been
> > collected?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
>
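
Mike's suggestion quoted above (a collector that throws an exception once N
docs have been collected) can be sketched in plain Java as follows. This is
a simplified model and not Lucene's Collector API: the DocCollector
interface, the EnoughDocsException class and the search loop over an int
array of docIDs are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyTermination {

    // Hypothetical callback standing in for a collector; not Lucene's class.
    interface DocCollector {
        void collect(int docId);
    }

    // Control-flow signal; suppression and stack trace are disabled because
    // the exception is expected and carries no diagnostic value.
    static final class EnoughDocsException extends RuntimeException {
        EnoughDocsException() {
            super("collected enough docs", null, false, false);
        }
    }

    // Keeps the first n docIDs it sees, then aborts the search by throwing.
    static final class FirstNCollector implements DocCollector {
        private final int limit;
        private final List<Integer> docs = new ArrayList<>();

        FirstNCollector(int limit) {
            this.limit = limit;
        }

        @Override
        public void collect(int docId) {
            if (docs.size() < limit) {
                docs.add(docId);
            }
            if (docs.size() >= limit) {
                throw new EnoughDocsException();
            }
        }

        List<Integer> collected() {
            return docs;
        }
    }

    // Feeds docIDs to the collector in forward order and returns how many
    // were visited before the collector asked to stop.
    static int search(int[] postings, DocCollector collector) {
        int visited = 0;
        try {
            for (int docId : postings) {
                visited++;
                collector.collect(docId);
            }
        } catch (EnoughDocsException expected) {
            // The collector has its N docs; stop without scanning the rest.
        }
        return visited;
    }

    public static void main(String[] args) {
        FirstNCollector collector = new FirstNCollector(3);
        int visited = search(new int[] {2, 5, 7, 11, 13}, collector);
        System.out.println(collector.collected()); // [2, 5, 7]
        System.out.println(visited);               // 3
    }
}
```

The caller has to catch the exception around the search call, as the loop
does here; otherwise the early exit would surface as a failure.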
