lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shane O'Sullivan <shaneosulliv...@gmail.com>
Subject Document-at-a-time or Term-at-a-time searching
Date Mon, 10 Oct 2005 10:40:37 GMT
Hi all,

Does Lucene use Document and a time (DAAT), or Term at a time (TAAT)
searching?

To explain what I'm asking, let's look at a three-term query T1 T2 T3
In term-at-a-time (TAAT) evaluation, the engine would typically start by
scanning the entire postings list of T1, and would keep some score
accumulator for all documents that contain T1. The accumulator keeps the
score component contributed to that document by the occurrences of T1.
Then, the engine would scan the postings list for T2. Any document
containing T2 that doesn't have an accumulator from the pass over T1 can be
ignored. The accumulators for documents that also contain T2 are updated to
reflect the additional score that T2 contributes to them. Accumulators for
documents that do not contain T2 can be discarded. The same process repeats
for the scanning of T3's postings list, and at the end the set of
accumulators contains the scores of all result documents. The accumulators
are sorted, and the results are returned. Note that prior to scanning T3's
postings list, the engine could not be sure of any document that qualifies
as a candidate.

In document-at-a-time (DAAT) evaluation, the postings lists of all three
terms are traversed in parallel. Typically, the engine would advance the
iterator over T1 to some document containing T1. Then, it would move the
iterator over T2 to the first document containing T2 whose docID equals to
or is greater than the docID pointed to by T1. If the T2 iterator passes the
T1 iterator, T1 is then advanced to at least the position pointed to by T2,
and so forth.
Assume that eventually one finds a document containing both T1 and T2. Then,
the iterator over T3 is advanced to at least that position. If it lands on
the same document containing the first two terms, that document can be fully
scored and entered into a heap (it surely qualifies as a candidate). T1 is
then advanced again. However, if T3 jumped over the document containing T1
and T2, the engine tries to find a new alignment by advancing T1 to a
position at or beyond that of T3. At the end of the evaluation, the heap is
emptied (either partially or fully)..

So, long story short, does anyone know which of these Lucene uses? I'm
looking into how easy it would be to implement a streaming search in Lucene
(where a document is returned as it is found, instead of returning a list of
doc ids and iterating over them), and Document-At-A-Time processing would be
required for this. Does anyone know if this has already been implemented, or
is planned for an upcoming release?

Thanks

Shane

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message