lucene-dev mailing list archives

From Lance Norskog
Subject Re: Adding another dimension to Lucene searches
Date Sat, 08 May 2010 23:32:50 GMT
I know of two separate problems in indexing parts of PDFs in an
overlapping way:

1) Block-structured documents, where the indexed units are
   a) the entire PDF file
   b) chapters
   c) sections of chapters
2) Tracking the set of pages that each document contains.

As I understand it, LUCENE-2324 handles the first case but not the
second. Is that right?
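
To make the two problems concrete, here is a minimal sketch in plain Java.
It does not use any Lucene API; the class PdfUnits, the Unit type, and the
sample titles and page numbers are all hypothetical. Each overlapping unit
(file, chapter, section) is modeled as its own document carrying a page
range, so a page lookup returns every unit that covers that page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not part of Lucene: overlapping units of one PDF.
class PdfUnits {
    // One indexed unit: the whole file, a chapter, or a section.
    static final class Unit {
        final String kind;
        final String title;
        final int firstPage; // inclusive
        final int lastPage;  // inclusive

        Unit(String kind, String title, int firstPage, int lastPage) {
            this.kind = kind;
            this.title = title;
            this.firstPage = firstPage;
            this.lastPage = lastPage;
        }

        boolean containsPage(int page) {
            return page >= firstPage && page <= lastPage;
        }
    }

    // Made-up sample data: one file, one chapter, two sections sharing page 12.
    static List<Unit> sample() {
        return Arrays.asList(
            new Unit("file", "book.pdf", 1, 100),
            new Unit("chapter", "Chapter 1", 1, 40),
            new Unit("section", "Section 1.1", 1, 12),
            new Unit("section", "Section 1.2", 12, 40));
    }

    // Problem 2: find every unit whose page range covers the given page.
    static List<Unit> unitsOnPage(List<Unit> units, int page) {
        List<Unit> hits = new ArrayList<>();
        for (Unit u : units) {
            if (u.containsPage(page)) {
                hits.add(u);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (Unit u : unitsOnPage(sample(), 12)) {
            System.out.println(u.kind + ": " + u.title);
        }
    }
}
```

In this toy data, page 12 falls inside the file, the chapter, and both
sections, which is the overlap that makes page tracking a separate problem
from the block structure itself.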

On Sat, May 8, 2010 at 10:37 AM, Michael Busch wrote:
> On 5/8/10 3:10 AM, Mark Harwood wrote:
>> The downside is the need to maintain sequences of related docs in the same
>> segment - something Lucene currently doesn't make easy with its limited
>> control over when segments are flushed. I suspect we'll need some discussion
>> on how best to support this.
> LUCENE-2324 should help to make this work even when you add documents with
> multiple threads.  There will be one DocumentsWriter per thread (DWPT), and
> the different DWPTs will write to their own segments.  We will also have an
> extension point to control thread binding.  Then you can make sure that all
> parts of your compound document end up sequentially in the same segment.
> One thing we have to make sure though is that a DWPT doesn't flush "between"
> different parts of your compound doc.  Hmm, we might have to add a "flush
> policy" to our growing family of policies.
>  Michael
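
The "flush policy" idea quoted above can be sketched roughly as follows.
This is plain Java under stated assumptions, not Lucene's DocumentsWriter
or any existing Lucene interface: BlockBuffer, addBlock, and the
maxBufferedDocs threshold are hypothetical names. The only point shown is
that the flush check runs at compound-document boundaries, never between
the parts of one compound document. Thread binding is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one per-thread writer; not a Lucene class.
class BlockBuffer {
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> segments = new ArrayList<>();

    BlockBuffer(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Buffers all parts of one compound document, then checks the threshold.
    // Because the check happens only after the whole block is buffered,
    // a block is never split across two segments.
    void addBlock(List<String> parts) {
        buffer.addAll(parts);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    // Writes the buffered docs out as one "segment" (here just a list).
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    List<List<String>> segments() {
        return segments;
    }

    public static void main(String[] args) {
        BlockBuffer writer = new BlockBuffer(4);
        writer.addBlock(Arrays.asList("pdfA", "pdfA/ch1", "pdfA/ch1/s1"));
        writer.addBlock(Arrays.asList("pdfB", "pdfB/ch1"));
        writer.addBlock(Arrays.asList("pdfC"));
        writer.flush();
        System.out.println(writer.segments());
    }
}
```

The trade-off in this sketch is that a segment can overshoot the threshold
by up to one block, since the buffer is allowed to finish the current
compound document before flushing.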

Lance Norskog
