Date: Sat, 8 May 2010 16:32:50 -0700
From: Lance Norskog
To: dev@lucene.apache.org
Subject: Re: Adding another dimension to Lucene searches
In-Reply-To: <4BE5A14C.6040108@gmail.com>

There are two separate problems that I know of in indexing parts of
PDFs in an overlapping way:

1) Block-structured documents of:
   a) the entire PDF file
   b) chapters
   c) sections of chapters
   d.....z)
2) Tracking the set of pages that each document contains.

As I understand this, LUCENE-2324 handles the first case but not the
second. True?

On Sat, May 8, 2010 at 10:37 AM, Michael Busch wrote:
> On 5/8/10 3:10 AM, Mark Harwood wrote:
>>
>> The downside is the need to maintain sequences of related docs in the same
>> segment - something Lucene currently doesn't make easy with its limited
>> control over when segments are flushed. I suspect we'll need some discussion
>> on how best to support this.
>>
>
> LUCENE-2324 should help to make this work even when you add documents with
> multiple threads. There will be one DocumentsWriter per thread (DWPT), and
> the different DWPTs will write to their own segments. We will also have an
> extension point to control thread binding.
> Then you can make sure that all
> parts of your compound document end up sequentially in the same segment.
>
> One thing we have to make sure though is that a DWPT doesn't flush "between"
> different parts of your compound doc. Hmm, we might have to add a "flush
> policy" to our growing family of policies.
>
> Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

--
Lance Norskog
goksron@gmail.com
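
To make the "don't flush between parts" concern in the quoted message concrete, here is a minimal sketch of a buffer that only flushes at compound-document boundaries. It is a toy model, not Lucene code and not the LUCENE-2324 design: the class CompoundAwareWriter, its methods, and the maxBufferedDocs threshold are all hypothetical names made up for illustration, documents are plain strings, and a "segment" is just a list. It models a single writer, so thread binding is out of scope.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of one per-thread writer. NOT Lucene's API: every name here is
 * hypothetical and only illustrates deferring a flush until the current
 * compound document is complete.
 */
final class CompoundAwareWriter {
    private final int maxBufferedDocs;                          // assumed flush threshold
    private final List<String> buffer = new ArrayList<>();      // docs not yet flushed
    private final List<List<String>> segments = new ArrayList<>();
    private boolean insideCompound = false;

    CompoundAwareWriter(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers all parts of one compound document as an unbroken run. */
    void addCompound(List<String> parts) {
        insideCompound = true;
        try {
            for (String part : parts) {
                buffer.add(part);
                maybeFlush(); // no-op here: the policy refuses to flush mid-compound
            }
        } finally {
            insideCompound = false;
        }
        maybeFlush(); // boundary reached, flushing is allowed again
    }

    /** The "flush policy": flush only at a boundary and only once over the threshold. */
    private void maybeFlush() {
        if (!insideCompound && buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    /** Writes the buffered docs out as one new segment. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    List<List<String>> segments() {
        return segments;
    }
}
```

With a threshold of 3, adding a four-part compound document yields one segment holding all four parts rather than a segment of three plus a leftover, which is the property the quoted message asks for. The cost is that a segment can exceed the threshold by up to one compound document.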