From: arun r
Date: Wed, 4 Aug 2010 10:04:52 -0400
Subject: Re: get wordno, lineno, pageno for term/phrase
To: java-user@lucene.apache.org

Thanks for your responses. In this case, retrieval time will be more
important than index size. Each document will be indexed separately, and
the data (wordno, lineno, pageno) will be extracted for certain
terms/phrases for each document and stored.

I define linebreak and pagebreak and add them to the text string.

    int pageBreakAscii = 12;
    String pageBreak = new Character((char) pageBreakAscii).toString();
    String lineBreak = System.getProperty("line.separator");

Thanks,
Arun

On Wed, Aug 4, 2010 at 9:25 AM, Erick Erickson wrote:
> It depends (TM). Yes, it would bloat the index. But nothing in the original
> post indicates that this is a concern. The index could be 10M or 100G, in
> one case it matters a lot and in the other it doesn't. It's also unclear
> whether query response time matters at all or whether this is some sort of
> batch process that can run overnight (or whatever).
>
> One could also store a very special field per document that contained all
> the meta-data one could care about. For instance, the offset of each line,
> page, paragraph, etc.
> That, combined with the offset data for the word, which is available via
> the span queries, could be what's needed.
>
> Re-scanning the input stream has its own costs as well, but perhaps they
> are the preferable ones; it all depends on the use case.
>
> It seems like it's always a space/speed tradeoff......
>
> Best
> Erick
>
> On Wed, Aug 4, 2010 at 4:41 AM, Itamar Syn-Hershko wrote:
>
>> Storing all that info per-token as payloads will bloat the index. Wouldn't
>> it be wiser to use a special token to mark page feed and end of paragraph
>> (numbers of which could then be stored as payloads), and scan the token
>> stream per document to retrieve them back? Some extra operations for
>> retrieval, but a much smaller index...
>>
>> Itamar.
>>
>>
>> On 3/8/2010 11:54 PM, Erick Erickson wrote:
>>
>>> No, you can't do this with any existing analyzers I know of. Part
>>> of the problem here is that there's no good generic way to KNOW
>>> what a page and line are.
>>>
>>> Have you investigated payloads? I'm not sure that's a good fit for
>>> this particular problem, but it might be worth investigating.
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Aug 3, 2010 at 10:58 AM, arun r wrote:
>>>
>>>> hi all,
>>>> I am new to Lucene. I am trying to use Lucene to generate
>>>> data for a document classifier. I need to generate wordno, lineno,
>>>> pageno for each term/phrase. I was able to use SpanQuery/SpanNearQuery
>>>> to get the wordno (span.start()) for the term/phrase. To get pageno
>>>> and lineno, does a custom Analyzer need to be written? Can the Analyzer
>>>> be made to recognize newline and page feed characters and keep
>>>> track of lineno and pageno for the tokens?
>>>>
>>>> Is it possible with an existing Lucene Analyzer?
>>>>
>>>> Thanks,
>>>> Arun
>>>>
>>>> --
>>>> Where there is a will, there is a way !
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org

-- 
Where there is a will, there is a way !
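
For illustration only, here is a rough sketch of the per-document
metadata idea discussed in this thread: record where each line and page
begins, then look up lineno/pageno for a word position such as the one
returned by span.start(). The class PositionIndex and its methods lineOf
and pageOf are hypothetical names, not part of Lucene. The sketch assumes
that word positions are 0-based and match a plain whitespace split of the
text, which will only hold if the Analyzer tokenizes on whitespace and
keeps every token. It also assumes that '\n' ends a line, that the form
feed character (12) starts a new page, and that line numbers run across
the whole document rather than restarting on each page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): maps a 0-based word position to 1-based line/page numbers. */
final class PositionIndex {
    // Form feed, the page break character used earlier in this thread.
    private static final char PAGE_BREAK = (char) 12;

    // Word position at which each line / page begins, in ascending order.
    private final List<Integer> lineStarts = new ArrayList<>();
    private final List<Integer> pageStarts = new ArrayList<>();

    PositionIndex(String text) {
        lineStarts.add(0); // line 1 starts at word 0
        pageStarts.add(0); // page 1 starts at word 0
        int wordCount = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                inWord = false;
                if (c == '\n') {
                    // The next word seen (if any) belongs to a new line.
                    lineStarts.add(wordCount);
                } else if (c == PAGE_BREAK) {
                    // The next word seen (if any) belongs to a new page.
                    pageStarts.add(wordCount);
                }
            } else if (!inWord) {
                // First character of a whitespace-delimited word.
                inWord = true;
                wordCount++;
            }
        }
    }

    /** Number of entries in starts that are <= wordPos (binary search, tolerates duplicates). */
    private static int countAtOrBefore(List<Integer> starts, int wordPos) {
        int lo = 0;
        int hi = starts.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (starts.get(mid) <= wordPos) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** 1-based line number of the word at the given 0-based position. */
    int lineOf(int wordPos) {
        return countAtOrBefore(lineStarts, wordPos);
    }

    /** 1-based page number of the word at the given 0-based position. */
    int pageOf(int wordPos) {
        return countAtOrBefore(pageStarts, wordPos);
    }
}
```

With the text "alpha beta\ngamma\n\ndelta\fepsilon zeta\n", word 3
("delta") maps to line 4, page 1, and word 4 ("epsilon") maps to line 4,
page 2. The two lists are small compared with per-token payloads, so they
could be kept in a stored field per document or rebuilt from the source
text at retrieval time, depending on which side of the space/speed
tradeoff matters more.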