From: Itamar Syn-Hershko
Date: Wed, 04 Aug 2010 11:41:50 +0300
To: java-user@lucene.apache.org
Subject: Re: get wordno, lineno, pageno for term/phrase
Message-ID: <4C5927CE.7090609@code972.com>

Storing all that info per-token as payloads will bloat the index.
Wouldn't it be wiser to use a special token to mark each page feed and end of paragraph (whose numbers could then be stored as payloads), and scan the token stream per document to retrieve them? That costs some extra operations at retrieval time, but gives a much smaller index.

Itamar.

On 3/8/2010 11:54 PM, Erick Erickson wrote:
> No, you can't do this with any existing analyzers I know of. Part
> of the problem here is that there's no good generic way to KNOW
> what a page and line are.
>
> Have you investigated payloads? I'm not sure that's a good fit for
> this particular problem, but it might be worth investigating.
>
> Best
> Erick
>
> On Tue, Aug 3, 2010 at 10:58 AM, arun r wrote:
>
>> hi all,
>> I am new to Lucene. I am trying to use Lucene to generate
>> data for a document classifier. I need to generate wordno, lineno,
>> pageno for each term/phrase. I was able to use SpanQuery/SpanNearQuery
>> to get the wordno (span.start()) for the term/phrase. To get pageno
>> and lineno, does a custom Analyzer need to be written? Can the Analyzer
>> be made to recognize newline and page feed characters and keep
>> track of lineno and pageno for the tokens?
>>
>> Is it possible with an existing Lucene Analyzer?
>>
>> Thanks,
>> Arun
>>
>> --
>> Where there is a will, there is a way !
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
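
The bookkeeping discussed in this thread (counting newline and page feed characters while walking the tokens of a document) can be sketched in plain Java. This is only an illustration and deliberately uses no Lucene classes: PositionTracker, Position and scan are hypothetical names, tokens are split on whitespace, wordNo is 0-based, and line numbers are assumed to restart on every page. A real custom tokenizer, or a scan over a stored token stream with marker tokens, would need the same counters.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch (not Lucene API): tracks word, line and page numbers while scanning text. */
final class PositionTracker {

    /** One token together with the counters that were current when it was read. */
    static final class Position {
        final String term;
        final int wordNo; // 0-based index of the token within the document
        final int lineNo; // 1-based, incremented on '\n'
        final int pageNo; // 1-based, incremented on form feed '\f'

        Position(String term, int wordNo, int lineNo, int pageNo) {
            this.term = term;
            this.wordNo = wordNo;
            this.lineNo = lineNo;
            this.pageNo = pageNo;
        }
    }

    /** Splits text on whitespace and records the position of every token. */
    static List<Position> scan(String text) {
        List<Position> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int wordNo = 0;
        int lineNo = 1;
        int pageNo = 1;
        for (int i = 0; i <= text.length(); i++) {
            // Treat end of input like a separator so the last token is flushed.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (!Character.isWhitespace(c)) {
                current.append(c);
                continue;
            }
            if (current.length() > 0) {
                out.add(new Position(current.toString(), wordNo++, lineNo, pageNo));
                current.setLength(0);
            }
            if (c == '\n') {
                lineNo++;
            } else if (c == '\f') {
                pageNo++;
                lineNo = 1; // assumption: line numbering restarts on each page
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Position p : scan("first line\nsecond line\fnew page")) {
            System.out.println(p.term + " word=" + p.wordNo
                    + " line=" + p.lineNo + " page=" + p.pageNo);
        }
    }
}
```

Keeping the counters outside the index, as here, matches the trade-off raised above: the numbers are recomputed by scanning one document at retrieval time instead of being stored with every token.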