From: Prakash Reddy Bande
To: "java-user@lucene.apache.org"
Date: Mon, 27 Feb 2012 14:42:45 -0500
Subject: RE: Customizing indexing of large files
Message-ID: <7232A3A7C53F634B8E91E65BC7833C511382B1918C@TR-EXCH07.prog.altair.com>
References:
 <7232A3A7C53F634B8E91E65BC7833C511382B1900F@TR-EXCH07.prog.altair.com>
 <7232A3A7C53F634B8E91E65BC7833C511382B19071@TR-EXCH07.prog.altair.com>
 <6C78E97C707B5B4C8CC61D44F87545860F1D5C@SUEX10-mbx-03.ad.syr.edu>
In-Reply-To: <6C78E97C707B5B4C8CC61D44F87545860F1D5C@SUEX10-mbx-03.ad.syr.edu>

Hi,

Thanks all. So the answer is a custom Reader implementation. I was beating
around the bush with Tokenizer.

Regards,

Prakash Bande
Director - Hyperworks Enterprise Software
Altair Eng. Inc.
Troy MI
Ph: 248-614-2400 ext 489
Cell: 248-404-0292

-----Original Message-----
From: Steven A Rowe [mailto:sarowe@syr.edu]
Sent: Monday, February 27, 2012 2:16 PM
To: java-user@lucene.apache.org
Subject: RE: Customizing indexing of large files

PatternReplaceCharFilter would probably work, or maybe a custom CharFilter?
*CharFilter has the advantage of preserving original text offsets, for
highlighting.

Steve

> -----Original Message-----
> From: Glen Newton [mailto:glen.newton@gmail.com]
> Sent: Monday, February 27, 2012 12:57 PM
> To: java-user@lucene.apache.org
> Subject: Re: Customizing indexing of large files
>
> Hi,
>
> Understood.
> Write a custom FileReader that filters out the text you do not want.
> This will do it streaming.
>
> Glen
>
> On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
> wrote:
> > Hi,
> >
> > Description is multi-line, and there is other text as well. So,
> > essentially what I need is to jump to DATA_END as soon as I hit
> > DATA_BEGIN.
> >
> > I am creating the field using the constructor Field(String name, Reader
> > reader) and using StandardAnalyzer. Right now I am using FileReader, which
> > is causing all the text to be indexed/tokenized.
> >
> > The amount of text I am interested in is also pretty large; description is
> > just one such example. So I really want some stream-based implementation
> > to avoid keeping a large amount of text in memory. Maybe a custom
> > TokenStream, but I don't know what to implement in TokenStream. The only
> > abstract method is incrementToken, and I have no idea what to do in it.
> >
> > Regards,
> >
> > Prakash Bande
> > Director - Hyperworks Enterprise Software
> > Altair Eng. Inc.
> > Troy MI
> > Ph: 248-614-2400 ext 489
> > Cell: 248-404-0292
> >
> > -----Original Message-----
> > From: Glen Newton [mailto:glen.newton@gmail.com]
> > Sent: Monday, February 27, 2012 12:05 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Customizing indexing of large files
> >
> > I'd suggest writing a perl script or
> > insert-favourite-scripting-language-here script to pre-filter this
> > content out of the files before it gets to Lucene/Solr
> > Or you could just grep for "Data" and "Description" (or is
> > 'Description' multi-line)?
> >
> > -Glen Newton
> >
> > On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
> > wrote:
> >> Hi,
> >>
> >> I want to customize the indexing of some specific kind of files I have.
> >> I am using 2.9.3 but upgrading is possible.
> >> This is how my file's data looks
> >>
> >> *****************************
> >> Data for 2010
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month       P1          P2          P3
> >> 01          3243.433    43534.324   45345.2443
> >> 02          3242.324    234234.24   323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> Data for 2011
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month       P1          P2          P3
> >> 01          3243.433    43534.324   45345.2443
> >> 02          3242.324    234234.24   323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> *****************************
> >>
> >> I would like to use a StandardAnalyzer, but do not want to index the
> >> data of the columns, i.e. skip all those numbers. Basically, as soon as I
> >> hit the keyword DATA_BEGIN, I want to jump to DATA_END.
> >> So, what is the best approach? A custom Reader, a custom tokenizer,
> >> or some other mechanism?
> >> Regards,
> >>
> >> Prakash Bande
> >> Altair Eng. Inc.
> >> Troy MI
> >> Ph: 248-614-2400 ext 489
> >> Cell: 248-404-0292
> >>
> >
> > --
> > -
> > http://zzzoot.blogspot.com/
> > -
> >
>
> --
> -
> http://zzzoot.blogspot.com/
> -

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
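
For illustration, the custom Reader the thread settles on could look
roughly like the sketch below. It is not code from any of the posters. The
class name DataBlockSkippingReader is hypothetical, and the sketch assumes
that DATA_BEGIN and DATA_END each sit alone on their own line, as in the
sample file. It uses only java.io, reads one line at a time, and so never
holds more than a single line in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical Reader that drops every line from DATA_BEGIN through DATA_END. */
class DataBlockSkippingReader extends Reader {
    private final BufferedReader in;
    private String pending = "";  // current kept line, not yet fully handed out
    private int pos = 0;          // read position inside pending
    private boolean eof = false;

    DataBlockSkippingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Loads the next line that lies outside a data block; false at end of input. */
    private boolean fill() throws IOException {
        boolean skipping = false;
        String line;
        while ((line = in.readLine()) != null) {
            String trimmed = line.trim();
            if (skipping) {
                // Assumption: the end marker is alone on its line.
                if (trimmed.equals("DATA_END")) skipping = false;
            } else if (trimmed.equals("DATA_BEGIN")) {
                skipping = true;
            } else {
                pending = line + "\n";  // line endings are normalized to \n
                pos = 0;
                return true;
            }
        }
        eof = true;
        return false;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (pos >= pending.length()) {
            if (eof || !fill()) return -1;
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        String sample = "Data for 2010\nDescription: general description.\n"
                + "DATA_BEGIN\n01 3243.433 43534.324\nDATA_END\nData for 2011\n";
        try (Reader r = new DataBlockSkippingReader(new StringReader(sample))) {
            int c;
            while ((c = r.read()) != -1) System.out.print((char) c);
        }
    }
}
```

Such a reader could be wrapped around the FileReader and passed to the
Field(String name, Reader reader) constructor mentioned in the thread.
Because it removes characters before analysis, token offsets no longer
match the original file, which is the trade-off Steve points out in favour
of a CharFilter when highlighting matters.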