Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
From: Prakash Reddy Bande <prakashr@altair.com>
To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
Date: Mon, 27 Feb 2012 14:42:45 -0500
Subject: RE: Customizing indexing of large files
Thread-Topic: Customizing indexing of large files
Thread-Index: Acz1cKvHLRZ4/zGvS3u4dQ+1jOuJrQAKyIeAAAFy5AAAAGLyAAAH44owAA62Y4A=
Message-ID: 
 <7232A3A7C53F634B8E91E65BC7833C511382B1918C@TR-EXCH07.prog.altair.com>
References: 
 <7232A3A7C53F634B8E91E65BC7833C511382B1900F@TR-EXCH07.prog.altair.com>
 <CANL2-4Oi9MNahXQN1=HLtCHNAuMyonwiL=yAGBSWwJsYaQoj-A@mail.gmail.com>
 <7232A3A7C53F634B8E91E65BC7833C511382B19071@TR-EXCH07.prog.altair.com>
 <CANL2-4ONHSgrzuXJsaSsbiuC_aXBKYuPJZ9MXReJm9Pd=bLMrA@mail.gmail.com>
 <6C78E97C707B5B4C8CC61D44F87545860F1D5C@SUEX10-mbx-03.ad.syr.edu>
In-Reply-To: <6C78E97C707B5B4C8CC61D44F87545860F1D5C@SUEX10-mbx-03.ad.syr.edu>
Accept-Language: en-US
Content-Language: en-US
acceptlanguage: en-US
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Hi,

Thanks all. So the answer is a custom Reader implementation. I was beating
around the bush with Tokenizer.
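
For the archive, here is a minimal sketch of such a Reader, assuming the DATA_BEGIN and DATA_END markers each sit on a line of their own as in the sample file. The class name DataBlockSkippingReader and its filter helper are made up for illustration; only java.io is used, and only one line is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: passes text through but drops DATA_BEGIN..DATA_END blocks. */
class DataBlockSkippingReader extends Reader {
    private final BufferedReader in;
    private String pending = "";      // current kept line, not yet fully handed out
    private int pos = 0;              // read position inside pending
    private boolean skipping = false; // true while inside a data block

    DataBlockSkippingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Loads the next line worth keeping; returns false at end of input. */
    private boolean fill() throws IOException {
        while (pos >= pending.length()) {
            String line = in.readLine();
            if (line == null) {
                return false;
            }
            String marker = line.trim();
            if (skipping) {
                // Assumption: DATA_END stands alone on its line.
                if (marker.equals("DATA_END")) {
                    skipping = false;
                }
            } else if (marker.equals("DATA_BEGIN")) {
                skipping = true;
            } else {
                pending = line + "\n"; // line endings are normalized to \n
                pos = 0;
            }
        }
        return true;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience for small inputs and tests: filters a whole string. */
    static String filter(String text) {
        try (Reader r = new DataBlockSkippingReader(new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[1024];
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "Data for 2010\nDescription: general description\n"
                + "DATA_BEGIN\n01 3243.433 43534.324\nDATA_END\nData for 2011\n";
        System.out.print(filter(sample));
    }
}
```

With the Field(String name, Reader reader) constructor mentioned in this thread, it could be plugged in as new Field("contents", new DataBlockSkippingReader(new FileReader(file))), where the field name "contents" is only an example. Since whole lines are dropped, token offsets refer to the filtered text, not to the original file.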

Regards,

Prakash Bande
Director - Hyperworks Enterprise Software
Altair Eng. Inc.
Troy MI
Ph: 248-614-2400 ext 489
Cell: 248-404-0292

-----Original Message-----
From: Steven A Rowe [mailto:sarowe@syr.edu]
Sent: Monday, February 27, 2012 2:16 PM
To: java-user@lucene.apache.org
Subject: RE: Customizing indexing of large files

PatternReplaceCharFilter would probably work, or maybe a custom CharFilter?
*CharFilter has the advantage of preserving original text offsets, for
highlighting.
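
As a rough illustration of the pattern one might hand to PatternReplaceCharFilter, here is a sketch using only java.util.regex. The class name DataBlockPattern and the strip helper are hypothetical, and the PatternReplaceCharFilter constructor itself is left out because its signature depends on the Lucene release, so check the javadoc for your version (and whether it buffers the input, which matters for large files).

```java
import java.util.regex.Pattern;

/** Hypothetical demo of a regex that matches whole DATA_BEGIN..DATA_END blocks. */
class DataBlockPattern {
    // (?s) lets '.' match newlines; the reluctant .*? stops at the first DATA_END.
    static final Pattern DATA_BLOCK = Pattern.compile("(?s)DATA_BEGIN.*?DATA_END");

    /** Removes every data block from an in-memory string (small inputs only). */
    static String strip(String text) {
        return DATA_BLOCK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "Data for 2010\nDATA_BEGIN\n01 3243.433\nDATA_END\n"
                + "Data for 2011\nDATA_BEGIN\n02 1.5\nDATA_END\n";
        System.out.print(strip(sample));
    }
}
```

The same Pattern and an empty replacement string would be the arguments to try with the char filter; unlike this in-memory demo, a CharFilter keeps the mapping back to the original offsets.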

Steve

> -----Original Message-----
> From: Glen Newton [mailto:glen.newton@gmail.com]
> Sent: Monday, February 27, 2012 12:57 PM
> To: java-user@lucene.apache.org
> Subject: Re: Customizing indexing of large files
>
> Hi,
>
> Understood.
> Write a custom FileReader that filters out the text you do not want.
> This will do it streaming.
>
> Glen
>
> On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
> <prakashr@altair.com> wrote:
> > Hi,
> >
> > Description is multiline, and there is other text as well. So,
> > essentially what I need is to jump to DATA_END as soon as I hit
> > DATA_BEGIN.
> >
> > I am creating the field using the constructor Field(String name, Reader
> > reader) and using StandardAnalyzer. Right now I am using FileReader, which
> > is causing all the text to be indexed/tokenized.
> >
> > The amount of text I am interested in is also pretty large; description is
> > just one such example. So, I really want some stream-based implementation
> > to avoid keeping a large amount of text in memory. Maybe a custom
> > TokenStream, but I don't know what to implement in TokenStream. The only
> > abstract method is incrementToken, and I have no idea what to do in it.
> >
> > Regards,
> >
> > Prakash Bande
> > Director - Hyperworks Enterprise Software
> > Altair Eng. Inc.
> > Troy MI
> > Ph: 248-614-2400 ext 489
> > Cell: 248-404-0292
> >
> > -----Original Message-----
> > From: Glen Newton [mailto:glen.newton@gmail.com]
> > Sent: Monday, February 27, 2012 12:05 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Customizing indexing of large files
> >
> > I'd suggest writing a Perl script or
> > insert-favourite-scripting-language-here script to pre-filter this
> > content out of the files before it gets to Lucene/Solr.
> > Or you could just grep for "Data" and "Description" (or is
> > "Description" multi-line)?
> >
> > -Glen Newton
> >
> > On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
> > <prakashr@altair.com> wrote:
> >> Hi,
> >>
> >> I want to customize the indexing of some specific kind of files I have.
> >> I am using 2.9.3 but upgrading is possible.
> >> This is how my file's data looks:
> >>
> >> *****************************
> >> Data for 2010
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month      P1          P2           P3
> >> 01         3243.433    43534.324    45345.2443
> >> 02         3242.324    234234.24    323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> Data for 2011
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month      P1          P2           P3
> >> 01         3243.433    43534.324    45345.2443
> >> 02         3242.324    234234.24    323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> *****************************
> >>
> >> I would like to use StandardAnalyzer, but do not want to index the
> >> data in the columns, i.e. skip all those numbers. Basically, as soon as I
> >> hit the keyword DATA_BEGIN, I want to jump to DATA_END.
> >> So, what is the best approach? Using a custom Reader, a custom tokenizer,
> >> or some other mechanism?
> >> Regards,
> >>
> >> Prakash Bande
> >> Altair Eng. Inc.
> >> Troy MI
> >> Ph: 248-614-2400 ext 489
> >> Cell: 248-404-0292
> >>
> >
> >
> >
> > --
> > -
> > http://zzzoot.blogspot.com/
> > -
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
>
> --
> -
> http://zzzoot.blogspot.com/
> -
