lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Customizing indexing of large files
Date Mon, 27 Feb 2012 19:15:56 GMT
PatternReplaceCharFilter would probably work, or maybe a custom CharFilter? Either kind of
CharFilter has the advantage of preserving the original text offsets, for highlighting.

Steve
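
For comparison, here is a minimal sketch of the streaming custom-Reader idea Glen suggests in the quoted message below. It uses only java.io, not Lucene. The class name SectionSkippingReader and its constructor parameters are hypothetical, and it assumes DATA_BEGIN and DATA_END each sit alone on a line, as in the sample file quoted further down. Because it is a plain Reader and not a CharFilter, offsets seen by the analyzer will not match the original file, which is the trade-off noted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/**
 * Hypothetical sketch: streams text line by line and drops every line from a
 * begin-marker line through the matching end-marker line (markers included).
 * Only one line is held in memory at a time.
 */
class SectionSkippingReader extends Reader {
    private final BufferedReader in;
    private final String beginMarker;
    private final String endMarker;
    private String pending = ""; // current kept line, newline-terminated
    private int pos = 0;         // read position inside 'pending'

    SectionSkippingReader(Reader source, String beginMarker, String endMarker) {
        this.in = new BufferedReader(source);
        this.beginMarker = beginMarker;
        this.endMarker = endMarker;
    }

    /** Loads the next line to keep; returns false at end of input. */
    private boolean fill() throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().equals(beginMarker)) {
                // Discard everything up to and including the end marker
                // (or up to end of input if the end marker is missing).
                while ((line = in.readLine()) != null
                        && !line.trim().equals(endMarker)) {
                    // skipped
                }
                continue;
            }
            // readLine() strips the terminator, so line endings are
            // normalized to '\n' here (assumed acceptable for indexing).
            pending = line + "\n";
            pos = 0;
            return true;
        }
        return false;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= pending.length() && !fill()) {
            return -1; // end of input
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Something like new SectionSkippingReader(new FileReader(file), "DATA_BEGIN", "DATA_END") could then be handed to the Field(String name, Reader reader) constructor mentioned in the thread in place of the bare FileReader; I have not tried this against 2.9.3.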

> -----Original Message-----
> From: Glen Newton [mailto:glen.newton@gmail.com]
> Sent: Monday, February 27, 2012 12:57 PM
> To: java-user@lucene.apache.org
> Subject: Re: Customizing indexing of large files
> 
> Hi,
> 
> Understood.
> Write a custom FileReader that filters out the text you do not want.
> This will do it streaming.
> 
> Glen
> 
> On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
> <prakashr@altair.com> wrote:
> > Hi,
> >
> > The description is multi-line, and there is other text as well. So,
> essentially what I need is to jump to DATA_END as soon as I hit
> DATA_BEGIN.
> >
> > I am creating the field using the constructor Field(String name, Reader
> reader) and using StandardAnalyzer. Right now I am using FileReader, which
> is causing all the text to be indexed/tokenized.
> >
> > The amount of text I am interested in is also pretty large; the description is
> just one such example. So, I really want a stream-based implementation
> to avoid keeping a large amount of text in memory. Maybe a custom
> TokenStream, but I don't know what to implement in a TokenStream. The only
> abstract method is incrementToken, and I have no idea what to do in it.
> >
> > Regards,
> >
> > Prakash Bande
> > Director - Hyperworks Enterprise Software
> > Altair Eng. Inc.
> > Troy MI
> > Ph: 248-614-2400 ext 489
> > Cell: 248-404-0292
> >
> > -----Original Message-----
> > From: Glen Newton [mailto:glen.newton@gmail.com]
> > Sent: Monday, February 27, 2012 12:05 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Customizing indexing of large files
> >
> > I'd suggest writing a perl script or
> > insert-favourite-scripting-language-here script to pre-filter this
> > content out of the files before it gets to Lucene/Solr
> > Or you could just grep for "Data" and "Description" (or is
> > "Description" multi-line?)
> >
> > -Glen Newton
> >
> > On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
> > <prakashr@altair.com> wrote:
> >> Hi,
> >>
> >> I want to customize the indexing of a specific kind of file I have.
> I am using 2.9.3, but upgrading is possible.
> >> This is what my file's data looks like:
> >>
> >> *****************************
> >> Data for 2010
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month       P1          P2          P3
> >> 01          3243.433    43534.324   45345.2443
> >> 02          3242.324    234234.24   323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> Data for 2011
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month       P1          P2          P3
> >> 01          3243.433    43534.324   45345.2443
> >> 02          3242.324    234234.24   323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> *****************************
> >>
> >> I would like to use StandardAnalyzer, but do not want to index the
> data of the columns, i.e. skip all those numbers. Basically, as soon as I
> hit the keyword DATA_BEGIN, I want to jump to DATA_END.
> >> So, what is the best approach? Using a custom Reader, a custom tokenizer,
> or some other mechanism?
> >> Regards,
> >>
> >> Prakash Bande
> >> Altair Eng. Inc.
> >> Troy MI
> >> Ph: 248-614-2400 ext 489
> >> Cell: 248-404-0292
> >>
> >
> >
> >
> > --
> > -
> > http://zzzoot.blogspot.com/
> > -
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> 
> 
> --
> -
> http://zzzoot.blogspot.com/
> -
