lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Raymond Balm├Ęs" <>
Subject Re: Beginner: Specific indexing
Date Fri, 05 Sep 2008 20:29:42 GMT
I think I'm getting you. But the files I'm  going to parse have many formats
: PDF, HTML, Word.
they don't have a particular structure, memos if you will. But the ones I'm
interested in will have the triplets I described

Yes building a TokenFilter as you suggest should do the job.
I guess my initial idea was to use lucene to give me the Token stream so I
can run this specific filter on them.
But I want to do two contradictory things on the same token stream
+ low-band filter which only will gives the "triplets" (actually I do want
to add synonyms to the keywords too, you  are bringing up a good point, but
I could handle it in the query maybe)
+ high-band filter which indexes the text as normal/general text
(StarndarAnalyzer from lucene)

I just have to pass the token stream twice.
Thx a lot, for helping me to clarify my ideas.

On Fri, Sep 5, 2008 at 7:46 PM, Chris Hostetter <>wrote:

> : Interesting if you are not going to use an analyser... what then ? I'm
> : thinking of using javacc, because I oversimplified somewhat the 3 field
> : string structure, so I need a kind of small grammar for that.
> Well, the specifics of "what else" is in your files is going to be the
> biggest factor in deciding how to find the bits of info you need.
> Let me try to put in perspective for you how your question sounded to me,
> as someone unfamiliar with your specific problem.  the question sounded
> equivilent to if someone had asked;
> "I have a bunch of XML files, some of these XML files contain syntax that
> loks like this...
>   <property name="${keyword}" min="${x}" max="${y}" />
> where ${x} and ${y} are small numbers, and ${keyword} is from a fixed list
> of words.  My idea is to simply build a TokenFilter that will look for
> those... do I have it right ?"
> ...and i would say: "Not really.  Use an XML parser to parse your XML and
> extract your structured data, then add them to your Lucene Document."
> You're files may not be XML, but the basic premise is the same; use
> whatever code makes the most sense to parse whatever file format you are
> dealing with given what you know aboutthe files (not just the parts you
> want, but the other parts as well)
> Where an Analyzer might make sense is if you want to do processing on
> those bits of data after you find them ... stemming your keywords, or
> mapping them to synonyms, etc...
> -Hoss
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message