lucene-java-user mailing list archives

From: Mark Miller
Subject: Re: Phrase search using quotes -- special Tokenizer
Date: Fri, 01 Sep 2006 11:52:45 GMT
Philip Brown wrote:
> Hi,
> After running some tests using the StandardAnalyzer, and getting 0 results
> from the search, I believe I need a special Tokenizer/Analyzer.  Does
> anybody have something that parses like the following:
> - doesn't parse apart phrases (in quotes)
> - doesn't parse/separate hyphenated or underscored words
> plus other normal stuff like
> - parses on whitespace
> - removes periods in acronyms
> - lowercases everything (even in quotes? -- maybe)
> I basically have a set of terms, some of which are multi-worded phrases, but
> none should ever be broken apart -- not when adding the documents, not when
> querying the search results, etc.  I'm creating the field in the documents
> as UN_TOKENIZED and using a StandardAnalyzer and basic Query object to get
> the results.  Any suggestions and/or existing code that I could re-use to
> fit this purpose?
> Thanks.
Here is what I would do. Pull the StandardAnalyzer source out of Lucene
and modify StandardTokenizer.jj, the JavaCC grammar behind it. That file
contains the regex-like definitions of the tokens used for parsing. Now
try steps similar to these: add '_' and '-' to the definition of a
letter. Add a new token type that eats quoted phrases; look at
QueryParser.jj for an example, the <QUOTED> token, probably about
halfway down the file. Now run JavaCC on the modified grammar. Search
the mailing list when you find out that a ParseException class is
breaking compilation. (I really wish someone would update that for the
latest JavaCC, if indeed that is the problem. It's really annoying, and
excluding it from compilation doesn't seem to fix it anymore.)
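
Before touching the grammar, it can help to pin down the token behavior
you are after. The following is only a rough sketch in plain Java, not
Lucene code and not a JavaCC grammar: the class PhraseAwareTokenizer and
its methods are hypothetical names invented for illustration. It assumes
double quotes delimit phrases, treats '-' and '_' as ordinary word
characters, splits on whitespace, lowercases everything (including
quoted phrases), and strips periods only from tokens that look like
acronyms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: models the desired tokens only.
public class PhraseAwareTokenizer {

    /** Splits on whitespace, but keeps a "quoted phrase" as one token. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // A quote opens or closes a phrase; either way, emit the buffer.
                flush(current, tokens);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, tokens);
            } else {
                // '-' and '_' land here, so they never split a word.
                current.append(c);
            }
        }
        flush(current, tokens); // also emits an unterminated quoted phrase
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        String token = normalize(current.toString());
        if (!token.isEmpty()) {
            tokens.add(token);
        }
        current.setLength(0);
    }

    /** Lowercases, and removes periods from acronym-shaped tokens. */
    static String normalize(String raw) {
        String s = raw.trim().toLowerCase(Locale.ROOT);
        // Assumed acronym shape: single letters separated by periods (u.s.a.)
        if (s.matches("\\p{L}(\\.\\p{L})+\\.?")) {
            s = s.replace(".", "");
        }
        return s;
    }
}
```

With this sketch, tokenize("\"New York\" e-mail U.S.A.") yields the
three tokens "new york", "e-mail" and "usa". Whatever rules you settle
on, the same analysis has to apply at index time and at query time, or
the terms will not match.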

- Mark
