lucene-java-user mailing list archives

From Philip Brown <...@us.ibm.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 13:50:01 GMT

Do you mean StandardTokenizer.jj (in org.apache.lucene.analysis.standard)?  I'm
not seeing a StandardAnalyzer.jj in the Lucene source download.

Mark Miller-5 wrote:
> 
> Philip Brown wrote:
>> Hi,
>>
>> After running some tests using the StandardAnalyzer, and getting 0 results
>> from the search, I believe I need a special Tokenizer/Analyzer.  Does
>> anybody have something that parses like the following:
>>
>> - doesn't parse apart phrases (in quotes)
>> - doesn't parse/separate hyphenated or underscored words
>> plus other normal stuff like:
>> - parses on whitespace
>> - removes periods in acronyms
>> - lowercases everything (even in quotes? -- maybe)
>>
>> I basically have a set of terms, some of which are multi-worded phrases,
>> but none should ever be broken apart -- not when adding the documents,
>> not when querying the search results, etc.  I'm creating the field in the
>> documents as UN_TOKENIZED and using a StandardAnalyzer and basic Query
>> object to get the results.  Any suggestions and/or existing code that I
>> could re-use to fit this purpose?
>>
>> Thanks.
>>   
> Here is what I would do. Pull the Standard Analyzer out of Lucene. 
> Modify StandardAnalyzer.jj. This is a JavaCC file. In it, there is some 
> regex that defines tokens for parsing. Now try some steps similar to 
> this: add '_' and '-' to the definition of a letter. Add a new token 
> type that eats quoted phrases...look at QueryParser.jj for an example, 
> probably about halfway down the file, at <QUOTED>. Now run JavaCC on the 
> StandardAnalyzer.jj. Search the mailing list when you find out that a 
> ParseException is screwing up compilation (I really wish someone would 
> update that for the latest JavaCC if indeed that is the problem. It's 
> really annoying, and excluding it from compilation doesn't seem to fix 
> it anymore).
> 
> - Mark
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
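
For illustration only, the tokenizing rules listed in the quoted request can
be sketched in plain Java, without Lucene or JavaCC.  This is not the
StandardTokenizer grammar and not anything posted in this thread: the class
name PhraseTokenizerSketch, the method names, and the acronym pattern are all
hypothetical choices made for the example.  It assumes phrases are delimited
by plain double quotes and treats an acronym as two or more letter-plus-period
pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: a sketch of the requested rules.
final class PhraseTokenizerSketch {

    // Splits on whitespace, but keeps "double-quoted phrases" as one token.
    // '-' and '_' are ordinary characters here, so they never split a word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // A quote ends the token in progress and toggles phrase mode.
                flush(current, tokens);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, tokens);
            } else {
                current.append(c);
            }
        }
        // An unterminated quote is simply closed at end of input.
        flush(current, tokens);
        return tokens;
    }

    // Emits the buffered text as a normalized token, if there is any.
    private static void flush(StringBuilder current, List<String> tokens) {
        String raw = current.toString().trim();
        current.setLength(0);
        if (!raw.isEmpty()) {
            tokens.add(normalize(raw));
        }
    }

    // Lowercases everything (including phrases) and strips the periods
    // from tokens that look like acronyms, e.g. "U.S.A." becomes "usa".
    private static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lower.matches("(\\p{L}\\.){2,}")) {
            return lower.replace(".", "");
        }
        return lower;
    }
}
```

Calling PhraseTokenizerSketch.tokenize("Foo-Bar \"New York\" U.S.A. some_word")
returns the four tokens foo-bar, new york, usa and some_word.  The sketch does
not strip trailing punctuation or handle acronyms inside a quoted phrase, and
wiring the same rules into a Lucene Tokenizer/Analyzer is not shown here.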

-- 
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6098930
Sent from the Lucene - Java Users forum at Nabble.com.

