lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Fri, 01 Sep 2006 13:58:11 GMT
Philip Brown wrote:
> Do you mean StandardTokenizer.jj (org.apache.lucene.analysis.standard)?  I'm
> not seeing StandardAnalyzer.jj in the Lucene source download.
>
> Mark Miller-5 wrote:
>   
>> Philip Brown wrote:
>>     
>>> Hi,
>>>
>>> After running some tests using the StandardAnalyzer, and getting 0 results
>>> from the search, I believe I need a special Tokenizer/Analyzer.  Does
>>> anybody have something that parses like the following:
>>>
>>> - doesn't parse apart phrases (in quotes)
>>> - doesn't parse/separate hyphenated or underscored words
>>>
>>> other normal stuff like
>>> - parses on whitespace
>>> - removes periods in acronyms
>>> - lowercases everything (even in quotes? -- maybe)
>>>
>>> I basically have a set of terms, some of which are multi-worded phrases, but
>>> none should ever be broken apart -- not when adding the documents, not when
>>> querying the search results, etc.  I'm creating the field in the documents
>>> as UN_TOKENIZED and using a StandardAnalyzer and basic Query object to get
>>> the results.  Any suggestions and/or existing code that I could re-use to
>>> fit this purpose?
>>>
>>> Thanks.
>>>
>> Here is what I would do. Pull the Standard Analyzer out of Lucene.
>> Modify StandardAnalyzer.jj. This is a JavaCC file. In it, there is some
>> regex that defines tokens for parsing. Now try some steps similar to
>> this: add '_' and '-' to the definition of a letter. Add a new token
>> type that eats quoted phrases...look at queryparser.jj for an example,
>> prob about halfway down the file <QUOTED>. Now run JavaCC on the
>> StandardAnalyzer.jj. Search the mailing list when you find out that a
>> ParseException is screwing up compilation (I really wish someone would
>> update that for the latest JavaCC if indeed that is the problem. It's
>> really annoying, and excluding it from compilation doesn't seem to fix
>> it anymore).
>>
>> - Mark
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
Yes, StandardTokenizer. Sorry about that...my brain is schizo.
StandardTokenizer.jj in the StandardAnalyzer package
(org.apache.lucene.analysis.standard).

- Mark
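
For readers who want to see the requested token rules in one place, here is
a minimal sketch in plain Java. It does not use the Lucene API and is not the
StandardTokenizer.jj grammar; the class name PhraseTokenizerSketch, the method
tokenize, and both regular expressions are assumptions made up for
illustration. It only shows the behavior asked for in the thread: quoted
phrases stay whole, '-' and '_' stay inside words, text is split on
whitespace, periods are dropped from acronyms, and everything is lowercased
(including inside quotes, which the original post left open).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class PhraseTokenizerSketch {

    // Group 1: the text between a pair of double quotes (a phrase kept whole).
    // Group 2: a run of characters that are neither whitespace nor a quote,
    //          so '-' and '_' remain part of the word.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]+)\"|([^\\s\"]+)");

    // Assumed definition of an acronym: two or more single letters,
    // each followed by a period, e.g. "U.S.A.".
    private static final Pattern ACRONYM =
            Pattern.compile("(?:\\p{L}\\.){2,}");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            boolean quoted = m.group(1) != null;
            String text = quoted ? m.group(1) : m.group(2);
            // Strip periods only from unquoted tokens that look like acronyms.
            if (!quoted && ACRONYM.matcher(text).matches()) {
                text = text.replace(".", "");
            }
            tokens.add(text.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }
}
```

Calling PhraseTokenizerSketch.tokenize("\"New York\" U.S.A. well-known foo_bar")
returns the list [new york, usa, well-known, foo_bar]. In Lucene itself the
same rules would be expressed as token definitions in the JavaCC grammar, as
described in the quoted reply above, rather than as java.util.regex patterns.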
