lucene-java-user mailing list archives

From John Byrne <john.by...@propylon.com>
Subject Re: Tokenizer question: how can I force ? and ! to be separate tokens?
Date Fri, 17 Jul 2009 18:43:07 GMT
Yes, you could even use the WhitespaceTokenizer and then look for the 
symbols in a token filter. The tokenizer would hand you [you?] as a single 
token; your job in the token filter is then to store the [?] and return 
the [you]. The next time the token filter is called, you return the [?] 
that you stored previously, before pulling the next token from the tokenizer.
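
To make the buffering idea concrete, here is a rough sketch in plain Java. 
It is not written against Lucene's TokenFilter API (which differs between 
versions); PunctuationSplitter and its split() helper are hypothetical 
names used only for illustration, and the sketch assumes the symbols only 
appear at the end of a whitespace-separated token.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical illustration of "store the symbol, return it on the next call".
 * This is a plain Iterator wrapper, not a Lucene TokenFilter.
 */
class PunctuationSplitter implements Iterator<String> {
    private final Iterator<String> input;                      // upstream whitespace tokens
    private final Deque<String> pending = new ArrayDeque<>();  // symbols held back for later calls

    PunctuationSplitter(Iterator<String> input) {
        this.input = input;
    }

    private static boolean isSymbol(char c) {
        return c == '?' || c == '!';
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty() || input.hasNext();
    }

    @Override
    public String next() {
        // Stored symbols are returned before any new upstream token is read.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        if (!input.hasNext()) {
            throw new NoSuchElementException();
        }
        String token = input.next();

        // Find where the run of trailing ? and ! characters starts.
        int end = token.length();
        while (end > 0 && isSymbol(token.charAt(end - 1))) {
            end--;
        }
        // Store each trailing symbol, in order, as its own token.
        for (int i = end; i < token.length(); i++) {
            pending.add(String.valueOf(token.charAt(i)));
        }
        // If the token was nothing but symbols, return the first stored one now.
        if (end == 0 && !pending.isEmpty()) {
            return pending.poll();
        }
        return token.substring(0, end);
    }

    /** Convenience helper: whitespace-split the text, then apply the splitter. */
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        Iterator<String> it =
            new PunctuationSplitter(Arrays.asList(trimmed.split("\\s+")).iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

With this sketch, PunctuationSplitter.split("how are you?") yields the four 
tokens [how], [are], [you] and [?]. In a real token filter you would also 
need to carry over offsets and position information for the stored token.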

If you're already using something that's grammar-based (such as 
StandardTokenizer) then you could add "?" and "!" to the grammar as 
separate tokens. If you can figure out how to do this from looking at the 
grammar file, then it's probably the simplest way.

-John

Matthew Hall wrote:
> I'd think extending WhitespaceTokenizer would be a good place to start.
>
> Then create a new Analyzer that exactly mirrors your current Analyzer, 
> with the exception that it uses your new tokenizer instead of 
> WhitespaceTokenizer. (I am of course assuming that you are using an 
> Analyzer that already uses WhitespaceTokenizer, but you likely are.)
>
> OBender wrote:
>> Hi All,
>>
>> I need to make the ? and ! characters separate tokens, e.g. to split
>> [how are you?] into 4 tokens: [how], [are], [you] and [?]. What would
>> be the best way to do this?
>>
>> Thanks

