lucene-java-user mailing list archives

From <>
Subject Custom Tokenizer
Date Thu, 05 Dec 2013 17:05:01 GMT

I have been using StandardAnalyzer in my code and it works fine. One challenge I face is
that this analyzer, by default, tokenizes on some special characters, such as the hyphen,
in addition to the SPACE character.

I want to tokenize only on the SPACE character. Could you please suggest how I can achieve
this?

I found the example below when I googled for this. What I want to use is the WhitespaceTokenizer,
so that the data is not manipulated in any way. I understand that in this case a search for
"mechanisms" won't return results because of the period (.) at the end of the token. I then
want to address this by introducing wildcard searches.
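
To make the intent concrete, here is a minimal plain-Java sketch of the matching behaviour I am after. It is not Lucene code: the class name PrefixMatchSketch and both helper methods are hypothetical, and it only assumes that a trailing-wildcard query such as mechanisms* behaves like a case-sensitive prefix match against whole tokens.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: models exact vs. trailing-wildcard matching on
// whitespace-delimited tokens. All names here are hypothetical.
public class PrefixMatchSketch {

    // Exact match: the query must equal a whole token, punctuation included.
    static boolean exactMatch(List<String> tokens, String query) {
        return tokens.contains(query);
    }

    // Trailing-wildcard match ("prefix*"): any token that starts with the prefix.
    static boolean prefixMatch(List<String> tokens, String prefix) {
        for (String token : tokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Tokens as whitespace-only splitting would leave them.
        List<String> tokens = Arrays.asList("we", "investigated", "the", "mechanisms.");

        // The trailing period prevents the exact match ...
        System.out.println("exact:  " + exactMatch(tokens, "mechanisms"));
        // ... while the prefix (wildcard-style) match still finds the token.
        System.out.println("prefix: " + prefixMatch(tokens, "mechanisms"));
    }
}
```

Note that the comparison above is case-sensitive, so a prefix of "in" would not match a token "In".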

Data: 1097-0215 (i.v) product-123 anti-virus, we investigated the mechanisms. 2266-73 In the
present study

Tokens generated with StandardTokenizer:
[1097-0215] [i.v] [product-123] [anti] [virus] [we] [investigated] [the] [mechanisms] [2266-73]
[In] [the] [present] [study]

Tokens generated with WhitespaceTokenizer:
[1097-0215] [(i.v)] [product-123] [anti-virus,] [we] [investigated] [the] [mechanisms.] [2266-73]
[In] [the] [present] [study]
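
The whitespace-only token list above can be reproduced with a few lines of plain Java. This is only a sketch that approximates the splitting with a regular expression. It does not use Lucene, and the class name WhitespaceSplitSketch and its tokenize method are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: splits text on runs of whitespace and leaves every
// other character (hyphens, parentheses, commas, periods) untouched.
public class WhitespaceSplitSketch {

    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList(); // avoid a single empty token
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String data = "1097-0215 (i.v) product-123 anti-virus, we investigated "
                + "the mechanisms. 2266-73 In the present study";

        // Print each token in brackets, in the same style as the lists above.
        StringBuilder out = new StringBuilder();
        for (String token : tokenize(data)) {
            out.append('[').append(token).append("] ");
        }
        System.out.println(out.toString().trim());
    }
}
```

Punctuation stays attached to the neighbouring word, which is why [mechanisms.] and [anti-virus,] appear as single tokens.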
Note: I have tried using the WhitespaceAnalyzer, which by default tokenizes ONLY on whitespace,
but my attempts at wildcard searches didn't work as expected, whereas wildcard searches
worked fine with StandardAnalyzer.

Please provide your inputs.



