lucene-java-user mailing list archives

From "Jack Krupansky"
Subject Re: tokenizer to strip a set of characters
Date Thu, 21 Nov 2013 15:46:51 GMT
The word delimiter filter lets you pass a table that specifies the type of
each character:

byte[], int, org.apache.lucene.analysis.util.CharArraySet)

There is also a regex token filter that you could use to make fine
adjustments, such as characters that are allowed within tokens but ignored at
the start or end.
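
As a rough illustration of the regex idea, here is a minimal sketch in plain Java (no Lucene dependency) of a pattern that strips punctuation only at the start or end of a token. The class name, the method name, and the character set `[,.!?]` are assumptions taken from the examples in the original question, not anything Lucene defines. The same pattern could presumably be handed to a regex-based token filter (PatternReplaceFilter, for instance) with an empty replacement, but check the constructor for your Lucene version.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Lucene.
class EdgePunctuationStripper {

    // Assumed character set: the punctuation that appears in the examples.
    // First alternative matches a run at the start, second a run at the end.
    private static final Pattern EDGE_PUNCTUATION =
            Pattern.compile("^[,.!?]+|[,.!?]+$");

    // Removes leading and trailing punctuation, leaving inner characters alone.
    static String strip(String token) {
        return EDGE_PUNCTUATION.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        String[] tokens = {"foo,", "foo.", "foo!!!!", "foo?!", ",foo", "foo.bar"};
        for (String token : tokens) {
            System.out.println(token + " -> " + strip(token));
        }
    }
}
```

Because the pattern is anchored, a period inside a token (as in foo.bar) is left untouched. A token made up only of these characters becomes empty, so a real filter chain would probably need to drop empty tokens afterwards.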

-- Jack Krupansky

-----Original Message----- 
From: Stephane Nicoll
Sent: Thursday, November 21, 2013 9:42 AM
Subject: tokenizer to strip a set of characters


I am using Lucene 3.6 and I am looking for a tokenizer that would remove
certain characters when they are present at the beginning or at the end of
a token.

I initially used the StandardAnalyzer and switched to the
WhitespaceAnalyzer because the former was too aggressive for my use case.

A few examples:

   - foo, -> foo (comma at the end)
   - foo. -> foo (period at the end)
   - foo!!!! -> foo
   - foo?! -> foo
   - ,foo -> foo (a comma at the beginning of a word is a typo but
   should be handled)

Is there a configurable tokenizer I could use for this?

