lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Looking For Tokenizer With Custom Delimeter
Date Mon, 08 Jan 2018 10:53:17 GMT
Hi,

This is easy to customize with lambdas! E.g., an elegant way to create a tokenizer that
behaves exactly like WhitespaceTokenizer combined with LowerCaseFilter is:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace, Character::toLowerCase);
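
Consuming the tokens is the usual TokenStream workflow. A minimal sketch, assuming Lucene 7.x
(where CharTokenizer lives in the analyzers-common module) and an arbitrary input string:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharTokenizer;

// split on whitespace, lowercase each code point while tokenizing
Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace, Character::toLowerCase);
tok.setReader(new StringReader("Foo Bar"));
CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
tok.reset();
while (tok.incrementToken()) {
  System.out.println(term.toString()); // prints "foo", then "bar"
}
tok.end();
tok.close();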

By adjusting the lambdas you can create a tokenizer based on any character check, e.g. to split
on whitespace or underscore:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(ch -> Character.isWhitespace(ch) ||
ch == '_');
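
Wrapped in an Analyzer, that covers your use case directly. A minimal sketch, again assuming
Lucene 7.x; note that CharTokenizer skips runs of separator characters, so consecutive
delimiters as in "foo__bar" do not produce empty tokens and no extra filter is needed:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // treat whitespace and underscore as separators
    Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
        ch -> Character.isWhitespace(ch) || ch == '_');
    return new TokenStreamComponents(tok);
  }
};
// analyzer.tokenStream("field", "foo__bar doo") then yields "foo", "bar", "doo".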

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Armins Stepanjans [mailto:armins.bagrats@gmail.com]
> Sent: Monday, January 8, 2018 11:30 AM
> To: java-user@lucene.apache.org
> Subject: Looking For Tokenizer With Custom Delimeter
> 
> Hi,
> 
> I am looking for a tokenizer, where I could specify a delimiter by which
> the words are tokenized, for example if I choose the delimiters as ' ' and
> '_' the following string:
> "foo__bar doo"
> would be tokenized into:
> "foo", "", "bar", "doo"
> (The analyzer could further filter empty tokens, since having the empty
> string token is not critical).
> 
> Is such functionality built into Lucene (I'm working with 7.1.0) and does
> this seem like the correct approach to the problem?
> 
> Regards,
> Armīns


