lucene-solr-user mailing list archives

From Jonathan Rochkind
Subject Solr, ICUTokenizer with Latin-break-only-on-whitespace
Date Thu, 20 Jun 2013 19:26:33 GMT
(to solr-user, CC'ing author I'm responding to)

I found the solr-user listserv contribution at:

It explains a way to supply custom rulefiles to ICUTokenizer, in 
this case to tell it to break only on whitespace for Latin characters.

I am trying to use the technique explained there in Solr 4.3, but either 
it's not working, or it's not doing what I'd expect.

I want, for instance, "C++ Language" to be tokenized into "C++", 
"Language". But the ICUTokenizer, even with 
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi file 
from the Solr 4.3 source [1], is still stripping the punctuation and 
tokenizing that into "C", "Language".
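
For reference, here is a minimal sketch of the kind of schema.xml field 
type this setup implies. The field type name "text_icu_ws", the 
positionIncrementGap value, and the absence of any token filters are 
assumptions for illustration, as is the rbbi file having been copied into 
the core's conf directory. Only the tokenizer's rulefiles attribute comes 
from the configuration described above.

```xml
<!-- Hypothetical field type; only the rulefiles attribute reflects the setup described above. -->
<fieldType name="text_icu_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- rulefiles maps a four-letter script code (Latn) to an RBBI rules file.
         The file is assumed to sit in the core's conf directory. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

With a field type like this in place, the Analysis page of the Solr admin 
UI can show which tokens "C++ Language" actually produces.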

Can anyone give me any guidance or hints? I don't understand the 
semantics of the rbbi file well enough to try debugging there. Is 
something not working, or does the rbbi file just not express the 
semantics I want?

Thanks for any tips.

