lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject Solr, ICUTokenizer with Latin-break-only-on-whitespace
Date Thu, 20 Jun 2013 19:26:33 GMT
(to solr-user, CC'ing author I'm responding to)

I found the solr-user listserv contribution at:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070409@elyograg.org%3E

Which explains a way you can supply custom rulefiles to ICUTokenizer, in 
this case to tell it to break only on whitespace for Latin character 
substrings.

I am trying to use the technique explained there in Solr 4.3, but either 
it's not working, or it's not doing what I'd expect.

I want, for instance, "C++ Language" to be tokenized into "C++", 
"Language".  But the ICUTokenizer, even with 
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi file 
from the Solr 4.3 source [1], is still stripping the punctuation and 
tokenizing that into "C", "Language".
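
For context, here is a minimal sketch of the kind of schema.xml field type 
this setup implies. It is an illustration, not a copy of my actual config: 
the field type name "text_icu_ws" is hypothetical, and it assumes the rbbi 
file has been copied into the core's conf directory so the rulefiles 
attribute can find it.

```xml
<!-- Hypothetical field type name, for illustration only. -->
<!-- Assumes Latin-break-only-on-whitespace.rbbi sits in the core's conf/ directory. -->
<fieldType name="text_icu_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- rulefiles takes comma-separated script:file pairs; "Latn" is the
         four-letter script code for Latin. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

With a field type like this, the Analysis screen in the Solr admin UI is 
one way to see exactly which tokens come out for "C++ Language".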

Can anyone give me any guidance or hints? I don't understand the 
semantics of the rbbi file well enough to try debugging there. Is 
something not working, or does the rbbi file just not express the 
semantics I want?

Thanks for any tips.



[1] 
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557&view=markup

