lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: ICUFoldingFilter
Date Mon, 04 Jun 2018 14:53:49 GMT
Actually, you can now choose to have certain characters ignored by using
the Unicode filtering mechanism.

This was added in https://issues.apache.org/jira/browse/LUCENE-8129

So apply a Unicode set filter such as [^\^] (every character except ^) and
ICUFoldingFilter will leave ^ untouched.
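
To illustrate what such a filter does, here is a minimal sketch that uses only the JDK. It does not use Lucene or ICU4J: toyFold is a hypothetical stand-in for the real utr30 folding (it decomposes, drops combining marks and modifier symbols such as ^, and lowercases), and foldFiltered is a hypothetical helper that folds only the runs of characters inside the set and copies everything else through unchanged. As far as I understand it, in Lucene itself the set is supplied as the `filter` argument of ICUFoldingFilterFactory, so treat this purely as a model of the behaviour, not as the actual implementation.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

class FilteredFoldSketch {

    // Hypothetical stand-in for ICU folding (NOT the real utr30 rules):
    // decompose, drop combining marks and modifier symbols, lowercase the rest.
    static String toyFold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            if (type == Character.NON_SPACING_MARK || type == Character.MODIFIER_SYMBOL) {
                return; // this code point is folded away
            }
            out.appendCodePoint(Character.toLowerCase(cp));
        });
        return out.toString();
    }

    // Fold only the runs of code points accepted by inSet; code points
    // outside the set are copied to the output exactly as they are.
    static String foldFiltered(String s, IntPredicate inSet) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (inSet.test(cp)) {
                run.appendCodePoint(cp);
            } else {
                out.append(toyFold(run.toString())); // flush the pending run
                run.setLength(0);
                out.appendCodePoint(cp);             // pass through untouched
            }
        });
        out.append(toyFold(run.toString()));
        return out.toString();
    }

    public static void main(String[] args) {
        // Plays the role of the set [^\^]: everything except the caret.
        IntPredicate everythingButCaret = cp -> cp != '^';
        System.out.println(toyFold("aaa^bbb"));
        System.out.println(foldFiltered("aaa^bbb", everythingButCaret));
    }
}
```

Without the filter the toy fold removes the caret; with the filter the caret survives while the surrounding text is still folded. Note that characters excluded from the set skip every part of the fold, case folding included.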

On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <rcmuir@gmail.com> wrote:
> This cannot be "tweaked" at runtime; it is implemented as custom normalization.
>
> You can modify the sources / build your own ruleset or use a different
> tokenfilter to normalize characters.
>
> On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <msokolov@gmail.com> wrote:
>> Hi, I'm using ICUFoldingFilter and for the most part it does exactly what I
>> want. However there are some behaviors I'd like to tweak. For example it
>> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that, and
>> whether there is any way to prevent it.
>>
>> I spent a little time with
>> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I guess
>> is the basis for what this filter does (it's referenced in the javadocs),
>> but that didn't answer my questions. As an aside, it seems this tech report
>> was withdrawn by the Unicode Consortium? Not sure what that means if
>> anything, but it seems ominous.
>>
>> Anyway, I would appreciate pointers to more info, and specifically, whether
>> there are any alternatives to the utr30.nrm data file, or any possibility
>> to select among the many transformations this filter applies.
>>
>> Thanks!
>>
>> Mike S


