lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: ICUFoldingFilter
Date Tue, 05 Jun 2018 01:12:10 GMT
There may be a trap, e.g. if you make such a filter with UnicodeSet,
I think you really need to call .freeze() before passing it to this
thing. I have not examined the sources in a while but I think this
might be similar to "compiling a regexp" in that you'll then get good
performance when it's later called millions of times.

If you use the factories, it will do this for you. But if you use the
API directly it is currently a bit of a performance trap...
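
As a rough illustration of the two ideas in this thread (a filter in the
spirit of [^\^] that exempts ^ from folding, and preparing the filter once
up front, by analogy with "compiling a regexp"), here is a minimal JDK-only
sketch. It does not use Lucene or ICU at all: the class and method names are
hypothetical, and fold() is only a stand-in for the real UTR#30 folding (it
strips combining marks and ^, then lowercases).

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical JDK-only sketch; not the Lucene or ICU implementation. */
public class FilteredFoldingSketch {

    // Compiled once and reused for every call, by analogy with freezing a
    // UnicodeSet up front. Matches runs of characters other than '^',
    // i.e. the characters that folding is allowed to touch.
    private static final Pattern FOLDABLE_RUN = Pattern.compile("[^\\^]+");

    // Combining marks left behind by NFKD decomposition.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    /** Stand-in folding (an assumption, NOT UTR#30): decompose, drop marks and '^', lowercase. */
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return MARKS.matcher(decomposed).replaceAll("")
                .replace("^", "")
                .toLowerCase(Locale.ROOT);
    }

    /** Folds only the runs accepted by the filter; everything else is copied unchanged. */
    public static String foldFiltered(String s) {
        Matcher m = FOLDABLE_RUN.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());                      // outside the filter: copy as-is
            out.append(fold(s.substring(m.start(), m.end())));   // inside the filter: fold
            last = m.end();
        }
        out.append(s, last, s.length());                         // trailing unfiltered text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Aaa^Bbb"));          // aaabbb
        System.out.println(foldFiltered("Aaa^Bbb"));  // aaa^bbb
    }
}
```

The point of the sketch is the shape, not the details: the filter describes
the characters that get folded, so excluding ^ from it leaves ^ untouched,
and the filter object is built once rather than on every call.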

On Mon, Jun 4, 2018 at 2:49 PM, Michael Sokolov <msokolov@gmail.com> wrote:
> Ah thanks! That's very good to know. As it is I realized we already have an
> earlier component where we can handle this (we have a custom ICUTokenizer
> rbbi and can just split on "^"). So much flexibility.
>
> -Mike
>
> On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Actually, you can now choose to ignore certain characters by using
>> the Unicode filtering mechanism.
>>
>> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
>>
>> So apply a filter such as [^\^] and the filter will ignore ^.
>>
>> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <rcmuir@gmail.com> wrote:
>> > This cannot be "tweaked" at runtime; it is implemented as custom
>> > normalization.
>> >
>> > You can modify the sources / build your own ruleset or use a different
>> > tokenfilter to normalize characters.
>> >
>> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <msokolov@gmail.com> wrote:
>> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly what I
>> >> want. However there are some behaviors I'd like to tweak. For example it
>> >> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that, and
>> >> whether there is any way to prevent it.
>> >>
>> >> I spent a little time with
>> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I guess
>> >> is the basis for what this filter does (it's referenced in the javadocs),
>> >> but that didn't answer my questions. As an aside, it seems this tech report
>> >> was withdrawn by the Unicode Consortium? Not sure what that means, if
>> >> anything, but it seems ominous.
>> >>
>> >> Anyway, I would appreciate pointers to more info, and specifically, whether
>> >> there are any alternatives to the utr30.nrm data file, or any possibility
>> >> to select among the many transformations this filter applies.
>> >>
>> >> Thanks!
>> >>
>> >> Mike S
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
