lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: ICUFoldingFilter
Date Mon, 04 Jun 2018 18:49:45 GMT
Ah, thanks! That's very good to know. As it is, I realized we already have an
earlier component where we can handle this (we have a custom ICUTokenizer
rbbi and can just split on "^"). So much flexibility!

-Mike

On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <rcmuir@gmail.com> wrote:

> Actually, you can now choose to ignore certain characters by using the
> Unicode filtering mechanism.
>
> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
>
> So apply a filter such as [^\^] and the filter will ignore ^.
>
> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <rcmuir@gmail.com> wrote:
> > This cannot be "tweaked" at runtime; it is implemented as custom
> > normalization.
> >
> > You can modify the sources / build your own ruleset or use a different
> > tokenfilter to normalize characters.
> >
> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly what
> >> I want. However there are some behaviors I'd like to tweak. For example it
> >> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that,
> >> and whether there is any way to prevent it.
> >>
> >> I spent a little time with
> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I guess
> >> is the basis for what this filter does (it's referenced in the javadocs),
> >> but that didn't answer my questions. As an aside, it seems this tech
> >> report was withdrawn by the Unicode Consortium? Not sure what that means,
> >> if anything, but it seems ominous.
> >>
> >> Anyway, I would appreciate pointers to more info, and specifically,
> >> whether there are any alternatives to the utr30.nrm data file, or any
> >> possibility to select among the many transformations this filter applies.
> >>
> >> Thanks!
> >>
> >> Mike S
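
Note on the Unicode-set filter suggested in the thread ([^\^], per LUCENE-8129): the set names the characters that still get folded, so a negated set that excludes "^" leaves "^" untouched while everything else is folded as before. As far as I understand that issue, the set is passed as a "filter" argument to the folding filter factory, but check the javadocs for your Lucene version. The sketch below is a JDK-only toy model of that idea, not the ICU or Lucene API: the class FilteredFoldingSketch, its fold method (which just lowercases and drops "^", mimicking the "aaa^bbb" to "aaabbb" behavior reported above), and the predicate standing in for the Unicode set are all hypothetical.

```java
import java.util.function.IntPredicate;

// Toy model only: this is NOT the ICU or Lucene API.
final class FilteredFoldingSketch {

    // Hypothetical stand-in for the real folding: lowercases and drops '^',
    // mirroring the "aaa^bbb" -> "aaabbb" behavior reported in the thread.
    static String fold(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints()
         .filter(cp -> cp != '^')
         .map(Character::toLowerCase)
         .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Folds only the code points accepted by inSet; all other code points
    // are copied to the output unchanged.
    static String foldFiltered(String s, IntPredicate inSet) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            out.append(inSet.test(cp) ? fold(ch) : ch);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Plays the role of the Unicode set [^\^]: every character except '^'.
        IntPredicate allButCaret = cp -> cp != '^';
        System.out.println(fold("Aaa^Bbb"));                      // aaabbb
        System.out.println(foldFiltered("Aaa^Bbb", allButCaret)); // aaa^bbb
    }
}
```

The point of the toy is only the direction of the set: characters inside it are folded, characters outside it pass through, which is why the pattern is written as a negation.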
