lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: ICUFoldingFilter
Date Tue, 05 Jun 2018 14:40:53 GMT
That's good to know. If we go this route, we'll definitely either use the
factory, or follow its example. Thanks again

-Mike

On Mon, Jun 4, 2018 at 9:12 PM, Robert Muir <rcmuir@gmail.com> wrote:

> There may be a trap, e.g. if you make such a filter with UnicodeSet,
> I think you really need to call .freeze() before passing it to this
> thing. I have not examined the sources in a while but I think this
> might be similar to "compiling a regexp" in that you'll then get good
> performance when it's later called millions of times.
>
> If you use the factories, it will do this for you. But if you use the
> API directly it is currently a bit of a performance trap...
>
> On Mon, Jun 4, 2018 at 2:49 PM, Michael Sokolov <msokolov@gmail.com> wrote:
> > Ah thanks! That's very good to know. As it is, I realized we already have
> > an earlier component where we can handle this (we have a custom
> > ICUTokenizer rbbi and can just split on "^"). So much flexibility!
> >
> > -Mike
> >
> > On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> Actually, you can now choose to ignore certain characters by using the
> >> Unicode filtering mechanism.
> >>
> >> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
> >>
> >> So apply a filter such as [^\^] and the filter will ignore ^.
> >>
> >> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <rcmuir@gmail.com> wrote:
> >> > This cannot be "tweaked" at runtime; it is implemented as custom
> >> > normalization.
> >> >
> >> > You can modify the sources / build your own ruleset or use a different
> >> > tokenfilter to normalize characters.
> >> >
> >> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> >> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly
> >> >> what I want. However, there are some behaviors I'd like to tweak. For
> >> >> example, it maps "aaa^bbb" to "aaabbb". I am trying to understand why it
> >> >> does that, and whether there is any way to prevent it.
> >> >>
> >> >> I spent a little time with
> >> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I
> >> >> guess is the basis for what this filter does (it's referenced in the
> >> >> javadocs), but that didn't answer my questions. As an aside, it seems
> >> >> this tech report was withdrawn by the Unicode Consortium? Not sure what
> >> >> that means, if anything, but it seems ominous.
> >> >>
> >> >> Anyway, I would appreciate pointers to more info, and specifically,
> >> >> whether there are any alternatives to the utr30.nrm data file, or any
> >> >> possibility to select among the many transformations this filter applies.
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Mike S
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
>
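
The two ideas in this thread (fold only the characters a filter such as [^\^] accepts, and prepare the filter once up front rather than per call) can be illustrated with a small JDK-only sketch. This is not the ICUFoldingFilter implementation and it does not use ICU4J or Lucene: the class FilteredFoldingSketch and the methods toyFold and filteredFold are hypothetical names, the folding is a toy stand-in that only mimics the "aaa^bbb" to "aaabbb" behavior reported above, and a java.util.regex character class is assumed here as a rough analogue of a UnicodeSet pattern. The precompiled Pattern plays the role that a frozen UnicodeSet would play in the real API.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

public class FilteredFoldingSketch {
    // Compiled once and reused for every call, in the spirit of freezing a
    // UnicodeSet before handing it to the normalizer. Assumption: a regex
    // character class stands in for the UnicodeSet pattern [^\^].
    private static final Pattern FILTER = Pattern.compile("[^\\^]");
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    // Toy stand-in for folding (NOT the utr30 rules): decompose, drop
    // combining marks, drop '^', and lowercase.
    static String toyFold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        String stripped = COMBINING_MARKS.matcher(decomposed).replaceAll("");
        return stripped.replace("^", "").toLowerCase(Locale.ROOT);
    }

    // Fold only runs of code points accepted by FILTER; code points the
    // filter rejects are copied through unchanged.
    static String filteredFold(String s) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (FILTER.matcher(ch).matches()) {
                run.append(ch);               // inside the filter: fold later
            } else {
                out.append(toyFold(run.toString()));
                run.setLength(0);
                out.append(ch);               // outside the filter: keep as-is
            }
        });
        out.append(toyFold(run.toString()));  // flush the final run
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toyFold("aaa^bbb"));            // aaabbb
        System.out.println(filteredFold("aaa^bbb"));       // aaa^bbb
        System.out.println(filteredFold("Caf\u00e9^Bar")); // cafe^bar
    }
}
```

With the filter in place, the text on either side of "^" is still folded while "^" itself survives. When using the real API directly rather than the factory, the advice earlier in the thread still applies: call .freeze() on the UnicodeSet before passing it along.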
