lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: WordDelimiterGraphFilter does not respect KeywordAttribute
Date Sat, 21 Apr 2018 19:56:00 GMT
+1

Mike

On Fri, Apr 20, 2018, 9:42 AM Michael Sokolov <msokolov@gmail.com> wrote:

> I have a use case that generates some tokens containing punctuation
> (fractions and other numerical constructs), but I am handling most
> punctuation with WordDelimiterGraphFilter (WDGF), which then decomposes
> those tokens into parts and re-composes them, so e.g. 1/2 becomes
> {1, 2, 12}. I thought at first that I could mark those tokens as keywords
> to prevent any further analysis, but I discovered WDGF ignores that.
>
> I have a workaround using Arabic-Indic digits as separators instead of
> punctuation (1/2 -> 1١2). These are classified as digits, so WDGF does
> not split on them. Someday, though, I would like to support Arabic (or
> Hindi) language numbers as well, and then this hack will bite me.
>
> Does it seem reasonable to update WDGF (and its cousin WDF) to respect
> KeywordAttribute? I think it can be done with a very small change.
>

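A minimal sketch of the behavior discussed above, written in plain Java with no Lucene dependency. It is a simplified model, not WDGF's actual implementation: the class `KeywordAwareSplitSketch` and its methods are hypothetical names, the splitter only breaks on characters that are not letters or digits, and the `keyword` flag stands in for what a real filter would read from `KeywordAttribute.isKeyword()`. It shows the split-and-catenate output for 1/2, the proposed pass-through for keyword tokens, and why the digit-separator workaround avoids a split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the behavior discussed in this thread; NOT Lucene code.
class KeywordAwareSplitSketch {

    // Split on every character that is not a letter or digit, then append
    // the catenation of all parts when there was more than one part.
    static List<String> splitAndCatenate(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A delimiter closes the part collected so far.
                parts.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        if (parts.size() > 1) {
            String catenated = String.join("", parts);
            parts.add(catenated);
        }
        return parts;
    }

    // Proposed behavior: a token flagged as a keyword passes through untouched.
    static List<String> analyze(String text, boolean keyword) {
        if (keyword) {
            return List.of(text);
        }
        return splitAndCatenate(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("1/2", false)); // prints [1, 2, 12]
        System.out.println(analyze("1/2", true));  // prints [1/2]
        // U+0661 (ARABIC-INDIC DIGIT ONE) counts as a digit, so no split occurs.
        System.out.println(analyze("1\u06612", false).size()); // prints 1
    }
}
```

In this model the third call returns the token unchanged because `Character.isLetterOrDigit` is true for U+0661, which mirrors the reason given for the workaround. The same property is why the workaround would conflict with real Arabic-language digits later on.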