lucene-dev mailing list archives

From: Michael McCandless <luc...@mikemccandless.com>
Subject: Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
Date: Tue, 14 Aug 2012 16:46:42 GMT
I had forgotten about this, but I agree it could also be used to
handle challenging tokenizations.

In general I think our Tokenizers should throw away as little
information as possible (or at least have options to do so).
Subsequent TokenFilters can always remove things ...

I agree there's a risk of junk getting into indices ... but setting
appropriate defaults should address this.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 14, 2012 at 12:26 PM, Steven A Rowe <sarowe@syr.edu> wrote:
> Another possibility that would increase customizability via exposing
> information we currently throw away, proposed by Mike McCandless on
> LUCENE-3940 [1] (though controversially [2]): in addition to
> tokenizing alpha/numeric char sequences, StandardTokenizer could also
> tokenize everything else.
>
> Then a NonAlphaNumericStopFilter could remove tokens with types other
> than <NUM> or <ALPHANUM>.
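>
> A minimal sketch of what that filter might look like, working much
> like the existing TypeTokenFilter (package and constructor details
> vary a bit across Lucene versions; note FilteringTokenFilter
> preserves position holes for removed tokens by default, so the
> hole-removal default discussed below would need extra handling):
>
>   import java.util.Set;
>   import org.apache.lucene.analysis.FilteringTokenFilter;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>
>   /** Keeps only tokens whose type is in a configurable set. */
>   public final class NonAlphaNumericStopFilter extends FilteringTokenFilter {
>     private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
>     private final Set<String> keepTypes;
>
>     public NonAlphaNumericStopFilter(TokenStream in, Set<String> keepTypes) {
>       super(in);
>       this.keepTypes = keepTypes;
>     }
>
>     @Override
>     protected boolean accept() {
>       // e.g. keepTypes = {"<ALPHANUM>", "<NUM>"} drops everything else
>       return keepTypes.contains(typeAtt.type());
>     }
>   }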
>
> As an alternative to NonAlphaNumericStopFilter, a separate
> WordDelimiterFilter-like filter could instead generate synonyms like
> "wi-fi" and "wifi" when it sees the token sequence ("wi"<ALPHANUM>,
> "-"<PUNCT>, "fi"<ALPHANUM>).
>
> Positions would need to be addressed.  I assume the default behavior
> would be to remove position holes when non-alphanumeric tokens are
> stopped.  (In fact, I can't think of any use case that would benefit
> from position holes for stopped non-alphanumeric tokens.)
>
> AFAICT, Robert Muir's objection to enabling this kind of thing [2] is
> that people would use such a tokenizer in default
> (don't-throw-anything-away) mode, and as a result, unwittingly put
> tons of junk tokens in their indexes.  Maybe this concern could be
> addressed by making the default behavior the same as it is today, and
> providing the don't-throw-anything-away behavior as a non-default
> option?  Standard*Analyzer* would then remain exactly as it is today,
> and wouldn't need to include a NonAlphaNumericStopFilter.
>
> Steve
>
> [1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>
>
> [2] Robert Muir's subsequent post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13244124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13244124>
>
>
> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Tuesday, August 14, 2012 1:27 AM
> To: dev@lucene.apache.org
> Subject: Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
>
> On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter
> <hossman_lucene@fucit.org> wrote:
>>
>> : >         http://unicode.org/reports/tr29/#Word_Boundaries
>> : >
>> : > ...I think it would be a good idea to add some new customization options
>> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
>> : > behavior based on the various "tailored improvement" notes...
>>
>>
>> : Use a CharFilter.
>>
>> can you elaborate on how you would suggest implementing these
>> "tailored improvements" using a CharFilter?
>
> Generally the easiest way is to replace your ambiguous character (such
> as your hyphen-minus) with what your domain-specific knowledge tells
> you it should be.
> If you are indexing a dictionary where this ambiguous hyphen-minus is
> being used to separate syllables, then replace it with \u2027
> (hyphenation point), and it won't trigger word boundaries.
>
> But it really depends on how you want your whole analysis process to
> work. E.g., in the above example, if you want to treat "foo-bar" as
> really equivalent to "foobar", or to treat U.S.A. as equivalent to
> USA, because that's how you want your search to work, then I would
> just replace with U+2060 (word joiner). Follow through with the
> NFKC_CF normalization filter in the ICU package, which will remove
> it, since its Unicode category is Format.
>
> So I think you can handle all of your cases there with a simple regex
> CharFilter, substituting the correct 'semantics' depending on how you
> ultimately want it to work, and then just applying NFKC_CF at the
> end.
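>
> A minimal sketch of that chain, using the current Analyzer API (the
> regex and the anonymous Analyzer are only illustrative):
>
>   import java.io.Reader;
>   import java.util.regex.Pattern;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.Tokenizer;
>   import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
>   import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
>   import org.apache.lucene.analysis.standard.StandardTokenizer;
>   import com.ibm.icu.text.Normalizer2;
>
>   Analyzer analyzer = new Analyzer() {
>     @Override
>     protected Reader initReader(String fieldName, Reader reader) {
>       // "foo-bar" -> "foo<U+2060>bar": U+2060 is Format, so UAX#29
>       // word break ignores it; no boundary is triggered at the hyphen
>       return new PatternReplaceCharFilter(
>           Pattern.compile("(?<=\\w)-(?=\\w)"), "\u2060", reader);
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName) {
>       Tokenizer tok = new StandardTokenizer();
>       // nfkc_cf then strips U+2060 back out of the token text
>       TokenStream ts = new ICUNormalizer2Filter(
>           tok, Normalizer2.getNFKCCasefoldInstance());
>       return new TokenStreamComponents(tok, ts);
>     }
>   };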
>
> As for the last example, there's no need for the tokenizer to be
> involved. We already have ElisionFilter for this, and the Italian and
> French analyzers use it to remove a default (but configurable) set of
> contractions. The Solr examples for these languages are set up with
> it, too.
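>
> E.g., a two-liner, reusing the default article set the French
> analyzer ships with ("tokenStream" stands in for whatever precedes it
> in your chain):
>
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.fr.FrenchAnalyzer;
>   import org.apache.lucene.analysis.util.ElisionFilter;
>
>   // "l'avion" comes out of StandardTokenizer as a single token;
>   // ElisionFilter strips the contraction: "l'avion" -> "avion"
>   TokenStream ts =
>       new ElisionFilter(tokenStream, FrenchAnalyzer.DEFAULT_ARTICLES);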
>
> If you really don't like these dead-simple approaches, then just use
> the tokenizer in the ICU package, which is more flexible than the
> JFlex implementation: it lets you supply custom grammars at runtime,
> can split by script, etc.
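>
> E.g., overriding the per-script break iterator (the rules loader is
> hypothetical, and the constructor flags vary by version):
>
>   import com.ibm.icu.lang.UScript;
>   import com.ibm.icu.text.BreakIterator;
>   import com.ibm.icu.text.RuleBasedBreakIterator;
>   import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
>   import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
>   import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;
>
>   // an RBBI word-break grammar supplied at runtime (hypothetical loader)
>   final String myLatinRules = loadRules("Latin.rbbi");
>   ICUTokenizerConfig config = new DefaultICUTokenizerConfig(true, true) {
>     @Override
>     public BreakIterator getBreakIterator(int script) {
>       return script == UScript.LATIN
>           ? new RuleBasedBreakIterator(myLatinRules)  // custom grammar
>           : super.getBreakIterator(script);
>     }
>   };
>   ICUTokenizer tokenizer = new ICUTokenizer(config);
>
> (Solr exposes the same thing via ICUTokenizerFactory's "rulefiles"
> attribute.)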
>
>
> --
> lucidworks.com
>
