lucene-dev mailing list archives

From Michael McCandless
Subject Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
Date Tue, 14 Aug 2012 16:46:42 GMT
I had forgotten about this but I agree it could also be used to handle
challenging tokenizations.

In general I think our Tokenizers should throw away as little
information as possible (at least have options to do so).  Subsequent
TokenFilters can always remove things ...

I agree there's a risk of junk getting into indices ... but setting
appropriate defaults should address this.

Mike McCandless

On Tue, Aug 14, 2012 at 12:26 PM, Steven A Rowe wrote:
> Another possibility that would increase customizability via exposing
> information we currently throw away, proposed by Mike McCandless on
> LUCENE-3940[1] (though controversially[2]): in addition to tokenizing
> alpha/numeric char sequences, StandardTokenizer could also tokenize
> everything else.
>
> Then a NonAlphaNumericStopFilter could remove tokens with types other
> than <NUM>
>
> As an alternative to NonAlphaNumericStopFilter, a separate
> WordDelimiterFilter-like filter could instead generate synonyms like
> "wi-fi" and "wifi" when it sees the token sequence ("wi"<ALPHANUM>,
> "-"<PUNCT>, "fi"<ALPHANUM>).
>
> Positions would need to be addressed.  I assume the default behavior
> would be to remove position holes when non-alphanumeric tokens are
> stopped.  (In fact, I can't think of any use case that would benefit
> from position holes for stopped non-alphanumeric tokens.)
>
> AFAICT, Robert Muir's objection to enabling this kind of thing[2] is
> that people would use such a tokenizer in default
> (don't-throw-anything-away) mode, and as a result, unwittingly put tons
> of junk tokens in their indexes.  Maybe this concern could be addressed
> by making the default behavior the same as it is today, and providing
> the don't-throw-anything-away behavior as a non-default option?
> Standard*Analyzer* would then remain exactly as it is today, and
> wouldn't need to include a NonAlphaNumericStopFilter.
>
> Steve
> [1] Mike McCandless's post on LUCENE-3940
> [2] Robert Muir's subsequent post on LUCENE-3940
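
The proposal quoted above (tokenize everything, then stop or join by
token type) can be modeled in plain Java without any Lucene classes.
The following is only a rough sketch: the class, the Token model, the
method names, the <SYNONYM> type label, and the keepHoles flag are all
hypothetical, and none of this is Lucene's TokenStream API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the idea; all names here are hypothetical, not Lucene API.
final class KeepEverythingSketch {

    // A token with its text, a type label, and a position increment.
    static final class Token {
        final String text;
        final String type;
        final int posInc;

        Token(String text, String type, int posInc) {
            this.text = text;
            this.type = type;
            this.posInc = posInc;
        }
    }

    // Emit alphanumeric runs AND punctuation runs; only whitespace is dropped.
    static List<Token> tokenizeAll(String input) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            String type;
            if (Character.isLetterOrDigit(c)) {
                boolean allDigits = true;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    allDigits &= Character.isDigit(input.charAt(i));
                    i++;
                }
                type = allDigits ? "<NUM>" : "<ALPHANUM>";
            } else {
                while (i < input.length()
                        && !Character.isLetterOrDigit(input.charAt(i))
                        && !Character.isWhitespace(input.charAt(i))) {
                    i++;
                }
                type = "<PUNCT>";
            }
            out.add(new Token(input.substring(start, i), type, 1));
        }
        return out;
    }

    // Drop <PUNCT> tokens. With keepHoles=false the positions close up;
    // with keepHoles=true the next kept token absorbs the skipped increments.
    static List<Token> stopNonAlphaNumeric(List<Token> in, boolean keepHoles) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (t.type.equals("<PUNCT>")) {
                skipped += t.posInc;
                continue;
            }
            out.add(new Token(t.text, t.type, keepHoles ? t.posInc + skipped : t.posInc));
            skipped = 0;
        }
        return out;
    }

    // For each (ALPHANUM, "-", ALPHANUM) sequence, stack a joined token
    // (posInc 0) on the first part, e.g. "wifi" on top of "wi".
    static List<Token> addJoined(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Token t = in.get(i);
            out.add(t);
            if (i + 2 < in.size()
                    && t.type.equals("<ALPHANUM>")
                    && in.get(i + 1).text.equals("-")
                    && in.get(i + 2).type.equals("<ALPHANUM>")) {
                out.add(new Token(t.text + in.get(i + 2).text, "<SYNONYM>", 0));
            }
        }
        return out;
    }
}
```

Running addJoined before stopNonAlphaNumeric matters in this model:
once the "-" token has been stopped, the joining step can no longer see
it, which is the information-loss argument made in the thread.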
> -----Original Message-----
> From: Robert Muir
> Sent: Tuesday, August 14, 2012 1:27 AM
> Subject: Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
> On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter wrote:
>> : >
>> : >
>> : > ...I think it would be a good idea to add some new customization options
>> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
>> : > behavior based on the various "tailored improvement" notes...
>> : Use a CharFilter.
>> Can you elaborate on how you would suggest implementing these "tailored
>> improvements" using a CharFilter?
> Generally the easiest way is to replace your ambiguous character (such
> as your hyphen-minus) with what your domain-specific knowledge tells
> you it should be.
> If you are indexing a dictionary where this ambiguous hyphen-minus is
> being used to separate syllables, then replace it with \u2027
> (hyphenation point), and it won't trigger word boundaries.
> But it really depends on how you want your whole analysis process to
> work. E.g., in the above example, if you want to treat "foo-bar" as
> really equivalent to foobar, or you want to treat U.S.A as equivalent
> to USA, because that's how you want your search to work, then I would
> just replace it with U+2060 (word joiner). Follow through with the
> NFKC_CF Unicode normalization filter in the ICU package, which will
> remove it, since it's a Format character.
> So I think you can handle all of your cases there with a simple regex
> charfilter, substituting the correct 'semantics' depending on
> ultimately how you want it to work, and then just apply nfkc_cf at the
> end.
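
The character-substitution approach described above can be tried out on
plain strings. This is a hedged sketch, not Lucene code: the class and
method names are hypothetical, the regexes only handle a hyphen-minus or
period between two letters, and dropFormatAndFold is a crude stand-in
for NFKC_CF (it only strips Format characters and lowercases; real
NFKC_Casefold does considerably more).

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; mimics a regex char filter followed by normalization.
final class CharMappingSketch {

    // A hyphen-minus or a period that sits between two letters.
    private static final Pattern INNER_HYPHEN = Pattern.compile("(?<=\\p{L})-(?=\\p{L})");
    private static final Pattern INNER_DOT = Pattern.compile("(?<=\\p{L})\\.(?=\\p{L})");
    // Unicode general category Cf (Format), which includes U+2060.
    private static final Pattern FORMAT_CHARS = Pattern.compile("\\p{Cf}");

    // Dictionary case: mark syllable breaks with U+2027 (hyphenation point).
    static String syllableHyphens(String s) {
        return INNER_HYPHEN.matcher(s).replaceAll("\u2027");
    }

    // Search case: glue the parts together with U+2060 (word joiner).
    static String joinWords(String s) {
        String t = INNER_HYPHEN.matcher(s).replaceAll("\u2060");
        return INNER_DOT.matcher(t).replaceAll("\u2060");
    }

    // Rough approximation of the final normalization step (assumption:
    // stripping Cf characters and lowercasing is enough for this demo).
    static String dropFormatAndFold(String s) {
        return FORMAT_CHARS.matcher(s).replaceAll("").toLowerCase(Locale.ROOT);
    }
}
```

In a real analysis chain the tokenizer would run between joinWords and
the normalization step; the sketch skips it and only shows what the text
looks like before and after.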
> As far as the last example, no need for the tokenizer to be involved.
> We already have ElisionFilter for this, and the Italian and French
> analyzers use it to remove a default (but configurable) set of
> contractions. The Solr example for these languages is set up with
> these, too.
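
What an elision filter does to a single token can be sketched in a few
lines of plain Java. This is an illustration only: the class name, the
method, and the article set below are assumptions made for the example,
not ElisionFilter's actual code or its default list.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an elision step applied to one token.
final class ElisionSketch {

    // Illustrative article set (an assumption, not Lucene's defaults).
    static final Set<String> ARTICLES = Set.of("l", "d", "j", "qu", "n", "s", "t", "m", "c");

    // Strip a leading elided article such as "l'" when it is in the set;
    // otherwise return the token unchanged.
    static String stripElision(String token, Set<String> articles) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\'' || c == '\u2019') {
                String prefix = token.substring(0, i).toLowerCase(Locale.ROOT);
                return articles.contains(prefix) ? token.substring(i + 1) : token;
            }
        }
        return token;
    }
}
```

Because the set is passed in as a parameter, the "default but
configurable" behavior mentioned above maps onto supplying a different
set per language.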
> If you really don't like these dead-simple approaches, then just use
> the tokenizer in the ICU package, which is more flexible than the
> JFlex implementation: it lets you supply custom grammars at runtime and
> can split by script, etc.
> --
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
