lucene-dev mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
Date Tue, 14 Aug 2012 16:26:19 GMT
Another possibility, proposed by Mike McCandless on LUCENE-3940[1]
(though controversially[2]), would increase customizability by exposing
information we currently throw away: in addition to tokenizing
alphanumeric character sequences, StandardTokenizer could also tokenize
everything else.

Then a NonAlphaNumericStopFilter could remove tokens with types other than <NUM> or
<ALPHANUM>.

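Sketched in code (a hypothetical filter, not existing code; Lucene's
existing TypeTokenFilter works along similar lines, and the
FilteringTokenFilter constructor signature varies across versions):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Keeps only tokens whose type is in the keep set, e.g. <ALPHANUM>/<NUM>. */
public final class NonAlphaNumericStopFilter extends FilteringTokenFilter {
  private final Set<String> keepTypes;
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public NonAlphaNumericStopFilter(TokenStream in, Set<String> keepTypes) {
    super(in);
    this.keepTypes = keepTypes;
  }

  @Override
  protected boolean accept() throws IOException {
    return keepTypes.contains(typeAtt.type()); // drop <PUNCT> and friends
  }
}
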
As an alternative to NonAlphaNumericStopFilter, a separate WordDelimiterFilter-like filter
could instead generate synonyms like "wi-fi" and "wifi" when it sees the token sequence ("wi"<ALPHANUM>,
"-"<PUNCT>, "fi"<ALPHANUM>).

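To make the expected output concrete (expressed with
BaseTokenStreamTestCase.assertTokenStreamContents from Lucene's test
framework; the joining filter itself is hypothetical):

// Input "wi-fi" is tokenized as wi<ALPHANUM>, -<PUNCT>, fi<ALPHANUM>;
// the filter joins the pieces and stacks "wifi" as a synonym.
assertTokenStreamContents(hypotheticalJoiningFilter,
    new String[] { "wi-fi", "wifi" },  // terms
    new int[]    { 0, 0 },             // start offsets
    new int[]    { 5, 5 },             // end offsets
    new int[]    { 1, 0 });            // posInc 0 = same position
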
Positions would need to be addressed.  I assume the default behavior would be to remove position
holes when non-alphanumeric tokens are stopped.  (In fact, I can't think of any use case that
would benefit from position holes for stopped non-alphanumeric tokens.)
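
For example (again with the test helper; "chain" is assumed to be the
tokenize-everything StandardTokenizer followed by the stop filter):
with holes removed, the surviving tokens stay adjacent, so a phrase
query for "wi fi" still matches:

// Analyzing "wi-fi" and stopping the <PUNCT> token:
assertTokenStreamContents(chain,
    new String[] { "wi", "fi" },
    new int[]    { 0, 3 },   // start offsets
    new int[]    { 2, 5 },   // end offsets
    new int[]    { 1, 1 });  // a kept hole would make the second posInc 2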

AFAICT, Robert Muir's objection to enabling this kind of thing[2] is that people would use
such a tokenizer in default (don't-throw-anything-away) mode, and as a result, unwittingly
put tons of junk tokens in their indexes.  Maybe this concern could be addressed by making
the default behavior the same as it is today, and providing the don't-throw-anything-away
behavior as a non-default option?  Standard*Analyzer* would then remain exactly as it is today,
and wouldn't need to include a NonAlphaNumericStopFilter.

Steve

[1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>

[2] Robert Muir's subsequent post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13244124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13244124>

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Tuesday, August 14, 2012 1:27 AM
To: dev@lucene.apache.org
Subject: Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:
>
> : >         http://unicode.org/reports/tr29/#Word_Boundaries
> : >
> : > ...I think it would be a good idea to add some new customization options
> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
> : > behavior based on the various "tailored improvement" notes...
>
>
> : Use a CharFilter.
>
> can you elaborate on how you would suggest implementing these "tailored
> improvements" using a CharFilter?

Generally the easiest way is to replace your ambiguous character (such
as your hyphen-minus) with what your domain-specific knowledge tells
you it should be. If you are indexing a dictionary where the ambiguous
hyphen-minus is being used to separate syllables, then replace it with
U+2027 (hyphenation point), and it won't trigger word boundaries.
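
For example, something like this (MappingCharFilter and
NormalizeCharMap are existing classes in analyzers-common; "reader" is
your input):

import java.io.Reader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

// Map U+002D hyphen-minus to U+2027 hyphenation point before the
// tokenizer runs, so it no longer triggers a word boundary.
NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add("-", "\u2027");
Reader filtered = new MappingCharFilter(builder.build(), reader);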

But it really depends on how you want your whole analysis process to
work. E.g., in the above example, if you want to treat "foo-bar" as
really equivalent to "foobar", or to treat "U.S.A" as equivalent to
"USA", because that's how you want your search to work, then I would
just replace the character with U+2060 (word joiner), and follow
through with the NFKC_CF Unicode normalization filter in the ICU
package, which will remove it, since it's Format.

So I think you can handle all of your cases there with a simple regex
charfilter, substituting the correct 'semantics' depending ultimately
on how you want it to work, and then just applying NFKC_CF at the
end.
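
Roughly, the whole chain (PatternReplaceCharFilter and
ICUNormalizer2Filter are existing classes; the regex pattern is the
domain-specific part, and the signatures shown are for recent Lucene):

import java.io.Reader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Glue "foo-bar" into one word with U+2060 word joiner.
    return new PatternReplaceCharFilter(
        Pattern.compile("(?<=\\w)-(?=\\w)"), "\u2060", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // The filter's default normalizer is NFKC_CF, which removes U+2060
    // (a default-ignorable Format character).
    return new TokenStreamComponents(source,
        new ICUNormalizer2Filter(source));
  }
};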

As for the last example, there's no need for the tokenizer to be
involved. We already have ElisionFilter for this, and the Italian and
French analyzers use it to remove a default (but configurable) set of
contractions. The Solr examples for these languages are set up with
it, too.
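
E.g. (ElisionFilter and FrenchAnalyzer.DEFAULT_ARTICLES exist as shown;
"tokens" is whatever precedes it in your chain):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.util.ElisionFilter;

// "l'avion" -> "avion": strips the configured elided articles.
TokenStream stream =
    new ElisionFilter(tokens, FrenchAnalyzer.DEFAULT_ARTICLES);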

If you really don't like these dead-simple approaches, then just use
the tokenizer in the ICU package, which is more flexible than the
JFlex implementation: it lets you supply custom grammars at runtime,
can split by script, etc.
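
For the custom-grammar route, a sketch (the RBBI rules string is
hypothetical, and the DefaultICUTokenizerConfig super-constructor
arguments vary by Lucene version):

import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class TailoredICUTokenizerConfig extends DefaultICUTokenizerConfig {
  private final BreakIterator latin;

  public TailoredICUTokenizerConfig(String latinRbbiRules) { // hypothetical grammar
    super(true, true); // cjkAsWords, myanmarAsWords
    this.latin = new RuleBasedBreakIterator(latinRbbiRules);
  }

  @Override
  public BreakIterator getBreakIterator(int script) {
    return script == UScript.LATIN
        ? (BreakIterator) latin.clone() // BreakIterators are stateful
        : super.getBreakIterator(script);
  }
}

Pass the config to ICUTokenizer's constructor and the custom rules
apply only to Latin-script runs; other scripts keep the defaults.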


-- 
lucidworks.com

