lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Re: Is the COMPANY rule in StandardTokenizer valid?
Date Thu, 04 Sep 2008 16:24:06 GMT
>> If I had to choose, this sounds reasonable.
Which of the two sounds reasonable: (1) or (2)?

On Thu, Sep 4, 2008 at 3:47 PM, Grant Ingersoll <gsingers@apache.org> wrote:

>
> On Sep 4, 2008, at 2:43 AM, Shai Erera wrote:
>
>> Hi
>>
>> The COMPANY rule in StandardTokenizer is defined like this:
>>
>> // Company names like AT&T and Excite@Home.
>> COMPANY    =  {ALPHA} ("&"|"@") {ALPHA}
>>
>> While this works perfectly for AT&T and Excite@Home, it doesn't work well
>> for strings like widget&javascript&html. Now, the latter is obviously
>> wrongly typed, and should have been separated by spaces, but that's what a
>> user typed in a document, and now we need to treat it right (why don't they
>> understand the rules of IR and tokenization?). Normally I wouldn't care and
>> say this is one of the extreme cases, but unfortunately the tokenizer outputs
>> two tokens: widget&javascript and html. Now that bothers me - the user can
>> search for "html" and find the document, but not "javascript" or "widget",
>> which is a bit harder to explain to users, even the intelligent ones.
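
The behaviour described above can be reproduced with a minimal sketch. It is not the JFlex grammar itself: it assumes {ALPHA} is a run of ASCII letters, tries a COMPANY-shaped pattern before a plain word, and skips everything else. The class name CompanyRuleDemo and the method tokenize are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompanyRuleDemo {
    // Assumption: ALPHA is approximated as one or more ASCII letters.
    // The COMPANY-shaped alternative is tried first, then a plain word.
    // Characters matched by neither alternative are skipped.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z]+[&@][A-Za-z]+|[A-Za-z]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("AT&T"));                   // [AT&T]
        System.out.println(tokenize("widget&javascript&html")); // [widget&javascript, html]
    }
}
```

Under these assumptions, "javascript" and "widget" never appear as tokens of their own, which is why a search for either would miss the document.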
>>
>> That got me thinking on whether this rule is properly defined, and what's
>> the purpose of it. Obviously it's an attempt to not break legal company
>> names on "&" and "@", but I'm not sure it covers all company name formats.
>> For example, AT&T can be written as "AT & T" (with spaces) and I've also
>> seen cases where it's written as ATT.
>>
>> While you could say "it's a best effort case", users don't buy that.
>> Either you do something properly (doesn't have to be 100% accurate though),
>> or you don't do it at all (I hope that doesn't sound too harsh). That way
>> it's easy to explain to your users that you simply break on "&" or "@"
>> (unless it's an email). They may not like it, but you'll at least be
>> consistent.
>>
>
> I do think that is a bit harsh.  You can hardly expect the computer to be
> perfect when humans aren't either.  There are plenty of cases where two
> people won't agree on what is proper either.  This stuff is always a
> balancing act.
>
> I do, however, think this goes beyond COMPANY, and covers ACRONYM (to a
> lesser extent) and HOST as well (See also LUCENE-1373), and that we
> shouldn't be in the game of implying semantic meaning from
> StandardTokenizer/Filter altogether.  That is, my bigger concern is that
> the tokenizer labels things as COMPANY or ACRONYM or HOST at all, or better
> put, that users assume those types have any meaning outside of the fact that
> they are simple labels that are a bit easier to understand than TOKEN_TYPE_2
> or something like that.
>
>
>
>>
>> This rule slows StandardTokenizer's tokenization time, and in the end does
>> not produce consistent results. If we think it's important to detect these
>> tokens, then let's at least make it consistent by either:
>>
>> - changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing
>> "AT&T", and "widget&javascript&html" as COMPANY. That at least will allow
>> developers to put a CompanyTokenFilter (for example) after the tokenizer to
>> break on "&" and "@" whenever there are more than 2 parts. We could also
>> modify StandardFilter (which already handles ACRONYM) to handle COMPANY that
>> way.
>>
>> - changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so that
>> we recognize company names only if the pattern is followed by a space, dot,
>> dash, underscore, exclamation mark or question mark. That'll still recognize
>> AT&T, but won't recognize widget&javascript&html as COMPANY (which is
>> good).
>>
>
> If I had to choose, this sounds reasonable.
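
The two proposed rule changes can be compared side by side with a rough sketch. It is an approximation using java.util.regex rather than JFlex, and it rests on several assumptions: {ALPHA} is taken to be a run of ASCII letters, the delimiter set for the second rule follows the prose description (space, dot, dash, underscore, "!" or "?") rather than the grammar's actual {P} definition, and the delimiter is modelled as a lookahead so it is not consumed. The names CompanyRuleVariants, REPEATED, DELIMITED and matchAtStart are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompanyRuleVariants {
    // Proposal 1: {ALPHA} (("&"|"@") {ALPHA})+
    static final Pattern REPEATED =
        Pattern.compile("[A-Za-z]+([&@][A-Za-z]+)+");

    // Proposal 2: {ALPHA} ("&"|"@") {ALPHA} followed by a delimiter.
    // Assumption: the delimiter is checked with a lookahead and not consumed.
    static final Pattern DELIMITED =
        Pattern.compile("[A-Za-z]+[&@][A-Za-z]+(?=[ .\\-_!?])");

    // Returns the text the rule matches at the start of the input,
    // or null if the rule does not match there.
    static String matchAtStart(Pattern rule, String text) {
        Matcher m = rule.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchAtStart(REPEATED, "AT&T"));
        System.out.println(matchAtStart(REPEATED, "widget&javascript&html"));
        System.out.println(matchAtStart(DELIMITED, "AT&T is a company"));
        System.out.println(matchAtStart(DELIMITED, "widget&javascript&html"));
    }
}
```

In this sketch the first variant accepts the whole of "widget&javascript&html" as one match, leaving any splitting to a later filter, while the second variant rejects it at the start of the input because "javascript" is followed by "&" rather than a delimiter. The sketch only checks the start of the input; it does not model how a full tokenizer would handle the remainder of the string.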
>
