lucene-dev mailing list archives

From "Shai Erera" <>
Subject Re: Is the COMPANY rule in StandardTokenizer valid?
Date Thu, 04 Sep 2008 16:24:06 GMT
>> If I had to choose, this sounds reasonable.
Which of the two sounds reasonable: (1) or (2)?

On Thu, Sep 4, 2008 at 3:47 PM, Grant Ingersoll <> wrote:

> On Sep 4, 2008, at 2:43 AM, Shai Erera wrote:
>> Hi
>> The COMPANY rule in StandardTokenizer is defined like this:
>> // Company names like AT&T and Excite@Home.
>> COMPANY    =  {ALPHA} ("&"|"@") {ALPHA}
>> While this works perfectly for AT&T and Excite@Home, it doesn't work well
>> for strings like widget&javascript&html. Now, the latter is obviously
>> wrongly typed, and should have been separated by spaces, but that's what a
>> user typed in a document, and now we need to treat it right (why don't they
>> understand the rules of IR and tokenization?). Normally I wouldn't care and
>> say this is one of the extreme cases, but unfortunately the tokenizer outputs
>> two tokens: widget&javascript and html. Now that bothers me - the user can
>> search for "html" and find the document, but not "javascript" or "widget",
>> which is a bit harder to explain to users, even the intelligent ones.
>> That got me thinking about whether this rule is properly defined, and what's
>> the purpose of it. Obviously it's an attempt to not break legal company
>> names on "&" and "@", but I'm not sure it covers all company name formats.
>> For example, AT&T can be written as "AT & T" (with spaces) and I've also
>> seen cases where it's written as ATT.
>> While you could say "it's a best effort case", users don't buy that.
>> Either you do something properly (doesn't have to be 100% accurate though),
>> or you don't do it at all (I hope that doesn't sound too harsh). That way
>> it's easy to explain to your users that you simply break on "&" or "@"
>> (unless it's an email). They may not like it, but you'll at least be
>> consistent.
> I do think that is a bit harsh.  You can hardly expect the computer to be
> perfect when humans aren't either.  There are plenty of cases where two
> people won't agree on what is proper either.  This stuff is always a
> balancing act.
> I do, however, think this goes beyond COMPANY, and covers ACRONYM (to a
> lesser extent) and HOST as well (See also LUCENE-1373), and that we
> shouldn't be in the game of implying semantic meaning from
> StandardTokenizer/Filter altogether.  That is, my bigger concern is that
> the tokenizer labels things as COMPANY or ACRONYM or HOST at all, or better
> put, that users assume those types have any meaning outside of the fact that
> they are simple labels that are a bit easier to understand than TOKEN_TYPE_2
> or something like that.
>> This rule slows StandardTokenizer's tokenization time, and eventually does
>> not produce consistent results. If we think it's important to detect these
>> tokens, then let's at least make it consistent by either:
>> - changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing
>> "AT&T", and "widget&javascript&html" as COMPANY. That at least will allow
>> developers to put a CompanyTokenFilter (for example) after the tokenizer to
>> break on "&" and "@" whenever there are more than 2 parts. We could also
>> modify StandardFilter (which already handles ACRONYM) to handle COMPANY that
>> way.
>> - changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so that
>> we recognize company names only if the pattern is followed by a space, dot,
>> dash, underscore, exclamation mark or question mark. That'll still recognize
>> AT&T, but won't recognize widget&javascript&html as COMPANY (which is
> If I had to choose, this sounds reasonable.
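
For illustration only, here is a minimal Java sketch of the behaviour discussed above. It is not the JFlex grammar or any Lucene code: ALPHA is simplified to ASCII letters, the tokenizer is a hand-rolled regex loop, and the names CompanyRuleSketch, tokenize and splitLongCompanies are hypothetical. It compares the current COMPANY rule with option (1), followed by a filter step in the spirit of the suggested CompanyTokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompanyRuleSketch {
    // Assumption: ALPHA is reduced to ASCII letters; the real grammar is broader.
    static final String ALPHA = "[A-Za-z]+";

    // Current rule: {ALPHA} ("&"|"@") {ALPHA}
    static final Pattern COMPANY_CURRENT =
            Pattern.compile(ALPHA + "[&@]" + ALPHA);

    // Option (1): {ALPHA} (("&"|"@") {ALPHA})+
    static final Pattern COMPANY_REPEATED =
            Pattern.compile(ALPHA + "(?:[&@]" + ALPHA + ")+");

    static final Pattern WORD = Pattern.compile(ALPHA);

    // Try the COMPANY pattern first at each position, then a plain word;
    // any other character is skipped. A COMPANY match is always longer than
    // the plain word starting at the same position, so this mimics the
    // longest-match choice for these two rules.
    static List<String> tokenize(String text, Pattern company) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            Matcher m = company.matcher(text).region(pos, text.length());
            if (!m.lookingAt()) {
                m = WORD.matcher(text).region(pos, text.length());
                if (!m.lookingAt()) {
                    pos++; // separator or other character: skip it
                    continue;
                }
            }
            tokens.add(m.group());
            pos = m.end();
        }
        return tokens;
    }

    // Hypothetical filter step: break a token on "&" and "@" only when it
    // has more than two parts, so "AT&T" survives intact.
    static List<String> splitLongCompanies(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String[] parts = token.split("[&@]");
            if (parts.length > 2) {
                out.addAll(Arrays.asList(parts));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "widget&javascript&html";
        // Current rule: prints [widget&javascript, html]
        System.out.println(tokenize(input, COMPANY_CURRENT));
        // Option (1): prints [widget&javascript&html]
        System.out.println(tokenize(input, COMPANY_REPEATED));
        // Option (1) plus the filter: prints [widget, javascript, html]
        System.out.println(splitLongCompanies(tokenize(input, COMPANY_REPEATED)));
        // "AT&T" is untouched: prints [AT&T]
        System.out.println(splitLongCompanies(tokenize("AT&T", COMPANY_REPEATED)));
    }
}
```

Under these assumptions, the current rule reproduces the split described in the original mail (widget&javascript and html), while option (1) keeps the whole string together so that a later filter can decide how to break it.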