lucene-dev mailing list archives

From "Shai Erera" <>
Subject Re: Is the COMPANY rule in StandardTokenizer valid?
Date Fri, 12 Sep 2008 04:35:02 GMT
So I've been thinking about this more, and I can't seem to reach any
reasonable conclusion other than removing that rule. I'll explain: COMPANY
identifies AT&T and Excite@Home, but it also identifies R&D, AD&D and Q&A,
none of which is really a COMPANY. So there's a semantic error in the name
of the rule (I know we shouldn't read the names too strictly, but still).
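
To make the point concrete, here is a minimal sketch that checks those strings against a regex approximation of the COMPANY rule. It assumes ALPHA can be approximated as one or more ASCII letters; the class and method names are hypothetical and are not part of Lucene.

```java
import java.util.regex.Pattern;

final class CompanyRuleCheck {
    // Approximation of COMPANY = {ALPHA} ("&"|"@") {ALPHA}.
    // Assumption: ALPHA is treated as one or more ASCII letters.
    private static final Pattern COMPANY =
            Pattern.compile("[A-Za-z]+[&@][A-Za-z]+");

    /** Returns true if the whole string has the COMPANY shape. */
    static boolean looksLikeCompany(String s) {
        return COMPANY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"AT&T", "Excite@Home", "R&D", "AD&D", "Q&A"};
        for (String s : samples) {
            // Every sample has the COMPANY shape, company or not.
            System.out.println(s + " -> " + looksLikeCompany(s));
        }
    }
}
```

All five samples match, which is why the label says little about whether the token names a company.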

Don't you think those are really ACRONYMs? After all AT&T is an acronym
originated from American Telephone &
R&D is Research and Development and so on. If we identify them as ACRONYMs,
it makes sense to even keep them that way, without removing the "&" character
(so that AD&D does not get converted to ADD, for example).

As for Excite@Home, or Barnes@Noble, I see no harm in breaking those into
Excite and Home, and Barnes and Noble. People can still search for the
phrase, but they can also search for Excite and get results. Also, if you
misspell Barnes as Barns and you use a SnowballAnalyzer, you'll still get
results. Of course, if Barnes is stemmed to barn, you'll also get results
that are irrelevant to your query, but that's the deal with stemming. We
can also rely on the scoring algorithms to score documents that match the
queries "Barnes&Noble", "Barnes and Noble" and "Barnes Noble" higher than
documents that contain only one of the two words. So in effect, by splitting
barnes&noble into two words, we raise recall, and even though it might
seem like we hurt precision, I don't think that will be affected by much in
the first 10 results (assuming you have enough documents related to barnes &

Anyway, my proposal is like this:
COMPANY - remove
ACRONYM - add {LETTER}{1,2} "&" {LETTER}. That will identify "AT&T" and

It will also break A&B&C to A&B and C, but I'm not familiar with any such

What do you think? We can make this change in 3.0 so that we don't have to
deal with backward compatibility issues, assuming of course that everybody
agrees to make the change.
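
A rough sketch of how the proposed rule could behave, written as a tiny longest-match scanner in plain Java rather than JFlex. It assumes LETTER is a single ASCII letter and uses a simplified fallback word rule; both patterns and all names here are illustrative assumptions, and the real grammar has more rules (EMAIL, HOST, NUM and so on) that are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProposedAcronymSketch {
    // Proposed addition: {LETTER}{1,2} "&" {LETTER}.
    // Assumption: LETTER is approximated as one ASCII letter.
    private static final Pattern AMP_ACRONYM =
            Pattern.compile("[A-Za-z]{1,2}&[A-Za-z]");
    // Assumption: simplified stand-in for the tokenizer's plain word rule.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    /** Scans left to right, taking the longest rule match at each position. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int best = Math.max(matchLength(AMP_ACRONYM, text, pos),
                                matchLength(WORD, text, pos));
            if (best == 0) {
                pos++; // no rule matches here, so treat the character as a separator
                continue;
            }
            tokens.add(text.substring(pos, pos + best));
            pos += best;
        }
        return tokens;
    }

    /** Length of the match of p starting exactly at pos, or 0 if none. */
    private static int matchLength(Pattern p, String text, int pos) {
        Matcher m = p.matcher(text).region(pos, text.length());
        return m.lookingAt() ? m.end() - pos : 0;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("AT&T"));        // [AT&T]
        System.out.println(tokenize("AD&D"));        // [AD&D]
        System.out.println(tokenize("A&B&C"));       // [A&B, C]
        System.out.println(tokenize("Excite@Home")); // [Excite, Home]
    }
}
```

Under these assumptions, short "&" acronyms stay whole while longer names such as Excite@Home fall back to separate words.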

On Thu, Sep 4, 2008 at 7:48 PM, Grant Ingersoll <> wrote:

> Sorry, 2.  I realized after I sent it that my last sentence in the reply
> was ambiguous.
> On Sep 4, 2008, at 12:24 PM, Shai Erera wrote:
> >> If I had to choose, this sounds reasonable.
> Which of the two sounds reasonable: (1) or (2)?
> On Thu, Sep 4, 2008 at 3:47 PM, Grant Ingersoll <> wrote:
>> On Sep 4, 2008, at 2:43 AM, Shai Erera wrote:
>>  Hi
>>> The COMPANY rule in StandardTokenizer is defined like this:
>>> // Company names like AT&T and Excite@Home.
>>> COMPANY    =  {ALPHA} ("&"|"@") {ALPHA}
>>> While this works perfectly for AT&T and Excite@Home, it doesn't work well
>>> for strings like widget&javascript&html. Now, the latter is obviously
>>> wrongly typed, and should have been separated by spaces, but that's what a
>>> user typed in a document, and now we need to treat it right (why don't they
>>> understand the rules of IR and tokenization?). Normally I wouldn't care and
>>> say this is one of the extreme cases, but unfortunately the tokenizer outputs
>>> two tokens: widget&javascript and html. Now that bothers me - the user can
>>> search for "html" and find the document, but not "javascript" or "widget",
>>> which is a bit harder to explain to users, even the intelligent ones.
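
The split described above can be reproduced with a small regex approximation. This is only a sketch: it assumes ALPHA is one or more ASCII letters, uses a simplified word rule as the fallback, and relies on ordered alternation to stand in for the tokenizer's rule matching. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CurrentCompanySplit {
    // First alternative approximates COMPANY = {ALPHA} ("&"|"@") {ALPHA};
    // second alternative is an assumed, simplified plain-word rule.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+[&@][A-Za-z]+|[A-Za-z0-9]+");

    /** Collects successive matches; unmatched characters act as separators. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first two parts are glued together, so neither "widget"
        // nor "javascript" is a token on its own.
        System.out.println(tokenize("widget&javascript&html")); // [widget&javascript, html]
    }
}
```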
>>> That got me thinking on whether this rule is properly defined, and what's
>>> the purpose of it. Obviously it's an attempt to not break legal company
>>> names on "&" and "@", but I'm not sure it covers all company name formats.
>>> For example, AT&T can be written as "AT & T" (with spaces) and I've also
>>> seen cases where it's written as ATT.
>>> While you could say "it's a best effort case", users don't buy that.
>>> Either you do something properly (doesn't have to be 100% accurate though),
>>> or you don't do it at all (I hope that doesn't sound too harsh). That way
>>> it's easy to explain to your users that you simply break on "&" or "@"
>>> (unless it's an email). They may not like it, but you'll at least be
>>> consistent.
>> I do think that is a bit harsh.  You can hardly expect the computer to be
>> perfect when humans aren't either.  There are plenty of cases where two
>> people won't agree on what is proper either.  This stuff is always a
>> balancing act.
>> I do, however, think this goes beyond COMPANY, and covers ACRONYM (to a
>> lesser extent) and HOST as well (See also LUCENE-1373), and that we
>> shouldn't be in the game of implying semantic meaning from
>> StandardTokenizer/Filter altogether.  That is, my bigger concern is that
>> the tokenizer labels things as COMPANY or ACRONYM or HOST at all, or better
>> put, that users assume those types have any meaning outside of the fact that
>> they are simple labels that are a bit easier to understand than TOKEN_TYPE_2
>> or something like that.
>>> This rule slows StandardTokenizer's tokenization time, and eventually
>>> does not produce consistent results. If we think it's important to detect
>>> these tokens, then let's at least make it consistent by either:
>>> - changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing
>>> "AT&T", and "widget&javascript&html" as COMPANY. That at least will allow
>>> developers to put a CompanyTokenFilter (for example) after the tokenizer to
>>> break on "&" and "@" whenever there are more than 2 parts. We could also
>>> modify StandardFilter (which already handles ACRONYM) to handle COMPANY that
>>> way.
>>> - changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so
>>> that we recognize company names only if the pattern is followed by a space,
>>> dot, dash, underscore, exclamation mark or question mark. That'll still
>>> recognize AT&T, but won't recognize widget&javascript&html as COMPANY (which
>>> is good).
>> If I had to choose, this sounds reasonable.
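
As an illustration of the first alternative in the quoted list, here is a minimal sketch of the repeated rule together with the kind of post-processing a CompanyTokenFilter might do. It is plain Java, not a Lucene TokenFilter; ALPHA is assumed to be one or more ASCII letters, and the helper name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class MultiPartCompanySketch {
    // Approximation of {ALPHA} (("&"|"@") {ALPHA})+.
    // Assumption: ALPHA is treated as one or more ASCII letters.
    private static final Pattern COMPANY_MULTI =
            Pattern.compile("[A-Za-z]+(?:[&@][A-Za-z]+)+");

    /**
     * Keeps a two-part COMPANY token whole and breaks tokens with
     * more than two parts on "&" and "@".
     */
    static List<String> splitLongCompany(String token) {
        if (!COMPANY_MULTI.matcher(token).matches()) {
            return Collections.singletonList(token); // not COMPANY-shaped
        }
        String[] parts = token.split("[&@]");
        return parts.length > 2
                ? Arrays.asList(parts)
                : Collections.singletonList(token);
    }

    public static void main(String[] args) {
        System.out.println(splitLongCompany("AT&T"));                   // [AT&T]
        System.out.println(splitLongCompany("widget&javascript&html")); // [widget, javascript, html]
    }
}
```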