lucene-dev mailing list archives

From "Shai Erera"
Subject Is the COMPANY rule in StandardTokenizer valid?
Date Thu, 04 Sep 2008 06:43:07 GMT

The COMPANY rule in StandardTokenizer is defined like this:

// Company names like AT&T and Excite@Home.
COMPANY    =  {ALPHA} ("&"|"@") {ALPHA}

While this works perfectly for AT&T and Excite@Home, it doesn't work well for
strings like widget&javascript&html. The latter is obviously mistyped and
should have been separated by spaces, but that's what a user typed in a
document, and now we need to treat it right (why don't they understand the
rules of IR and tokenization?). Normally I wouldn't care and would say this
is one of the extreme cases, but unfortunately the tokenizer outputs two
tokens: widget&javascript and html. That bothers me: the user can search for
"html" and find the document, but not for "javascript" or "widget", which is
a bit harder to explain to users, even the intelligent ones.
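
To make the behavior above easy to reproduce without building Lucene, here is
a rough sketch that emulates the COMPANY rule with a Python regex. It is not
the JFlex grammar itself: {ALPHA} is approximated as ASCII letters, every other
StandardTokenizer rule is ignored, and the name tokenize_current is made up
for illustration.

```python
import re

# Assumed approximation of the grammar: COMPANY is tried first, and a plain
# run of letters is the fallback. {ALPHA} is narrowed to ASCII letters here.
CURRENT_RULES = re.compile(r"[A-Za-z]+[&@][A-Za-z]+|[A-Za-z]+")


def tokenize_current(text):
    """Return the token strings a COMPANY-then-word matcher would emit."""
    return CURRENT_RULES.findall(text)


print(tokenize_current("AT&T"))
print(tokenize_current("Excite@Home"))
print(tokenize_current("widget&javascript&html"))
```

The last call yields widget&javascript and html as two separate tokens, so
neither "widget" nor "javascript" is indexed as a term of its own.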

That got me thinking about whether this rule is properly defined, and what
its purpose is. Obviously it's an attempt not to break legal company names
on "&" and "@", but I'm not sure it covers all company name formats.
For example, AT&T can be written as "AT & T" (with spaces) and I've also
seen cases where it's written as ATT.

While you could say "it's a best effort case", users don't buy that. Either
you do something properly (doesn't have to be 100% accurate though), or you
don't do it at all (I hope that doesn't sound too harsh). That way it's easy
to explain to your users that you simply break on "&" or "@" (unless it's an
email). They may not like it, but you'll at least be consistent.

This rule slows down StandardTokenizer's tokenization, and in the end it
does not produce consistent results. If we think it's important to detect
these tokens, then let's at least make it consistent by either:

- changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby recognizing
both "AT&T" and "widget&javascript&html" as COMPANY. That will at least allow
developers to put a CompanyTokenFilter (for example) after the tokenizer to
break on "&" and "@" whenever there are more than 2 parts. We could also
modify StandardFilter (which already handles ACRONYM) to handle COMPANY that

- changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?") so that
we recognize company names only if the pattern is followed by a space, dot,
dash, underscore, exclamation mark or question mark. That'll still recognize
AT&T, but won't recognize widget&javascript&html as COMPANY (which is good).
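
The first option could look roughly like the sketch below. It is a Python
regex emulation under the same assumptions as before ({ALPHA} as ASCII
letters, no other tokenizer rules), not Lucene code. The function
company_filter is a hypothetical stand-in for the CompanyTokenFilter idea,
and the (term, type) tuples are only an illustration of typed tokens.

```python
import re

# Proposed rule {ALPHA} (("&"|"@") {ALPHA})+, emulated with named groups so
# each match carries a token type. ALPHANUM is the assumed fallback type.
PROPOSED_RULES = re.compile(
    r"(?P<COMPANY>[A-Za-z]+(?:[&@][A-Za-z]+)+)|(?P<ALPHANUM>[A-Za-z]+)"
)


def tokenize_proposed(text):
    """Return (term, type) pairs under the proposed COMPANY rule."""
    return [(m.group(), m.lastgroup) for m in PROPOSED_RULES.finditer(text)]


def company_filter(tokens):
    """Split COMPANY tokens with more than 2 parts on '&' and '@'."""
    out = []
    for term, token_type in tokens:
        parts = re.split(r"[&@]", term)
        if token_type == "COMPANY" and len(parts) > 2:
            # Too many parts to be a company name: emit each part as a word.
            out.extend((part, "ALPHANUM") for part in parts)
        else:
            out.append((term, token_type))
    return out


print(company_filter(tokenize_proposed("AT&T sells widget&javascript&html")))
```

With this, AT&T stays a single COMPANY token, while widget, javascript and
html each become searchable on their own.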

What do you think?

