lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Is the COMPANY rule in StandardTokenizer valid?
Date Thu, 04 Sep 2008 16:48:52 GMT
Sorry, 2.  I realized after I sent it that my last sentence in the  
reply was ambiguous.

On Sep 4, 2008, at 12:24 PM, Shai Erera wrote:

> >> If I had to choose, this sounds reasonable.
> Which of the two sounds reasonable: (1) or (2)?
> On Thu, Sep 4, 2008 at 3:47 PM, Grant Ingersoll wrote:
> On Sep 4, 2008, at 2:43 AM, Shai Erera wrote:
> Hi
> The COMPANY rule in StandardTokenizer is defined like this:
> // Company names like AT&T and Excite@Home.
> COMPANY    =  {ALPHA} ("&"|"@") {ALPHA}
> While this works perfectly for AT&T and Excite@Home, it doesn't work  
> well for strings like widget&javascript&html. Now, the latter is  
> obviously wrongly typed, and should have been separated by spaces,  
> but that's what a user typed in a document, and now we need to treat  
> it right (why don't they understand the rules of IR and  
> tokenization?). Normally I wouldn't care and say this is one of the  
> extreme cases, but unfortunately the tokenizer outputs two tokens:  
> widget&javascript and html. Now that bothers me - the user can  
> search for "html" and find the document, but not "javascript" or  
> "widget", which is a bit harder to explain to users, even the  
> intelligent ones.
> That got me thinking about whether this rule is properly defined, and  
> what's the purpose of it. Obviously it's an attempt to not break  
> legal company names on "&" and "@", but I'm not sure it covers all  
> company name formats. For example, AT&T can be written as "AT &  
> T" (with spaces) and I've also seen cases where it's written as ATT.
> While you could say "it's a best effort case", users don't buy that.  
> Either you do something properly (doesn't have to be 100% accurate  
> though), or you don't do it at all (I hope that doesn't sound too  
> harsh). That way it's easy to explain to your users that you simply  
> break on "&" or "@" (unless it's an email). They may not like it,  
> but you'll at least be consistent.
> I do think that is a bit harsh.  You can hardly expect the computer  
> to be perfect when humans aren't either.  There are plenty of cases  
> where two people won't agree on what is proper either.  This stuff  
> is always a balancing act.
> I do, however, think this goes beyond COMPANY, and covers ACRONYM  
> (to a lesser extent) and HOST as well (See also LUCENE-1373), and  
> that we shouldn't be in the game of implying semantic meaning from  
> StandardTokenizer/Filter altogether.  That is, my bigger concern  
> is that the tokenizer labels things as COMPANY or ACRONYM or HOST at  
> all, or better put, that users assume those types have any meaning  
> outside of the fact that they are simple labels that are a bit  
> easier to understand than TOKEN_TYPE_2 or something like that.
> This rule slows StandardTokenizer's tokenization time, and  
> ultimately does not produce consistent results. If we think it's  
> important to detect these tokens, then let's at least make it  
> consistent by either:
> - changing the rule to {ALPHA} (("&"|"@") {ALPHA})+, thereby  
> recognizing "AT&T", and "widget&javascript&html" as COMPANY. That at  
> least will allow developers to put a CompanyTokenFilter (for  
> example) after the tokenizer to break on "&" and "@" whenever there  
> are more than 2 parts. We could also modify StandardFilter (which  
> already handles ACRONYM) to handle COMPANY that way.
> - changing the rule to {ALPHA} ("&"|"@") {ALPHA} ({P} | "!" | "?")  
> so that we recognize company names only if the pattern is followed  
> by a space, dot, dash, underscore, exclamation mark or question  
> mark. That'll still recognize AT&T, but won't recognize  
> widget&javascript&html as COMPANY (which is good).
> If I had to choose, this sounds reasonable.
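
For anyone following the thread, here is a rough sketch of the behaviour
being discussed. It is not the JFlex grammar itself, only an approximation
with Python regular expressions. Assumptions: ALPHA is reduced to ASCII
letters, COMPANY is tried before a plain word (which gives the same result
as JFlex's longest match for these inputs), and tokenize() and
split_company() are hypothetical helpers, not Lucene APIs. It covers the
current rule and the first proposal (the repeated form) only.

```python
import re

# Assumption: ALPHA approximated as ASCII letters; the real JFlex macro
# in StandardTokenizer covers a wider letter class.
ALPHA = r"[A-Za-z]+"

# Current rule from the thread: {ALPHA} ("&"|"@") {ALPHA}
COMPANY_CURRENT = rf"{ALPHA}[&@]{ALPHA}"
# First proposal from the thread: {ALPHA} (("&"|"@") {ALPHA})+
COMPANY_REPEATED = rf"{ALPHA}(?:[&@]{ALPHA})+"


def tokenize(text, company):
    """Hypothetical stand-in for the tokenizer: COMPANY first, then a plain word."""
    scanner = re.compile(rf"(?P<COMPANY>{company})|(?P<ALPHANUM>{ALPHA})")
    return [(m.group(), m.lastgroup) for m in scanner.finditer(text)]


def split_company(tokens):
    """Hypothetical downstream filter: split COMPANY tokens with more than 2 parts."""
    out = []
    for term, kind in tokens:
        parts = re.split(r"[&@]", term)
        if kind == "COMPANY" and len(parts) > 2:
            out.extend((part, "ALPHANUM") for part in parts)
        else:
            out.append((term, kind))
    return out


print(tokenize("widget&javascript&html", COMPANY_CURRENT))
# [('widget&javascript', 'COMPANY'), ('html', 'ALPHANUM')]
print(split_company(tokenize("widget&javascript&html", COMPANY_REPEATED)))
# [('widget', 'ALPHANUM'), ('javascript', 'ALPHANUM'), ('html', 'ALPHANUM')]
```

With the current rule, "javascript" and "widget" are never emitted as
separate terms, which matches the search behaviour described above. With
the repeated form, the whole string arrives as a single COMPANY token, so a
later filter can decide to split it while leaving two-part names such as
AT&T intact.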