lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject Re: Incomprehensible (to me) tokenizing behavior
Date Mon, 30 Dec 2002 18:42:25 GMT
Terry Steichen wrote:
> I tested StandardAnalyzer (which uses StandardTokenizer) by inputing the a set of strings
which produced the following results:
> "aa/bb/cc/dd" was tokenized into 4 terms: aa, bb, cc, dd
> "aa/bb/cc/d1" was tokenized into 3 terms: aa, bb, cc/d1 
> "aa/bb/c1/dd" was tokenized into 2 terms: aa, bb/c1/dd
> "aa/b1/cc/dd" was tokenized into 2 terms: aa/b1/cc, dd
> "a1/bb/cc/dd" was tokenized into 3 terms: a1/bb, cc, dd
> It seems that if the input string includes a numerical value, any first preceeding and/or
next following slash ('/') is treated as a character.  Otherwise the slash is apparently treated
as a token separator.
> I'm lost.  Assuming this is not a bug, could somebody explain the rhyme and reason to
this tokenizing logic?  

This is a heuristic that tries to index alphanumeric model and serial 
numbers as a single token, but not to index long hyphenated or slashed 
phrases as a single token.  It requires digits in at least every other 
slash or dash-delimted segment.

Perhaps it is misguided.  Can you suggest a better heuristic?

The challenge is to index things like "B-17", "F/A-18", "PS/2", 
"802.11a", "0-85152-629-2", etc. as single tokens, but not to index 
things like "once-famous-but-now-forgotten" or 
"red/orange/yellow/green/blue/indigo/violet" as single tokens.


To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message