lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: Why does the StandardTokenizer split hyphenated words?
Date Wed, 15 Dec 2004 20:43:12 GMT

On Dec 15, 2004, at 3:14 PM, Mike Snare wrote:

> In addition, why do we assume that a-1 is a "typical product name" but
> a-b isn't?
> I am in no way second-guessing or suggesting a change; it just doesn't
> make sense to me, and I'm trying to understand.  It is very likely, as
> is oft the case, that this is just one of those things one has to
> accept.

It is one of those things we have to accept... or in this case write 
our own analyzer.  Choosing an Analyzer is a very application-specific 
decision.  StandardAnalyzer is a general-purpose one, but quite 
insufficient in many cases.  The same goes for QueryParser.  We're 
lucky to have these kitchen-sink pieces in Lucene to get us going 
quickly, but digging deeper we often need custom solutions.
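To make the original question concrete: the rule behind "a-1 stays 
whole but a-b splits" is roughly "keep a hyphenated run together when 
one of its parts contains a digit, since that looks like a product 
name."  Here's a minimal plain-Java sketch of that heuristic.  This is 
NOT Lucene's actual tokenizer (which encodes the rule in its JFlex 
grammar); it's just an illustration of the behavior:

```java
import java.util.ArrayList;
import java.util.List;

public class HyphenRule {
    // Sketch of the heuristic: a hyphenated run is kept whole when
    // some part contains a digit (a "typical product name" like a-1);
    // otherwise it is split at the hyphens.
    static List<String> tokenize(String word) {
        List<String> tokens = new ArrayList<>();
        boolean hasDigit = word.chars().anyMatch(Character::isDigit);
        if (hasDigit) {
            tokens.add(word);                  // a-1 -> [a-1]
        } else {
            for (String part : word.split("-")) {
                if (!part.isEmpty()) {
                    tokens.add(part);          // a-b -> [a, b]
                }
            }
        }
        return tokens;
    }
}
```

So tokenize("a-1") yields [a-1] while tokenize("a-b") yields [a, b], 
which is exactly the asymmetry being asked about.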

I'm working on indexing the e-book of Lucene in Action.  I'll blog up 
the details of this in the near future as case-study material, but 
here's the short version...

I got the PDF file and ran pdftotext on it.  Many words are split 
across lines with a hyphen.  Often these pieces should be combined 
with the hyphen removed.  Sometimes, though, the words are meant to 
stay split.  The scenario is different from yours, because I want the 
hyphens gone: sometimes they are a separator, and sometimes they 
should simply be removed.  It depends.  I wrote a custom analyzer with 
several custom filters in the pipeline... hyphens are initially kept 
in the token stream, and a later filter joins the two surrounding 
tokens, looks the joined form up in an exception list, and either 
combines them or leaves them separate.  StandardAnalyzer would have 
wreaked havoc.
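The join-and-check step of that later filter can be sketched in plain 
Java over a list of tokens (the real thing would be a Lucene 
TokenFilter working on the token stream; the exception list here is a 
made-up example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DehyphenationFilter {
    // Hypothetical exception list: joined forms that are NOT real
    // words, so the two pieces should stay separate tokens.
    static final Set<String> NOT_A_WORD = Set.of("thefull");

    // Given tokens in which "-" was kept as its own token, examine
    // each word "-" word triple: if removing the hyphen yields an
    // acceptable word, emit the joined word; otherwise leave the two
    // pieces as separate tokens.
    static List<String> filter(List<String> in) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            if (i + 2 < in.size() && in.get(i + 1).equals("-")) {
                String joined = in.get(i) + in.get(i + 2);
                if (NOT_A_WORD.contains(joined)) {
                    out.add(in.get(i));        // keep the pieces apart
                    out.add(in.get(i + 2));
                } else {
                    out.add(joined);           // line-break hyphen removed
                }
                i += 2;                        // consumed word "-" word
            } else {
                out.add(in.get(i));
            }
        }
        return out;
    }
}
```

With that, ["analy", "-", "zer"] becomes ["analyzer"], while a pair 
whose joined form is on the exception list stays as two tokens.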

The results of my work will soon be available for all to poke at, but 
for now a screenshot is all I have made public:


