lucene-java-user mailing list archives

From "Joshua O'Madadhain" <>
Subject Re: Newbie quizzes further...
Date Tue, 03 Sep 2002 00:35:54 GMT
On Mon, 2 Sep 2002, Stone, Timothy wrote:

> I have noted that Lucene fails to interpret numerous HTML entities,
> specifically entities in the 82xx range, e.g. &#8212; (em dash) and
> many others. Now this may not be a Lucene issue, I'm looking at the
> code as I post, but I'm curious as to its origins and why they don't
> seem to be parsed properly in the index.

As I see it, there are two answers to this question.

(1) What gets parsed and indexed is your choice: the Lucene package
includes several Analyzers, each of which tokenizes text differently.
You could conceivably construct an Analyzer that would parse such
entities as you describe.
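
As a rough illustration of the idea, here is a minimal sketch in plain
Java (no Lucene classes involved) that decodes decimal numeric entities
such as &#8212; into the characters they stand for. The class name
EntityDecoder and the method decodeNumericEntities are made up for this
example; the assumption is that you would run something like this over
the text before handing it to whichever Analyzer you use, or wrap the
same logic inside a custom one. Named entities and hex references are
not handled here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: decodes decimal numeric HTML entities. */
final class EntityDecoder {
    // Matches decimal numeric character references such as &#8212;
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d{1,7});");

    static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text that precedes this entity.
            out.append(text, last, m.start());
            int codePoint = Integer.parseInt(m.group(1));
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Leave references outside the Unicode range untouched.
                out.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the last entity.
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Whether the decoded dash then ends up in the index still depends on the
Analyzer; many tokenizers will simply treat it as a word separator.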

(2) Historically, search engines have not indexed punctuation, for the
simple reason that it doesn't tend to add much to search precision and
it complicates the indexing process.  (On the other hand, if you're
talking about accents and non-English letters, I understand that some
people have written analyzers that cover these things; check out the
contrib section on the Lucene website.)
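
For the accent case, one common approach (not necessarily what the
contributed analyzers do) is to fold accented letters to their base
letters before indexing, so that a search for "cafe" also matches the
accented spelling. A minimal sketch using only the standard
java.text.Normalizer class follows; AccentFolder and foldAccents are
hypothetical names chosen for this example.

```java
import java.text.Normalizer;

/** Hypothetical helper, not part of Lucene: strips combining accent marks. */
final class AccentFolder {
    static String foldAccents(String text) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category M), keeping the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Letters that have no decomposition into a base letter plus mark (the
Danish o-slash, for example) pass through unchanged, so this is only a
partial answer for non-English text.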

  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
