lucene-java-user mailing list archives

From Robert Watkins
Subject Re: html parsers and numers of terms
Date Tue, 13 Dec 2005 16:36:38 GMT
Aha! I had, indeed, been fooled by Luke into thinking that the entities
had been converted upon analysis, but you have set me straight.

-- Robert

On Tue, 13 Dec 2005, J.J. Larrea wrote:

> Beware of HTML/XML entities in your input stream!  The Lucene analyzers
> (including StandardAnalyzer) do not interpret these representation-specific
> encodings, and assume the & and ; delimiters are punctuation.  How they deal
> with punctuation depends on the specific Analyzer logic.
> [ snipped ]
> PS: Also note that when using Luke to see what is indexed, it uses NCRs
> (e.g. &#233;) to display non-ASCII characters, so it is easy to be confused
> about whether the NCRs or the Unicode characters were indexed.
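
Since the analyzers treat & and ; as punctuation, one way around the problem is to decode entities before the text reaches the analyzer. What follows is only a minimal sketch of that idea, not anything from Lucene itself: the class name EntityDecoder, the decode method and the small NAMED table are all hypothetical, and a real HTML parser would cover the full set of named entities. It assumes Java 9 or later (Map.of, Matcher.appendReplacement with a StringBuilder).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes NCRs and a few named entities so the
// analyzer sees real Unicode characters instead of "&", "#233" and ";".
final class EntityDecoder {
    // Illustrative subset only; the real HTML entity list is much longer.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "eacute", "\u00e9");

    // Matches decimal NCRs (&#233;), hex NCRs (&#xE9;) and named entities (&eacute;).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = resolve(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character(s), or the original text if the
    // reference is unknown or not a valid code point.
    private static String resolve(String body, String original) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            }
            if (body.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            // Covers NumberFormatException (overflow) and invalid code points.
            return original;
        }
        return NAMED.getOrDefault(body, original);
    }
}
```

The decoded string would then be passed to the analyzer in place of the raw markup. Unknown or malformed references are left untouched, so they remain visible in the index rather than being silently dropped.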
