lucene-java-user mailing list archives

From "J.J. Larrea" <>
Subject Re: html parsers and numbers of terms
Date Tue, 13 Dec 2005 16:58:31 GMT
Glad that hint was useful.  I was totally bitten by that artifact myself.  It turns out that there
were XML numeric character references within VARCHAR fields in a database I was indexing,
so I never suspected that the NCRs I was seeing in Luke had anything to do with the non-XML
non-HTML (so I thought) source data.
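
For source data like this, one option is to decode numeric character references before the
text reaches the analyzer, so that the Unicode characters rather than the NCR digits get
tokenized. The following is only a minimal sketch under stated assumptions: the class name
NcrDecoder and its decode method are hypothetical (not part of Lucene), it uses only the Java
standard library, and it handles numeric references only, leaving named entities such as
&amp; untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: decodes numeric character
// references (NCRs) so an analyzer sees real characters instead of "&#233;".
public class NcrDecoder {
    // Group 1: hexadecimal form (&#xE9;). Group 2: decimal form (&#233;).
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    public static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(text, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Leave out-of-range references exactly as they were.
                out.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the final match.
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Decodes the two numeric references; "&amp;" is passed through as-is.
        System.out.println(decode("caf&#233; &#x263A; &amp;"));
    }
}
```

The decoded string would then be what gets passed to the field being indexed. Whether to
decode named entities as well depends on the source data and is not covered here.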

Also take care when fields are stored: it is quite easy to confuse the stored values, which
aren't analyzed, with the indexed tokens, which obviously are.  Asking Luke to reconstruct a
source document from its indexed tokens is a great way to see it from an "index-eye" view,
which can be very revealing.

- J.J.

At 11:36 AM -0500 12/13/05, Robert Watkins wrote:
>Aha! I had, indeed, been fooled by Luke into thinking that the entities
>had been converted upon analysis, but you have set me straight.
>-- Robert
>On Tue, 13 Dec 2005, J.J. Larrea wrote:
>>Beware of HTML/XML entities in your input stream!  The Lucene analyzers (including
>>StandardAnalyzer) do not interpret these representation-specific encodings, and assume the
>>& and ; delimiters are punctuation.  How they deal with punctuation depends on the specific
>>Analyzer logic.
>>[ snipped ]
>>PS: Also note that when using Luke to see what is indexed, it uses NCRs, e.g. &#233;,
>>to display non-ASCII characters, making it easy to be confused as to whether the NCRs
>>or the Unicode characters were indexed.