lucene-java-user mailing list archives

From "Terry Steichen" <>
Subject Re: Tags Screwing up Searches
Date Mon, 21 Oct 2002 21:22:18 GMT

I discovered that the text I was actually dealing with had already had each
'<' converted to '&lt;', and so forth.  So the problem is that with
something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene treats the
trailing semicolon ';' as a word separator, so it can find the term
'college', but it does not see the end of 'soccer'.  I did confirm that
it *will* match on 'soccer&lt;' just fine.

I've proceeded to add a string substitution method which replaces '&lt;'
with '    ' (four spaces, in order to hopefully keep the offsets straight).
It appears to work, though I believe it slows down the indexing.

I don't know enough about the inner design of Lucene to figure this out, but
it seems logical that there would be a much more efficient way to handle
this than string operations.
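
For reference, the substitution could be sketched roughly as below. This is
only a sketch under assumptions, not the code I am actually running: the class
name EntityMasker and the method mask are made up for illustration, and it
blanks every character reference ('&lt;', '&gt;', '&amp;', '&#60;', ...) in a
single pass rather than only '&lt;'. Each reference is overwritten with the
same number of spaces, so the character offsets of the remaining text stay
where they were.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): blanks out escaped markup
// before the text is handed to the analyzer.
public class EntityMasker {

    // Named or decimal character references, e.g. &lt; &gt; &amp; &#60;
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+);");

    // Replaces each reference with spaces of equal length, in one pass,
    // so the output has the same length and offsets as the input.
    public static String mask(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());      // copy text before the reference
            for (int i = m.start(); i < m.end(); i++) {
                out.append(' ');                    // one space per masked character
            }
            last = m.end();
        }
        out.append(text, last, text.length());      // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + mask("&lt;b&gt;College Soccer&lt;/b&gt;") + "]");
    }
}
```

Note that this leaves the tag names themselves ('b' and '/b' here) in the
text, so they would still be seen as ordinary words unless they are masked
as well.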

Anyway, thought I'd bring you up to date.



PS: I've had no responses from the list, so perhaps this is a unique problem
and doesn't justify a formal fix effort.

----- Original Message -----
From: "Terry Steichen" <>
To: "Lucene Users Group" <>
Sent: Friday, October 18, 2002 11:39 AM
Subject: Tags Screwing up Searches

Some content I'm indexing contains certain HTML tags, like <p>, <b>, <i>,
etc.  What I find is that when a term I'm searching for touches one of these
tags (which is fairly typical), the term isn't recognized and the search
fails.  For example, <b>College Soccer</b> doesn't match on either "college"
or "soccer".  I seem to recall someone else bringing up a similar problem with
a word that ends a sentence (and is thus treated as if the period was part
of the word), but don't recall what the response was and I can't find that

Does anyone have some ideas on what's the best way to handle this?  Filter
out the tags in the process of creating the Document for indexing? Or
through a modification to the Analyzer (I'm using the StandardAnalyzer)? Or
something else?
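
To make the first option concrete, here is a rough sketch of filtering the
tags out of the text before it goes into the Document. It is an illustration
under assumptions only: TagMasker and maskTags are made-up names, the regular
expression is a simple approximation of a tag rather than a real HTML parser,
and nothing here uses the Lucene API. Tags are overwritten with spaces of the
same length so that character offsets into the original text are preserved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-indexing filter (not part of Lucene).
public class TagMasker {

    // Rough approximation of an opening or closing tag: <b>, </b>, <p class="x">
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Overwrites each tag with an equal number of spaces.
    public static String maskTags(String text) {
        Matcher m = TAG.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());      // copy text before the tag
            for (int i = m.start(); i < m.end(); i++) {
                out.append(' ');                    // one space per tag character
            }
            last = m.end();
        }
        out.append(text, last, text.length());      // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + maskTags("<b>College Soccer</b>") + "]");
    }
}
```

The masked string would then be used as the field text in place of the raw
content, so the analyzer only ever sees "College Soccer" surrounded by
whitespace.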


