lucene-java-user mailing list archives

From "Joshua O'Madadhain" <jmad...@ics.uci.edu>
Subject Re: Tags Screwing up Searches
Date Mon, 21 Oct 2002 21:49:25 GMT
On Mon, 21 Oct 2002, Terry Steichen wrote:

> I discovered that the actual text that I was dealing with already had
> the '<' converted to '&lt;', and so forth.  So the problem
> is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;',
> Lucene recognizes the trailing semi-colon ';' as a word separator, so
> it can find the term 'college', but it does not see the ending of
> 'soccer'.  I did confirm that it *will* match on 'soccer&lt;' just
> fine.
> 
> I've proceeded to add a string substitution method which replaces
> '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets
> straight). It appears to work, though I believe it slows down the
> indexing.
> 
> I don't know enough about the inner design of Lucene to figure this
> out, but it seems logical that there would be a much more efficient
> way to handle this than string operations.
> 
> PS: I've had no responses from the list, so perhaps this is a unique
> problem and doesn't justify a formal fix effort.

A few questions and comments; please pardon me if I am asking questions
answered in previous email:

(1) Are you using an analyzer that is designed to handle (a) HTML, or
(b) plain text?

(2) If (b), that's probably why you've been getting this kind of behavior,
and you may want to look at the HTMLParser sample code in the
distribution.  The StandardAnalyzer, I'm pretty sure, is not designed to
handle HTML.

(3) A quick and dirty solution for indexing HTML if you are running on
some flavor of Unix and don't want to figure out how to parse HTML
tags: the text web browser "lynx".  lynx can 'dump' the text from a web
page out as follows:

lynx -dump -nolist foo.html > foo.txt

This effectively strips the HTML tags out of foo.html and writes the text
of the page to the file foo.txt.

Once you've done this, of course, you can use the same analyzers that you
use for any unformatted text file.
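
If you would rather stay inside Java than shell out to lynx, the
substitution described in the quoted message can be done in a single pass
before the text reaches the analyzer.  What follows is only a minimal
sketch and not part of Lucene: the class name MarkupBlanker and the method
blank() are made up for illustration.  It assumes that every
'&lt;...&gt;' pair in the input is an escaped tag, not a literal
less-than sign in running text, and it overwrites markup with the same
number of spaces so that character offsets into the original text stay
valid.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class.
class MarkupBlanker {
    // Alternatives, tried in order at each position:
    //   1. an escaped tag such as &lt;b&gt; or &lt;/b&gt;
    //      (assumption: &lt;...&gt; always encloses a tag)
    //   2. a literal tag such as <b> or </b>
    //   3. any remaining entity such as &amp; or &#39;
    private static final Pattern MARKUP =
        Pattern.compile("&lt;[^&]*&gt;|<[^>]*>|&#?[A-Za-z0-9]+;");

    /** Returns a copy of text, same length, with all markup matches replaced by spaces. */
    static String blank(String text) {
        StringBuilder out = new StringBuilder(text);
        Matcher m = MARKUP.matcher(text);
        while (m.find()) {
            // Overwrite in place so the positions of the surviving words do not move.
            for (int i = m.start(); i < m.end(); i++) {
                out.setCharAt(i, ' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String clean = blank(raw);
        System.out.println("[" + clean + "]");
        System.out.println("same length: " + (raw.length() == clean.length()));
    }
}
```

The cleaned string can then be handed to whatever plain-text analyzer you
already use.  Note that blanking an entity in the middle of a word splits
that word in two, which may or may not be what you want.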

Good luck--

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




