lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Terry Steichen" <te...@net-frame.com>
Subject Re: Tags Screwing up Searches
Date Mon, 21 Oct 2002 21:43:22 GMT
How should this be done (the translation, that is)?  If it were left as '<'
and '>', would Lucene parse it properly?

Terry

----- Original Message -----
From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, October 21, 2002 5:40 PM
Subject: Re: Tags Screwing up Searches


> Thanks for the update.
> This all sounds right (no bugs).  The problem is the code that you have
> that translates those < and > characters.
>
> Otis
>
> --- Terry Steichen <terry@net-frame.com> wrote:
> > Otis,
> >
> > I discovered that the actual text that I was dealing with already
> > converted
> > the '<' converted to '&lt;', and so forth.  So the problem is that
> > with
> > something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene recognizes
> > the
> > trailing semi-colon ';' as a word separator, so it can find the term
> > 'college', but it does not see the ending of 'soccer'.  I did confirm
> > that
> > it *will* match on 'soccer&lt;' just fine.
> >
> > I've proceeded to add a string substitution method which replaces
> > '&lt;'
> > with '    ' (four spaces, in order to hopefully keep the offsets
> > straight).
> > It appears to work, though I believe it slows down the indexing.
> >
> > I don't know enough about the inner design of Lucene to figure this
> > out, but
> > it seems logical that there would be a much more efficient way to
> > handle
> > this than string operations.
> >
> > Anyway, thought I'd bring you up to date.
> >
> > Regards,
> >
> > Terry
> >
> > PS: I've had no responses from the list, so perhaps this is a unique
> > problem
> > and doesn't justify a formal fix effort.
> >
> > ----- Original Message -----
> > From: "Terry Steichen" <terry@net-frame.com>
> > To: "Lucene Users Group" <lucene-user@jakarta.apache.org>
> > Sent: Friday, October 18, 2002 11:39 AM
> > Subject: Tags Screwing up Searches
> >
> >
> > Some content I'm indexing contains certain HTML tags, like <p>, <b>,
> > <i>,
> > etc.  What I find is that when a term I'm searching for touches one
> > of these
> > tags (which is fairly typical), the term isn't recognized and the
> > search
> > fails.  For example, <b>College Soccer</b> doesn't match on either
> > "college"
> > or "soccer".  I seem to recall someone else bring up a similar
> > problem with
> > a word that ends a sentence (and is thus treated as if the period was
> > part
> > of the word), but don't recall what the response was and I can't find
> > that
> > thread.
> >
> > Does anyone have some ideas on what's the best way to handle this?
> > Filter
> > out the tags in the process of creating the Document for indexing? Or
> > through a modification to the Analyzer (I'm using the
> > StandardAnalyzer)? Or
> > something else?
> >
> > TIA,
> >
> > Terry
> >
> >
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> > <mailto:lucene-user-help@jakarta.apache.org>
> >
>
>
> __________________________________________________
> Do you Yahoo!?
> Y! Web Hosting - Let the expert host your web site
> http://webhosting.yahoo.com/
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>
>


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message