lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Tags Screwing up Searches
Date Mon, 21 Oct 2002 21:40:51 GMT
Thanks for the update.
This all sounds right (no bugs).  The problem is the code that you have
that translates those < and > characters.

Otis

--- Terry Steichen <terry@net-frame.com> wrote:
> Otis,
> 
> I discovered that the text I was dealing with had already been
> converted: each '<' had become '&lt;', and so forth.  So the problem is
> that with something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene
> treats the trailing semicolon ';' as a word separator, so it can find
> the term 'college', but it does not see the end of 'soccer'.  I did
> confirm that it *will* match on 'soccer&lt;' just fine.
> 
> I've proceeded to add a string substitution method which replaces '&lt;'
> with '    ' (four spaces, in the hope of keeping the offsets straight).
> It appears to work, though I believe it slows down the indexing.
> 
> I don't know enough about the inner design of Lucene to figure this out,
> but it seems logical that there would be a much more efficient way to
> handle this than string operations.
> 
> Anyway, thought I'd bring you up to date.
> 
> Regards,
> 
> Terry
> 
> PS: I've had no responses from the list, so perhaps this is a unique
> problem and doesn't justify a formal fix effort.
> 
> ----- Original Message -----
> From: "Terry Steichen" <terry@net-frame.com>
> To: "Lucene Users Group" <lucene-user@jakarta.apache.org>
> Sent: Friday, October 18, 2002 11:39 AM
> Subject: Tags Screwing up Searches
> 
> 
> Some content I'm indexing contains certain HTML tags, like <p>, <b>, <i>,
> etc.  What I find is that when a term I'm searching for touches one of
> these tags (which is fairly typical), the term isn't recognized and the
> search fails.  For example, <b>College Soccer</b> doesn't match on either
> "college" or "soccer".  I seem to recall someone else bringing up a
> similar problem with a word that ends a sentence (and is thus treated as
> if the period were part of the word), but I don't recall what the
> response was and I can't find that thread.
> 
> Does anyone have some ideas on the best way to handle this?  Filter out
> the tags in the process of creating the Document for indexing?  Or through
> a modification to the Analyzer (I'm using the StandardAnalyzer)?  Or
> something else?
> 
> TIA,
> 
> Terry
> 
> 
> 
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> 
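The same-length substitution Terry describes could be sketched roughly as
follows. This is only an illustration under stated assumptions: the class
name EntityBlanker and the method blankEntities are hypothetical helpers,
not part of Lucene, and the sketch handles only the '&lt;' and '&gt;'
entities. It makes a single pass over a character array, which may be
cheaper than chaining several String replacements, and it overwrites each
entity with four spaces so that character offsets in the text do not move.

```java
class EntityBlanker {
    // Hypothetical helper (not a Lucene API): overwrite each "&lt;" or "&gt;"
    // with four spaces, so the result has the same length as the input and
    // every other character keeps its original offset.
    static String blankEntities(String text) {
        char[] chars = text.toCharArray();
        int i = 0;
        while (i < chars.length) {
            if (text.startsWith("&lt;", i) || text.startsWith("&gt;", i)) {
                // Both entities are exactly four characters long.
                for (int j = i; j < i + 4; j++) {
                    chars[j] = ' ';
                }
                i += 4;
            } else {
                i++;
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String cleaned = blankEntities(raw);
        // Brackets make the leading and trailing spaces visible.
        System.out.println("[" + cleaned + "]");
        System.out.println("same length: " + (raw.length() == cleaned.length()));
    }
}
```

The cleaned string would be passed to the indexing code in place of the raw
text. Note that the tag name between the entities (the 'b' here) is left in
place and would still be indexed as its own token unless it is blanked as
well.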



