lucene-java-user mailing list archives

From: Raul Raja Martinez
Subject: Re: Best practice for searching html
Date: Fri, 10 Mar 2006 00:20:11 GMT

Unfortunately I can't change the way things are indexed, so I guess I 
need some sort of utility class that will turn Martínez into 
Mart&iacute;nez, and then just search for that term.
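
A minimal sketch of such a utility class, assuming only a handful of Spanish accented characters need to be covered. The class name HtmlEntityEncoder and the contents of its table are hypothetical and not part of Lucene or commons-lang; a real table would have to list every entity that actually occurs in the indexed HTML.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: rewrites accented characters as HTML named entities. */
final class HtmlEntityEncoder {

    // Assumed subset of the Latin-1 entity table; extend as needed.
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('\u00e1', "aacute"); // a with acute accent
        NAMES.put('\u00e9', "eacute"); // e with acute accent
        NAMES.put('\u00ed', "iacute"); // i with acute accent
        NAMES.put('\u00f3', "oacute"); // o with acute accent
        NAMES.put('\u00fa', "uacute"); // u with acute accent
        NAMES.put('\u00f1', "ntilde"); // n with tilde
        NAMES.put('\u00fc', "uuml");   // u with diaeresis
    }

    /** Returns text with every mapped character replaced by its &name; form. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.append(c); // unmapped characters pass through unchanged
            }
        }
        return out.toString();
    }
}
```

The user's input would be passed through encode() before the query is built. Note that the encoded term still goes through whatever analyzer the query uses.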

I also have this problem with the StandardAnalyzer:

If I search for "c&aacute;diz" in Luke, the query gets parsed as 
"c&aacute diz": the ";" is replaced with a space. I would like to know 
which symbols cannot be used in a search and how to escape them.

What would be the right query to find "c&aacute;diz"?
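
For reference, the characters that the Lucene query syntax documents as special are + - && || ! ( ) { } [ ] ^ " ~ * ? : \ and they are escaped with a preceding backslash. The sketch below is a hypothetical stand-alone helper (QuerySyntaxEscaper is not a Lucene class) that applies that rule; where Lucene is on the classpath, its own QueryParser.escape is the method to prefer. Note that ";" is not in the list, so the split seen above probably comes from the analyzer's tokenization rather than from the query syntax, in which case escaping alone would not keep the term in one piece.

```java
/** Hypothetical helper: backslash-escapes Lucene query syntax characters. */
final class QuerySyntaxEscaper {

    // Characters taken from the documented query syntax; & and | are
    // escaped individually, which also covers the && and || operators.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    /** Returns term with a backslash inserted before each special character. */
    static String escape(String term) {
        StringBuilder out = new StringBuilder(term.length() + 8);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c); // ';' and other characters are left untouched
        }
        return out.toString();
    }
}
```

If the HTML was indexed with the same analyzer, the indexed text was presumably split in the same way, so the parsed form shown by Luke may still match; that is an assumption worth checking against the index.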



Jens Kraemer wrote:
> Hi!
> On Thu, Mar 09, 2006 at 04:31:23AM -0800, Raul Raja Martinez wrote:
>> Hi I have a lot of html indexed such as:
>> Mart&iacute;nez
>> Of course my users are gonna search for Martínez and they're not gonna 
>> get a match.
>> Is there a common approach to solve this kind of problem in lucene, 
>> Maybe some utility class or something?
> there is a class named Entities in Jakarta commons-lang, which can
> be used to resolve html entities before indexing. Maybe you could
> integrate this into a custom analyzer.
> regards,
> Jens
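
For comparison, here is a rough sketch of the approach Jens describes: resolving entities in the text before it reaches the analyzer. To stay self-contained it uses a small hand-written table instead of commons-lang, so the class name HtmlEntityDecoder and its table are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: resolves a few HTML named entities to characters. */
final class HtmlEntityDecoder {

    // Assumed subset of the entity table; extend as needed.
    private static final Map<String, String> CHARS = new HashMap<>();
    static {
        CHARS.put("aacute", "\u00e1");
        CHARS.put("eacute", "\u00e9");
        CHARS.put("iacute", "\u00ed");
        CHARS.put("oacute", "\u00f3");
        CHARS.put("uacute", "\u00fa");
        CHARS.put("ntilde", "\u00f1");
        CHARS.put("uuml", "\u00fc");
    }

    // Matches named entities such as &iacute;
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    /** Returns html with every known named entity replaced by its character. */
    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = CHARS.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: keep it as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This only helps when the documents can be reindexed, since the decoded text has to be what the analyzer sees at index time.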
