lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raul Raja Martinez <doblee...@estudiowebs.com>
Subject Re: Best practice for searching html
Date Fri, 10 Mar 2006 00:20:11 GMT
Hi,

Unfortunately I can't change the way things are indexed, so I guess I 
need some short of utility class that will turn Martínez into 
Mart&iacute;nez and then just search for that term.

I have also this problem using the StandardAnalyzer:

If I search for "c&aacute;diz" in luke the query gets parsed as 
"c&aacute diz" it changes ";" for " ". I would like to know which 
symbols are not to be used in the search and how to scape them.

What would be the right query to find "c&aacute;diz"?

Thanks.

Raul

Jens Kraemer wrote:
> Hi!
> 
> On Thu, Mar 09, 2006 at 04:31:23AM -0800, Raul Raja Martinez wrote:
>> Hi I have a lot of html indexed such as:
>>
>> Mart&iacute;nez
>>
>> Of course my users are gonna search for Martínez and they're not gonna 
>> get a match.
>>
>> Is there a common approach to solve this kind of problem in lucene, 
>> Maybe some utility class or something?
> 
> there is a class named Entities in Jakarta commons-lang, which can
> be used to resolve html entities before indexing. Maybe you could
> integrate this into a custom analyzer.
> 
> http://jakarta.apache.org/commons/lang/xref/org/apache/commons/lang/Entities.html
> 
> regards,
> Jens
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message