lucy-user mailing list archives

From Jens Krämer
Subject Re: [lucy-user] Indexing HTML documents
Date Tue, 12 Jul 2011 15:54:35 GMT

On 12.07.2011, at 09:39, arjan wrote:
> What you could do to match words with and without accents is adding an extra
> field for the content without accents. There are Perl modules available to
> replace accented characters. This is called "normalization form D".

Wouldn't doing so break the highlighting of matching terms? The hit for 'cafe' would
then occur in the normalized field, but not in the 'main' field that would most
probably be used for showing the excerpt.

I don't know Lucy (yet ;-) but I've done lots of work with Lucene and Ferret, and there I
usually normalize accented characters (and german umlauts) with a special token filter that's
part of a custom analyzer.

Imho, treating 'é' like 'e' should be no harder than treating 'E' like 'e'. So I'd
find the place where your tokens are being downcased and hook in there to perform
the additional normalizations as well. But as I said, I have no idea whether and
how this is possible in Lucy...
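For illustration, here is a minimal sketch of the accent-folding step such a token
filter would perform, using Unicode NFD decomposition. This is Python rather than
Perl, and the function name is my own; it is not a Lucy or Lucene API:

```python
import unicodedata

def fold_accents(text):
    # Decompose to NFD so each accented letter splits into a base character
    # plus combining marks, then drop the combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(fold_accents("café"))    # cafe
print(fold_accents("Müller"))  # Muller
```

Note that plain NFD folds 'ü' to 'u', whereas the usual German convention maps
'ü' to 'ue' (and 'ß' does not decompose at all), so a token filter for German
umlauts would want an explicit replacement table on top of this.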


> On 12-07-11 07:28, Grant McLean wrote:
>> On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
>>> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
>> The final issue I'd like to tackle is the handling of accents.  Ideally
>> I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
>> should be able to type a query with-or-without the accent and match
>> documents with-or-without the accent and have the excerpt highlighting
>> pick up words with-or-without the accent.  I would prefer not to have
>> the search results and excerpts lacking accents if they are present in
>> the source document.  Is this dream scenario possible?  Perhaps with
>> synonyms?  Can anyone suggest an approach?
>> Thanks
>> Grant

Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
