lucy-user mailing list archives

From arjan
Subject Re: [lucy-user] Indexing HTML documents
Date Tue, 12 Jul 2011 07:39:31 GMT
Hi Grant,

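To match words with and without accents, you could add an extra field
holding the content with the accents removed. There are Perl modules
available to strip accents from characters: the text is first decomposed
to Unicode "Normalization Form D" (NFD), which splits each accented
character into a base letter plus combining marks, and the combining
marks are then removed.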
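
As a rough illustration of the accent-folding idea (written in Python
rather than Perl, purely as a sketch), the text can be decomposed to NFD
and the combining marks dropped. The field names "content" and
"content_folded" are hypothetical and not part of any Lucy schema.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop the combining marks (the accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_document(content: str) -> dict:
    # Hypothetical field names: the original text is kept for display,
    # the folded copy is the extra field meant for matching.
    return {"content": content, "content_folded": strip_accents(content)}


doc = make_document("Un café crème")
print(doc["content_folded"])
```

The same folding would presumably have to be applied to the user's query
string before it is run against the folded field, so that 'café' and
'cafe' end up as the same term on both sides.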

Kind regards,

On 12-07-11 07:28, Grant McLean wrote:
> On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
>> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
>>> My main interest is indexing HTML documents for web sites.  It seems
>>> that if I feed the HTML file contents to the Lucy indexer, all the
>>> markup (tags and attributes) ends up in the index and consequently comes
>>> back out in the highlighted excerpts. Is it my responsibility to strip
>>> the tags out before passing the text to the indexer?
>> You have to handle document parsing yourself and supply plain text to Lucy.
> I was guessing that was probably the case.  I ended up using the
> HTML::Strip module from CPAN and apart from a strange encoding issue
> when it used HTML::Entities for entity expansion, it seems to have
> worked reasonably well.
> I'm now interested in tuning my setup for better quality search results.
> My current application cannot assume a sophisticated user base - they
> just want to bang a word or phrase into the search box and hit go.
> The first thing I did that improved the results was to rewrite a raw
> query string like this
>      Votes for Women
> into this:
>      (votes AND for AND women) OR ("votes for women")
> and pass the result to the query parser.
> Initially I found that doing a phrase search by wrapping double quotes
> around the words didn't seem to make any difference to the results.
> This seemed to be because the phrases I was using all contained
> stopwords and I had indexed using a PolyAnalyzer with
> SnowballStopFilter.
> The next improvement I made was to index the document title field
> without using a stopword filter (I left the filter on for the document
> body) and also add a 'boost =>  5' to the type definition for the title
> field.
> This resulted in a more manageable number of hits and better ranking.
> I also tried building my own query objects and combining them with
> ORQuery and using 'boost' values for queries on important fields.  This
> exercise was largely fruitless.  In the cases where I managed to get any
> results at all, the effect of boost seemed to be exactly the opposite of
> what I expected - a larger boost led to a smaller score.  That might be
> because the bits of the query that matched weren't actually the ones I
> expected.  If this is a valid area for people to explore, it might be
> worth adding a working example or two to the documentation.
> So I now have a setup that works reasonably well and gives sensible
> rankings.
> The final issue I'd like to tackle is the handling of accents.  Ideally
> I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
> should be able to type a query with-or-without the accent and match
> documents with-or-without the accent and have the excerpt highlighting
> pick up words with-or-without the accent.  I would prefer not to have
> the search results and excerpts lacking accents if they are present in
> the source document.  Is this dream scenario possible?  Perhaps with
> synonyms?  Can anyone suggest an approach?
> Thanks
> Grant
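
The query rewrite Grant describes in the quoted message could be sketched
roughly as follows (again in Python, for illustration only). The function
name rewrite_query is hypothetical, and lowercasing the words and
dropping any double quotes from the input are assumptions of this sketch,
not something the original message specifies. No stemming is attempted.

```python
def rewrite_query(raw: str) -> str:
    """Combine an all-terms-required clause with an exact-phrase clause."""
    # Assumption: drop double quotes so user input cannot unbalance
    # the phrase that is generated below.
    words = raw.replace('"', " ").lower().split()
    if not words:
        return ""
    if len(words) == 1:
        # A single word needs neither AND nor a phrase clause.
        return words[0]
    all_terms = " AND ".join(words)
    phrase = " ".join(words)
    return f'({all_terms}) OR ("{phrase}")'


print(rewrite_query("Votes for Women"))
```

The resulting string is what would then be handed to the query parser.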


United Knowledge, internet for the public sector
Keizersgracht 74
1015 CT Amsterdam
T +31 (0)20 52 18 300
F +31 (0)20 52 18 301

M +31 (0)6 2427 1444
