lucy-user mailing list archives

From Grant McLean <gr...@catalyst.net.nz>
Subject Re: [lucy-user] Indexing HTML documents
Date Tue, 12 Jul 2011 05:28:30 GMT
On Sun, 2011-07-10 at 22:47 -0700, Marvin Humphrey wrote:
> On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> > My main interest is indexing HTML documents for web sites.  It seems
> > that if I feed the HTML file contents to the Lucy indexer, all the
> > markup (tags and attributes) ends up in the index and consequently comes
> > back out in the highlighted excerpts. Is it my responsibility to strip
> > the tags out before passing the text to the indexer?
> 
> You have to handle document parsing yourself and supply plain text to Lucy.

I was guessing that was probably the case.  I ended up using the
HTML::Strip module from CPAN and, apart from a strange encoding issue
when it used HTML::Entities for entity expansion, it seems to have
worked reasonably well.
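For anyone following along, the stripping step looks roughly like this
(a sketch rather than my exact code; the read-raw-then-decode step is
just the workaround that happened to cure the encoding weirdness for me,
so treat it as illustrative):

```perl
use strict;
use warnings;
use HTML::Strip;              # from CPAN
use Encode qw(decode_utf8);

my $path = 'page.html';       # illustrative path

# Read the raw bytes of the HTML file
open my $fh, '<:raw', $path or die "Can't open $path: $!";
my $html = do { local $/; <$fh> };
close $fh;

# Strip tags and attributes; HTML::Strip also expands entities
# (via HTML::Entities), which is where my encoding issue crept in
my $hs = HTML::Strip->new;
my $text = $hs->parse($html);
$hs->eof;

# Decode to Perl characters before handing the text to the indexer
my $plain = decode_utf8($text);
```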

I'm now interested in tuning my setup for better-quality search results.
My current application cannot assume a sophisticated user base - they
just want to bang a word or phrase into the search box and hit go.

The first thing I did that improved the results was to rewrite a raw
query string like this

    Votes for Women

into this:

    (vote AND for AND women) OR ("votes for women")

and pass the result to the query parser.
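In case it's useful to anyone, the rewrite is plain string manipulation
before the string ever reaches the query parser — something along these
lines (a sketch; in practice the parser's analyzer takes care of
case-folding and stemming the individual words):

```perl
use strict;
use warnings;

# Turn a raw user query into '(w1 AND w2 ...) OR ("whole phrase")'
sub build_query {
    my ($raw) = @_;
    $raw =~ s/^\s+|\s+$//g;            # trim surrounding whitespace
    my @words = grep { length } split /\s+/, $raw;
    return $raw if @words < 2;         # nothing to combine for one word
    my $anded = join ' AND ', @words;
    return qq{($anded) OR ("$raw")};
}
```

So build_query("Votes for Women") yields
(Votes AND for AND Women) OR ("Votes for Women").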

Initially I found that doing a phrase search by wrapping double quotes
around the words didn't seem to make any difference to the results.
This seemed to be because the phrases I was using all contained
stopwords and I had indexed using a PolyAnalyzer with a
SnowballStopFilter.

The next improvement I made was to index the document title field
without using a stopword filter (I left the filter on for the document
body) and also add a 'boost => 5' to the type definition for the title
field.

This resulted in a more manageable number of hits and better ranking.
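Roughly, the schema ended up looking like this (a sketch assuming Lucy's
Perl bindings; the exact analyzer chain is from memory, so adjust to
taste):

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::CaseFolder;
use Lucy::Analysis::RegexTokenizer;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Analysis::SnowballStopFilter;

# Body: full chain including the stopword filter
my $body_analyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [
        Lucy::Analysis::CaseFolder->new,
        Lucy::Analysis::RegexTokenizer->new,
        Lucy::Analysis::SnowballStopFilter->new( language => 'en' ),
        Lucy::Analysis::SnowballStemmer->new( language => 'en' ),
    ],
);

# Title: the same chain minus the stopword filter, so phrases
# like "votes for women" survive intact
my $title_analyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [
        Lucy::Analysis::CaseFolder->new,
        Lucy::Analysis::RegexTokenizer->new,
        Lucy::Analysis::SnowballStemmer->new( language => 'en' ),
    ],
);

my $schema = Lucy::Plan::Schema->new;
$schema->spec_field(
    name => 'title',
    type => Lucy::Plan::FullTextType->new(
        analyzer => $title_analyzer,
        boost    => 5,            # weight title matches more heavily
    ),
);
$schema->spec_field(
    name => 'body',
    type => Lucy::Plan::FullTextType->new( analyzer => $body_analyzer ),
);
```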

I also tried building my own query objects and combining them with
ORQuery and using 'boost' values for queries on important fields.  This
exercise was largely fruitless.  In the cases where I managed to get any
results at all, the effect of boost seemed to be exactly the opposite of
what I expected - a larger boost led to a smaller score.  That might be
because the bits of the query that matched weren't actually the ones I
expected.  If this is a valid area for people to explore, it might be
worth adding a working example or two to the documentation.
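For the record, this is the shape of what I was attempting (a sketch;
one thing I've since realised is that TermQuery takes raw terms that are
never run through the analyzer, so they need to be pre-stemmed and
case-folded — I suspect that mismatch, rather than boost itself,
explains most of my missing results):

```perl
use strict;
use warnings;
use Lucy::Search::TermQuery;
use Lucy::Search::ORQuery;

# Terms must already match what's in the index (stemmed, lowercased)
my $title_q = Lucy::Search::TermQuery->new(
    field => 'title',
    term  => 'women',
);
$title_q->set_boost(5);     # weight title matches more heavily

my $body_q = Lucy::Search::TermQuery->new(
    field => 'body',
    term  => 'women',
);

my $query = Lucy::Search::ORQuery->new(
    children => [ $title_q, $body_q ],
);

# my $hits = $searcher->hits( query => $query );
```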

So I now have a setup that works reasonably well and gives sensible
rankings.

The final issue I'd like to tackle is the handling of accents.  Ideally
I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
should be able to type a query with-or-without the accent and match
documents with-or-without the accent and have the excerpt highlighting
pick up words with-or-without the accent.  I would prefer not to have
the search results and excerpts lacking accents if they are present in
the source document.  Is this dream scenario possible?  Perhaps with
synonyms?  Can anyone suggest an approach?
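One building block that might get part-way there is stripping combining
marks with the core Unicode::Normalize module — decompose to NFD, delete
the marks, and apply the same folding to both the indexed tokens and the
query terms (a sketch of the folding step only; it doesn't solve keeping
accents in the displayed excerpts, which presumably requires storing the
original text alongside the folded tokens):

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Fold accented characters to their base letters:
# NFD splits 'é' into 'e' + U+0301, then we delete the combining marks
sub fold_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;    # remove nonspacing (combining) marks
    return $decomposed;
}
```

With that, fold_diacritics('café') gives 'cafe', so 'cafe' and 'café'
index and query identically once both sides are folded.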

Thanks
Grant




