incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Indexing HTML documents
Date Tue, 12 Jul 2011 15:04:08 GMT
On Tue, Jul 12, 2011 at 05:28:30PM +1200, Grant McLean wrote:
> I was guessing that was probably the case.  I ended up using the
> HTML::Strip module from CPAN and apart from a strange encoding issue
> when it used HTML::Entities for entity expansion, it seems to have
> worked reasonably well.

Did you get that encoding issue licked?  If not, can you reproduce it?

I've debugged encoding issues with HTML::Entities before.  Older versions had
a lot of problems, but newer ones are much better behaved -- so you might try
upgrading if you haven't already.

> The first thing I did that improved the results was to rewrite a raw
> query string like this
> 
>     Votes for Women
> 
> into this:
> 
>     (vote AND for AND women) OR ("votes for women")
> 
> and pass the result to the query parser.

You might also experiment with using LucyX::Search::ProximityQuery instead of
PhraseQuery for the supplementary clause.
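
For reference, the string rewrite you describe could be sketched roughly as
follows.  This is only an illustration: rewrite_query is a hypothetical
helper, not part of Lucy, and it covers only the plain-string rewrite (it
does not build a ProximityQuery).  It also does no stemming, so it emits
"votes" rather than "vote".

```perl
use strict;
use warnings;

# Hypothetical helper (not a Lucy API): turn a raw query string into
#   (w1 AND w2 AND ...) OR ("w1 w2 ...")
# so that the exact phrase acts as a supplementary clause.
sub rewrite_query {
    my ($raw) = @_;

    # Lowercase, then split on runs of non-word characters.
    my @words = grep { length } split /\W+/, lc $raw;

    # A single term gains nothing from the rewrite; pass it through.
    return $raw if @words < 2;

    my $all_terms = join ' AND ', @words;
    my $phrase    = join ' ',     @words;
    return qq{($all_terms) OR ("$phrase")};
}

print rewrite_query('Votes for Women'), "\n";
# prints: (votes AND for AND women) OR ("votes for women")
```

The resulting string can then be handed to the query parser as before.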

> Initially I found that doing a phrase search by wrapping double quotes
> around the words didn't seem to make any difference to the results.
> This seemed to be because the phrases I was using all contained
> stopwords and I had indexed using a PolyAnalyser with
> SnowballStopFilter.

Using a SnowballStopFilter strips the stopwords out of the token array:

    "votes for women" => "votes women"

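To make the effect concrete, here is a minimal pure-Perl sketch of what a
stop filter does to a token array.  The stoplist and the strip_stopwords
helper are made up for illustration; the real Snowball stoplist is much
longer, and a real analyzer chain would typically also case-fold and
possibly stem the tokens.

```perl
use strict;
use warnings;

# Hypothetical miniature stoplist, for illustration only.
my %stop = map { $_ => 1 } qw( a an and for in of the to );

# Drop stopwords from a list of tokens, as a stop filter would.
sub strip_stopwords {
    my @tokens = @_;
    return grep { !$stop{$_} } @tokens;
}

my @first  = strip_stopwords(qw( votes for women ));
my @second = strip_stopwords(qw( votes of the women ));

# Both phrases reduce to the same tokens, so the stopwords can no
# longer tell them apart.
print "first:  @first\n";
print "second: @second\n";
```
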
Stoplists have certain advantages, particularly in terms of shrinking index
size, but they can have detrimental effects on precision, particularly when you
need to search for something like '"The Smiths"' and your search returns
everything that contains 'smith'.  Besides, Lucy's scoring model already tends
to diminish the impact of common terms, which reduces the need for a stoplist:

    http://incubator.apache.org/lucy/docs/perl/Lucy/Docs/IRTheory.html#TF-IDF-ranking-algorithm

    ... in a search for skate park, documents which score well for the
    comparatively rare term 'skate' will rank higher than documents which score
    well for the more common term 'park'.
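
A rough numeric sketch of that effect, assuming a Lucene-style inverse
document frequency of 1 + ln(total_docs / (1 + doc_freq)).  Both the formula
and the document counts below are assumptions for illustration; Lucy's
actual Similarity implementation may differ in detail.

```perl
use strict;
use warnings;

# Assumed Lucene-style IDF: rarer terms get a larger weight.
sub idf {
    my ( $total_docs, $doc_freq ) = @_;
    return 1 + log( $total_docs / ( 1 + $doc_freq ) );
}

# Made-up collection statistics, for illustration only.
my $total_docs = 10_000;
my %doc_freq   = ( skate => 40, park => 2_500 );

for my $term ( sort keys %doc_freq ) {
    printf "%-5s idf = %.2f\n", $term, idf( $total_docs, $doc_freq{$term} );
}
```

Under these assumptions the rare term 'skate' gets a much larger weight than
the common term 'park', which is why very common words contribute little to
the ranking even when they are left in the index.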

You may find the following trick useful for inspecting queries produced by
QueryParser:

    use Data::Dumper qw( Dumper );
    my $query = $query_parser->parse($query_string);
    warn Dumper($query->dump);

> I also tried building my own query objects and combining them with
> ORQuery and using 'boost' values for queries on important fields.  This
> exercise was largely fruitless.  In the cases where I managed to get any
> results at all, the effect of boost seemed to be exactly the opposite of
> what I expected - a larger boost led to a smaller score.

Scores are not absolute -- they are only meaningful as relative measures
within the context of a single search.

> That might be because the bits of the query that matched weren't actually
> the ones I expected.  If this is a valid area for people to explore, it
> might be worth adding a working example or two to the documentation.

I like the idea, but I'm not sure exactly where to work this in.  It doesn't
belong in the reference documentation for the individual classes.  Instead it
should go in application documentation under Lucy::Docs -- but I don't think
there's an appropriate article there yet.

Perhaps we could use an article on tuning scoring, Lucy::Docs::Tuning or
something like that.

> So I now have a setup that works reasonably well and gives sensible
> rankings.

:) 

> The final issue I'd like to tackle is the handling of accents.  Ideally
> I'd like to be able to treat 'cafe' and 'café' as equivalent.  The user
> should be able to type a query with-or-without the accent and match
> documents with-or-without the accent and have the excerpt highlighting
> pick up words with-or-without the accent.  I would prefer not to have
> the search results and excerpts lacking accents if they are present in
> the source document.  Is this dream scenario possible?  Perhaps with
> synonyms?  Can anyone suggest an approach?

Arjan's suggestion is the way to go.  Thanks, Arjan!

Marvin Humphrey

