lucy-user mailing list archives

From Grant McLean <>
Subject Re: [lucy-user] Indexing HTML documents
Date Wed, 13 Jul 2011 10:26:55 GMT
On Tue, 2011-07-12 at 08:04 -0700, Marvin Humphrey wrote:
> On Tue, Jul 12, 2011 at 05:28:30PM +1200, Grant McLean wrote:
> > I was guessing that was probably the case.  I ended up using the
> > HTML::Strip module from CPAN and apart from a strange encoding issue
> > when it used HTML::Entities for entity expansion, it seems to have
> > worked reasonably well.
> Did you get that encoding issue licked?  If not, can you reproduce it?

Yes, I worked around it.  I'm not sure I can really pin the blame on
HTML::Entities.  What threw me was that as long as the entities being
expanded fit in the U+0080 - U+00FF range, the return value is a byte
string rather than a UTF-8 character string.  As soon as it expands an
entity beyond U+00FF, the returned value is a character string with the
UTF-8 flag set.  I think this inconsistent behaviour exposed a bad
assumption in another part of my code.
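For what it's worth, the inconsistency can be reproduced with core Perl
alone, so it may just be Perl's internal string representation showing
through rather than HTML::Entities doing anything wrong:

```perl
# Characters at or below U+00FF can be stored as bytes with the
# UTF-8 flag off, while anything above U+00FF forces the flag on.
my $latin = chr(0xE9);      # U+00E9 (e-acute), fits in a byte
my $wide  = chr(0x2026);    # U+2026 (ellipsis), beyond U+00FF

print utf8::is_utf8($latin) ? "flag on\n" : "flag off\n";   # flag off
print utf8::is_utf8($wide)  ? "flag on\n" : "flag off\n";   # flag on

# Upgrading the byte string makes the two representations consistent:
utf8::upgrade($latin);
```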

> > The first thing I did that improved the results was to rewrite a raw
> > query string like this
> > 
> >     Votes for Women
> > 
> > into this:
> > 
> >     (vote AND for AND women) OR ("votes for women")
> > 
> > and pass the result to the query parser.
> You might also experiment with using LucyX::Search::ProximityQuery instead of
> PhraseQuery for the supplementary clause.

I would definitely be interested to do that but I'm not sure how to go
about it.  I built a query object like this:

    my $proximity_query = LucyX::Search::ProximityQuery->new(
        field  => 'content',
        terms  => \@words,
        within => 10,    # match within 10 positions
    );

But $searcher->hits( query => $proximity_query ) never seems to return
me any matches at all no matter what words I feed it.  Does it need to
be combined with another query?

> You may find the following trick useful for inspecting queries produced by
> QueryParser:
>     use Data::Dumper qw( Dumper );
>     my $query = $query_parser->parse($query_string);
>     warn Dumper($query->dump);

That's definitely very interesting.  I'll have a closer look at that
after a good night's sleep :-)

> > I also tried building my own query objects and combining them with
> > ORQuery and using 'boost' values for queries on important fields.  
> > ... If this is a valid area for people to explore, it
> > might be worth adding a working example or two to the documentation.
> I like the idea, but I'm not sure exactly where to work this in.

I think further study of the dumped query will probably move me in the
right direction.

> > The final issue I'd like to tackle is the handling of accents.  Ideally
> > I'd like to be able to treat 'cafe' and 'café' as equivalent.
> Arjan's suggestion is the way to go.  Thanks, Arjan!

What I've done is very similar to Arjan's suggestion.  I've actually
appended a normalised copy of the full text to the content field.  This
has the advantage that the highlighter can choose the part of the text
that includes or omits the accents as appropriate.
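The normalised copy is produced along these lines, using the core
Unicode::Normalize module (strip_accents is just an illustrative name,
not what my code actually calls it):

```perl
use utf8;
use Unicode::Normalize qw( NFD );

# Decompose to NFD, then strip the combining marks, so 'café'
# becomes 'cafe' while the original text is kept alongside it.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;    # remove non-spacing (combining) marks
    return $decomposed;
}

my $content = "café";
my $indexed = $content . ' ' . strip_accents($content);   # "café cafe"
```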

One new area I'd like to explore is the idea of assigning a page rank to
each document that could be used as a multiplier in the ranking.  In the
collection of documents I'm working with, the following rules apply:

 * documents with many incoming links are 'better'
 * recent documents are better than ones with older publication dates
 * longer documents are somewhat 'better' than shorter ones (or to
   put it another way we have quite a few very short documents that are
   definitely 'worse')

I'll do some more reading and see if I can work out where this sort of
rule might be applied.

Thanks everyone for the useful suggestions.
