lucene-java-user mailing list archives

From "Scott Davies" <sco...@gmail.com>
Subject Per-token weighting / attribute data in index
Date Fri, 02 Jun 2006 20:14:41 GMT
Hi, I'm a reasonably experienced web search programmer but a total Lucene newbie.

After poking through Lucene for a while, I still haven't figured out a
decent way to tweak the scoring based on per-token data.  For example,
as far as I can tell so far, the only reasonable way to have words in
the titles or headers of HTML documents be "worth more" for scoring
purposes than ordinary body text is to make "title" and "header"
fields and apply appropriate field boosts across all documents.  That
works OK if you only have a few special fields you want to boost by
some consistent amount each, but falls down if, say, you wanted to
include some sort of "tags" or anchortext in the scoring of documents
where there's a high degree of variability in how much any given tag
or anchor should be "trusted" and thus influence the score.  (I could
conceivably discretize the boosts and, say, put all the anchortext
with boost 2.5 in a special "anchortext-boost2.5" field, but that
would be extremely awkward and presumably cause major performance
issues as the number of fields increases.)
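
To make the discretization workaround concrete, here is a rough sketch of the
bucketing step only. It deliberately does not call any Lucene API: the class
name BoostBuckets, the helper names, the 0.5 step size, and the sample anchor
weights are all hypothetical, chosen just for illustration. Each resulting
field name would then need its own field boost at index time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: snaps a per-anchor trust weight to a fixed step and
// derives a field name such as "anchortext-boost2.5" from it.
public class BoostBuckets {

    // Round the boost to the nearest multiple of step and build the field name.
    static String fieldFor(String base, double boost, double step) {
        double snapped = Math.round(boost / step) * step;
        return String.format(Locale.ROOT, "%s-boost%.1f", base, snapped);
    }

    // Group anchor texts by the discretized field they would be indexed under.
    static Map<String, List<String>> bucket(Map<String, Double> anchors, double step) {
        Map<String, List<String>> byField = new TreeMap<>();
        for (Map.Entry<String, Double> e : anchors.entrySet()) {
            String field = fieldFor("anchortext", e.getValue(), step);
            byField.computeIfAbsent(field, k -> new ArrayList<>()).add(e.getKey());
        }
        return byField;
    }

    public static void main(String[] args) {
        // Made-up anchors with made-up trust weights.
        Map<String, Double> anchors = new LinkedHashMap<>();
        anchors.put("lucene home page", 2.6);
        anchors.put("click here", 0.4);
        anchors.put("search library", 2.4);
        System.out.println(bucket(anchors, 0.5));
    }
}
```

Even in this toy form the drawback is visible: the number of distinct fields
grows with the range of boosts divided by the step size, and every query has
to be expanded across all of them.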

Have I just failed to notice the right way to do this, or is there
really no decent way to do it in Lucene at this time?  If the latter,
are there any plans to add this feature semi-soon?  This seems to me
like a major scoring limitation for any application that does more
than index and search plain text documents.

Thanks,

-- Scott
