lucene-java-user mailing list archives

From "Scott Davies" <sco...@gmail.com>
Subject Re: Per-token weighting / attribute data in index
Date Fri, 02 Jun 2006 21:30:45 GMT
A simple example would be indexing and scoring the hyperlink text from
other web pages that point to the page P that I'm indexing/scoring.  I
might have some metric saying how much I "trust" each of the pages or
sites with hyperlinks to P, and want to use that metric to increase or
decrease how much the text in those hyperlinks increases the score of
P for queries containing that anchortext.  Since each incoming
hyperlink is from a different site with a different trustworthiness,
I'd obviously want to be able to vary that boost independently for
every different hyperlink pointing at page P.
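
To make the idea concrete, here is a toy sketch in plain Java. It is
not Lucene code and not Lucene's scoring formula; the class name, the
IncomingLink type, and the trust values are all made up for
illustration. It only shows the behavior I'm after: each incoming link
contributes to P's score in proportion to its own trust weight.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model only: hypothetical names, no Lucene classes involved.
class AnchorTrustSketch {

    // One incoming hyperlink: its anchor text plus an assumed trust
    // weight for the page or site the link comes from.
    static final class IncomingLink {
        final String anchorText;
        final double trust;

        IncomingLink(String anchorText, double trust) {
            this.anchorText = anchorText;
            this.trust = trust;
        }
    }

    // Adds up the trust of every link whose anchor text contains the
    // query term as a whole whitespace-separated token (case-insensitive).
    // Each matching link counts once, however often the term repeats.
    static double anchorScore(List<IncomingLink> links, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        double score = 0.0;
        for (IncomingLink link : links) {
            String[] tokens = link.anchorText.toLowerCase(Locale.ROOT).split("\\s+");
            if (Arrays.asList(tokens).contains(needle)) {
                score += link.trust;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Three links pointing at page P, each with its own trust weight.
        List<IncomingLink> linksToP = Arrays.asList(
            new IncomingLink("lucene search tutorial", 2.5),
            new IncomingLink("cheap lucene downloads", 0.1),
            new IncomingLink("java libraries", 1.0));
        System.out.println(anchorScore(linksToP, "lucene"));
    }
}
```

The point is that the weight is attached to each link's tokens, not to
a whole field or a whole document, which is exactly what a single
per-field boost cannot express.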

On 6/2/06, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
> i may be misunderstanding your goal .. it sounds like what you want to do
> is say that for certain documents (which you trust) matching on the title
> is "worth more" than matching on the title of other documents (which you
> don't trust)
>
> if that's the case, then at index time you can add a field boost on the
> title just for the documents you trust, and add no boost for the documents
> you don't trust.
>
> If i've misunderstood your question, could you provide a use case
> describing your goal, and where lucene fails to meet it?
>
>
>
> : Date: Fri, 2 Jun 2006 13:14:41 -0700
> : From: Scott Davies <scottd@gmail.com>
> : Reply-To: java-user@lucene.apache.org
> : To: java-user@lucene.apache.org
> : Subject: Per-token weighting / attribute data in index
> :
> : Hi...reasonably experienced web search programmer but total Lucene newbie here.
> :
> : After poking through Lucene for a while, I still haven't figured out a
> : decent way to tweak the scoring based on per-token data.  For example,
> : as far as I can tell so far, the only reasonable way to have words in
> : the titles or headers of HTML documents be "worth more" for scoring
> : purposes than ordinary body text is to make "title" and "header"
> : fields and apply appropriate field boosts across all documents.  That
> : works OK if you only have a few special fields you want to boost by
> : some consistent amount each, but falls down if, say, you wanted to
> : include some sort of "tags" or anchortext in the scoring of documents
> : where there's a high degree of variability in how much any given tag
> : or anchor should be "trusted" and thus influence the score.  (I could
> : conceivably discretize the boosts and, say, put all the anchortext
> : with boost 2.5 in a special "anchortext-boost2.5" field, but that
> : would be extremely awkward and presumably cause major performance
> : issues as the number of fields increases.)
> :
> : Have I just failed to notice the right way to do this, or is there
> : really no decent way to do it in Lucene at this time?  If the latter,
> : are there any plans to add this feature at some point semi-soon?  This
> : seems to me like a major scoring limitation for applications not just
> : indexing and searching over plain text documents...
> :
> : Thanks,
> :
> : -- Scott
> :
> : ---------------------------------------------------------------------
> : To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> : For additional commands, e-mail: java-user-help@lucene.apache.org
> :
>
>
>
> -Hoss
