lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: indexing anchor text
Date Wed, 27 Jun 2007 21:54:53 GMT
Well, to quote the great wise one, "that depends". I'm being
flippant because the answer really does depend on what you want
the result to be.

I'm asking for a use-case scenario here. Something like
"I want the docs to score equally no matter how many
links with 'United States' exist in them". Or
"A document with 100 links mentioning 'United States' should
score way higher than a document with only one link mentioning
'United States'".
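
To make the difference between those two scenarios concrete, here is a
rough sketch. It assumes the default term-frequency factor is the square
root of the raw count, which is how DefaultSimilarity's tf() behaves as
far as I know. The class and method names are made up for illustration
and are not Lucene API; flatTf() is a hypothetical alternative that
counts any match once.

```java
public class AnchorTfSketch {
    // Square-root damping: the shape of the default tf factor (assumption: sqrt(freq)).
    static double sqrtTf(int freq) {
        return Math.sqrt(freq);
    }

    // Hypothetical flat variant: a term either matches or it does not.
    static double flatTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        int[] linkCounts = {1, 100};
        for (int n : linkCounts) {
            // Print both factors side by side for each link count.
            System.out.println("links=" + n
                    + " sqrtTf=" + sqrtTf(n)
                    + " flatTf=" + flatTf(n));
        }
    }
}
```

Under the square-root shape, 100 mentions give a factor of 10 rather
than 100, so repetition helps but with diminishing returns. The flat
variant corresponds to the first scenario, where the number of links
makes no difference.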

Best
Erick

On 6/27/07, Tim Sturge <tsturge@metaweb.com> wrote:
>
> Hi,
>
> I'm trying to index some fairly standard html documents. For each of the
> documents, there is a unique <title> (which I believe is generally of
> high quality), some <body> content, and some anchor text from the
> linking documents (which is of good but more variable quality).
>
> I'm indexing them into three fields: "title", "anchor" and "body".
>
> "title" and "body" are obvious (you just give the text to the
> StandardAnalyzer) but I don't really know how to handle the anchor text.
> Suppose I know that the page titled "United States" has the anchor
> text "USA" 500 times, "United States" 200 times, "United States of
> America" 100 times and "Unite Stats" once.
>
> How do I index this?
>
> 1) index a single "anchor" field containing "USA United States United
> States of America Unite Stats",
> 2) create the field  "USA USA ...500x... USA  United States ...200x...
> United States ... " and index that as "anchor"
> 3) create 801 "anchor" fields (500 containing USA etc)
> 4) create 4 "anchor" fields and call setBoost() on each with some
> constants. (how do I calculate them?)
>
> I suspect these give me different results in some way, but I'm having
> trouble understanding what the difference between 2) and 3) is and how
> to make 4) work like 3). I also worry that 2) and 3) are much slower
> than they need to be.
>
> Any help is appreciated,
>
> Tim
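
For comparing options 1) and 2)/3), a back-of-the-envelope sketch may
help. It counts terms in the "anchor" field and combines a square-root
tf with a 1/sqrt(field length) norm, which I believe is the shape of the
default similarity. Everything here is an assumption for illustration:
the names are hypothetical, tokenizing is plain whitespace splitting
with lowercasing (no stop-word removal, unlike StandardAnalyzer), and
options 2) and 3) are treated as producing the same term counts on the
understanding that same-named fields in a document are indexed as one
field.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class AnchorOptionsSketch {
    // Anchor phrases and their occurrence counts, taken from the question.
    static Map<String, Integer> anchors() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("USA", 500);
        m.put("United States", 200);
        m.put("United States of America", 100);
        m.put("Unite Stats", 1);
        return m;
    }

    // Term counts for the field. repeat=false counts each phrase once
    // (option 1); repeat=true counts it as often as it occurs (options 2 and 3).
    static Map<String, Integer> termFreqs(Map<String, Integer> anchors, boolean repeat) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : anchors.entrySet()) {
            int times = repeat ? e.getValue() : 1;
            for (String token : e.getKey().toLowerCase(Locale.ROOT).split("\\s+")) {
                tf.merge(token, times, Integer::sum);
            }
        }
        return tf;
    }

    // Total number of tokens in the field.
    static int fieldLength(Map<String, Integer> tf) {
        int n = 0;
        for (int count : tf.values()) {
            n += count;
        }
        return n;
    }

    // sqrt(term frequency) scaled by 1/sqrt(field length).
    static double tfTimesNorm(Map<String, Integer> tf, String term) {
        int freq = tf.getOrDefault(term, 0);
        return Math.sqrt(freq) / Math.sqrt(fieldLength(tf));
    }

    public static void main(String[] args) {
        Map<String, Integer> once = termFreqs(anchors(), false);
        Map<String, Integer> repeated = termFreqs(anchors(), true);
        for (String term : new String[] {"usa", "united", "unite"}) {
            System.out.println(term
                    + " option1=" + tfTimesNorm(once, term)
                    + " option2/3=" + tfTimesNorm(repeated, term));
        }
    }
}
```

With each phrase indexed once, the field has 9 tokens and "usa" and
"unite" both occur exactly once, so the misspelling carries as much
weight as the most common anchor. With repetition, the field grows to
1302 tokens and "usa" occurs 500 times, so frequent anchors dominate and
the one-off "unite" is pushed down by the longer field.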
