lucene-java-user mailing list archives

From Tim Sturge <tstu...@metaweb.com>
Subject indexing anchor text
Date Wed, 27 Jun 2007 21:46:17 GMT
Hi,

I'm trying to index some fairly standard HTML documents. Each document 
has a unique <title> (which I believe is generally of high quality), 
some <body> content, and some anchor text from the documents that link 
to it (which is of good but more variable quality).

I'm indexing them into three fields: "title", "anchor", and "body".

"title" and "body" are obvious (you just give the text to the 
StandardAnalyzer), but I don't really know how to handle the anchor text. 
Suppose I know that the page titled "United States" is linked with the 
anchor text "USA" 500 times, "United States" 200 times, "United States 
of America" 100 times, and "Unite Stats" once.

How do I index this?

1) index a single "anchor" field containing each distinct anchor text 
once: "USA United States United States of America Unite Stats",
2) create a single field value with every occurrence repeated, "USA USA 
...500x... USA United States ...200x... United States ...", and index 
that as "anchor",
3) create 801 "anchor" fields (500 containing "USA", etc.),
4) create 4 "anchor" fields and call setBoost() on each with some 
constants (how do I calculate them?).
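
To make options 1) to 3) concrete, here is a minimal sketch in plain Java of how the field values could be generated from the anchor counts above. It makes no Lucene calls; the class name AnchorFieldSketch and its methods are hypothetical names used only for illustration, and the result of each method would still have to be added to the document as "anchor" field(s).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: builds the candidate values
// for the "anchor" field from a map of anchor text -> occurrence count.
class AnchorFieldSketch {

    // The counts from the "United States" example, in insertion order.
    static Map<String, Integer> exampleCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("USA", 500);
        counts.put("United States", 200);
        counts.put("United States of America", 100);
        counts.put("Unite Stats", 1);
        return counts;
    }

    // Option 1: each distinct anchor text once, joined into one value.
    static String distinctText(Map<String, Integer> counts) {
        return String.join(" ", counts.keySet());
    }

    // Option 3: one value per occurrence (each would become its own field).
    static List<String> perOccurrenceValues(Map<String, Integer> counts) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                values.add(e.getKey());
            }
        }
        return values;
    }

    // Option 2: the same occurrences, joined into one long value.
    static String repeatedText(Map<String, Integer> counts) {
        return String.join(" ", perOccurrenceValues(counts));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = exampleCounts();
        System.out.println(distinctText(counts));
        System.out.println(perOccurrenceValues(counts).size());     // 801
        System.out.println(repeatedText(counts).split(" ").length); // 1302
    }
}
```

Options 2) and 3) carry the same 1302 whitespace-separated words; they differ only in whether those words arrive as one field value or as 801.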

I suspect these give me different results in some way, but I'm having 
trouble understanding what the difference between 2) and 3) is and how 
to make 4) work like 3). I also worry that 2) and 3) are much slower 
than they need to be.
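
As a rough way to see what 2) and 3) would put into the index, the sketch below tallies how often each term would occur in the "anchor" field for the example counts. It assumes a simplified tokenizer (lowercase, split on whitespace) rather than the real StandardAnalyzer, which may treat some words differently (for example, as stop words); the class name AnchorTermCounts is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: estimates per-term frequencies
// in the "anchor" field when every anchor occurrence is indexed.
class AnchorTermCounts {

    // Assumption: tokens are lowercased words split on whitespace.
    static Map<String, Integer> termFrequencies(Map<String, Integer> anchorCounts) {
        Map<String, Integer> tf = new TreeMap<>();
        for (Map.Entry<String, Integer> e : anchorCounts.entrySet()) {
            for (String token : e.getKey().toLowerCase().split("\\s+")) {
                // Each word of the anchor text occurs once per occurrence of the anchor.
                tf.merge(token, e.getValue(), Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("USA", 500);
        counts.put("United States", 200);
        counts.put("United States of America", 100);
        counts.put("Unite Stats", 1);

        Map<String, Integer> tf = termFrequencies(counts);
        int total = 0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
            total += e.getValue();
        }
        System.out.println("total " + total); // total 1302
    }
}
```

Under these assumptions the field holds 1302 term occurrences for only 7 distinct terms, versus 9 occurrences for option 1).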

Any help is appreciated,

Tim



