lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: How to compute the similarity of a web page?
Date Tue, 17 Feb 2009 03:08:42 GMT
Hmmm, you might be able to do the following:

1. Create a document in a memory index containing the web page.
2. Create a query from the keywords.
3. Run the query against the memory index and check the score.
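The three steps above can be sketched with a toy in-memory scorer. This is plain Java, not Lucene's actual API (the real class for this would be org.apache.lucene.index.memory.MemoryIndex); the class and method names here are illustrative stand-ins:

```java
import java.util.*;

// Toy sketch: "index" one document in memory, build a query from
// keywords, and score the query against the document.  Scoring here is
// just the sum of relative term frequencies of matching keywords --
// far simpler than Lucene's scoring, but it shows the shape of the idea.
public class MemoryScoreSketch {
    private final Map<String, Integer> termFreqs = new HashMap<>();
    private int docLength;

    // Step 1: add the web page's text to the in-memory "index".
    public void addDocument(String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            termFreqs.merge(token, 1, Integer::sum);
            docLength++;
        }
    }

    // Steps 2 and 3: treat the keyword list as the query and score it.
    public double score(List<String> keywords) {
        double s = 0.0;
        for (String kw : keywords) {
            s += termFreqs.getOrDefault(kw.toLowerCase(), 0) / (double) docLength;
        }
        return s;
    }
}
```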

Alternatively, you could use the corpus statistics to create a term
vector from the document (as if it were a member of the collection) and
then compute the cosine similarity between that document and your query
(whose term weights you also calculated from your collection's
statistics).
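That cosine calculation might look roughly like the following. This is a minimal hand-rolled sketch using a common simplified idf variant, idf(t) = log(N / df(t)); in practice you would pull the document frequencies and weights from your index rather than computing them by hand:

```java
import java.util.*;

// Sketch: cosine similarity between two sparse term-weight vectors
// (e.g. a document vector and a query vector, both tf-idf weighted).
public class CosineSketch {
    // Simplified inverse document frequency from collection stats.
    public static double idf(int numDocs, int docFreq) {
        return Math.log(numDocs / (double) docFreq);
    }

    // Cosine = dot product divided by the product of the norms.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }
}
```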

Last, it sounds like you are essentially describing a categorization
task.  Have a look at some categorization software (for instance,
Mahout can do Naive Bayes categorization, among other approaches).
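To make the categorization idea concrete, here is a minimal multinomial Naive Bayes text classifier with Laplace smoothing in plain Java. It is a toy illustration of the technique, not Mahout's actual API:

```java
import java.util.*;

// Minimal multinomial Naive Bayes: train on (label, text) pairs, then
// classify new text by the label with the highest log posterior.
public class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>(); // label -> term counts
    private final Map<String, Integer> totals = new HashMap<>();              // label -> total tokens
    private final Map<String, Integer> docCounts = new HashMap<>();           // label -> docs seen
    private final Set<String> vocab = new HashSet<>();
    private int numDocs;

    public void train(String label, String text) {
        Map<String, Integer> tf = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (tok.isEmpty()) continue;
            tf.merge(tok, 1, Integer::sum);
            totals.merge(label, 1, Integer::sum);
            vocab.add(tok);
        }
        docCounts.merge(label, 1, Integer::sum);
        numDocs++;
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : counts.keySet()) {
            // log prior + sum of smoothed log likelihoods
            double score = Math.log(docCounts.get(label) / (double) numDocs);
            Map<String, Integer> tf = counts.get(label);
            int total = totals.get(label);
            for (String tok : text.toLowerCase().split("\\W+")) {
                if (tok.isEmpty()) continue;
                int c = tf.getOrDefault(tok, 0);
                score += Math.log((c + 1.0) / (total + vocab.size())); // Laplace smoothing
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}
```

For the original poster's setup, each "theme" would be a label, and the weighted keyword lists would be (or seed) the training data.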

Of course, I might be missing something in understanding what you are  
asking, so feel free to give a shout back to discuss.


On Feb 12, 2009, at 1:31 AM, renavatior wrote:

> I am doing some research in vertical search. I have defined weights
> for several keywords in my corpus that express a certain theme; how
> can I use these to compute the similarity with a given web page
> (passed by URL to the compute method)? I looked at the relevant
> source code in Lucene, but I do not know how to use methods such as
> tf, idf, and so on.
> I would really appreciate any advice. Thanks in advance.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

