lucene-java-user mailing list archives

From "Kadlabalu, Hareesh" <hareesh.kadlab...@fatwire.com>
Subject Searching for similar documents
Date Sat, 16 Jul 2005 05:25:08 GMT
Hi,
I am trying to build a search utility that finds similarities between
documents.
In other words, for every document listed in the search results for a
phrase, I want to be able to list documents that are similar to it (but that
do not necessarily match the same search criteria). For example, if my
search for "Tomcat" returned "Tomcat installation guide", I want the utility
to find all similar installation guides, whether or not they are related to
Tomcat.

One approach I am considering is to use term vectors. Algorithm: first
extract the top X terms from the current document's term vector and build a
Boolean query from those terms. Then run that query against the contents of
the other documents (I will probably have to remove commonly used terms
manually?). The resulting documents should be similar to the original one.
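
To make the idea concrete, here is a minimal sketch of that algorithm in
plain Java. It does not use the Lucene API: the in-memory corpus map, the
SimilarDocs class, its method names, and the small stop-word list are all
hypothetical stand-ins chosen for illustration. In a real index the term
frequencies would come from stored term vectors, and the matching step would
be a Boolean query of optional clauses rather than the simple count below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: class, method names, and stop list are assumptions.
final class SimilarDocs {
    // Hypothetical stop list standing in for "commonly used terms".
    static final Set<String> STOP_WORDS = Set.of(
        "the", "a", "an", "and", "or", "of", "to", "for", "in", "on", "is", "with");

    // Lowercase, split on non-alphanumerics, and drop stop words.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The X most frequent terms of a document; ties are broken alphabetically.
    static List<String> topTerms(String text, int x) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(text)) {
            tf.merge(t, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(tf.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(x, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Emulates an OR query over the top terms: every other document that
    // contains at least one of them is returned, best match first.
    static List<String> findSimilar(String sourceId, Map<String, String> corpus, int x) {
        List<String> query = topTerms(corpus.get(sourceId), x);
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, String> doc : corpus.entrySet()) {
            if (doc.getKey().equals(sourceId)) {
                continue; // skip the document we started from
            }
            Set<String> terms = new HashSet<>(tokenize(doc.getValue()));
            int matched = 0;
            for (String q : query) {
                if (terms.contains(q)) {
                    matched++;
                }
            }
            if (matched > 0) {
                scores.put(doc.getKey(), matched);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }
}
```

Calling SimilarDocs.findSimilar("tomcat", corpus, 5) on a map of document id
to text returns the ids of the other documents that share at least one of the
source document's five most frequent terms. Note that a distinctive term such
as "tomcat" still ends up in the query, so ranking by raw frequency alone may
favour documents about the same product over documents of the same type.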

I am wondering whether something like this already exists, or whether
someone has a better algorithm or solution. Or am I headed in the wrong
direction with this algorithm? Any advice would be greatly appreciated.

Thanks
-Hareesh 



