lucene-java-user mailing list archives

From Marcel Hofmann <Marcel.Hofm...@web.de>
Subject Content-based similarity search in vector-space for Lucene
Date Tue, 02 Nov 2004 15:18:54 GMT
Hello!

For my diploma thesis (available in German), I have written a similarity
search that, for a given document (the query), returns documents whose
content is similar to the query document to a graded degree. With this
functionality one can find, e.g., different versions of a document,
plagiarized copies of a publication, or related articles in the archive
of a scientific journal.

The documents were indexed with Lucene 1.4 and represented as
term vectors inside the Lucene index. For searching, a real
vector space retrieval model (not an advanced Boolean model) based on
Gerard Salton's SMART retrieval system was implemented, including
tf-idf weighting and a cosine similarity function. The whole search space
is explored; no heuristic methods are used at this time, but they can be
retrofitted.
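
To illustrate the idea, here is a minimal, self-contained sketch of tf-idf
weighting with cosine similarity and an exhaustive scan over the collection.
It is not the thesis code and does not touch the Lucene API: the class name
VectorSpaceSketch, the simple tokenizer, and the weighting variant
tf * ln(N / df) are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene or of the prototype.
final class VectorSpaceSketch {

    /** Raw term frequencies of one document (lower-cased, split on non-word characters). */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Number of documents in which each term occurs. */
    public static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Assumed weighting: tf * ln(N / df). Terms unknown to the collection are dropped. */
    public static Map<String, Double> tfIdf(Map<String, Integer> tf,
                                            Map<String, Integer> df, int n) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer d = df.get(e.getKey());
            if (d != null) {
                weights.put(e.getKey(), e.getValue() * Math.log((double) n / d));
            }
        }
        return weights;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is all zeros. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Exhaustive search: every document is scored against the query, no pruning. */
    public static double[] scoreAll(List<String> collection, String query) {
        List<Map<String, Integer>> tfs = new ArrayList<>();
        for (String doc : collection) {
            tfs.add(termFrequencies(doc));
        }
        Map<String, Integer> df = documentFrequencies(tfs);
        int n = collection.size();
        Map<String, Double> queryVector = tfIdf(termFrequencies(query), df, n);
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            scores[i] = cosine(queryVector, tfIdf(tfs.get(i), df, n));
        }
        return scores;
    }
}
```

Calling VectorSpaceSketch.scoreAll(collection, query) returns one score per
document in collection order. Sorting by that score gives the ranking. A
document identical to the query scores (up to rounding) 1, and a document
sharing no terms with it scores 0.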

I have made a shortened version of the diploma prototype available. It
includes a GUI and one sample document collection (CIA Factbook), but not
the sources of the project:

http://www.informatik.htw-dresden.de/~s4328/pub/diploma_Marcel_Hofmann.zip

The prototype can be started with prototype/deploy/diploma.bat
(sorry to all non-Windows users). The included readme.txt lists the
original content of the prototype, not that of the shortened version.

I would like to contribute a library to the Lucene project that contains
the core of the implementation (vector space, cosine similarity, ...).
All you have to do is answer this mail, ask for the library, and give
me some hints...

Greetings from Saxony, Germany
Marcel Hofmann
Marcel.Hofmann@web.de



