lucene-java-user mailing list archives

From Jan Rygl <ji...@rare-technologies.com>
Subject Search similar documents using dense vectors (alternative to MORELIKETHIS)
Date Wed, 24 Feb 2016 17:45:48 GMT
Hello,

I would like to ask whether anybody has tried, or plans, to implement indexing
for dense vectors. The default scoring process is suited only to text
documents, so we would like to use, support, or develop a plugin that combines
the default index with a dense vector index, or replaces it, for non-textual
documents.

Our documents are represented by both text and float vectors.
We would like to find documents similar to a given document by using its
vector directly, rather than converting the document into a query as
MORELIKETHIS does.
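
To make the intended behaviour concrete, here is a minimal sketch of a brute-force "similar documents" lookup over document vectors. It is only a baseline for illustration and not a Lucene API: the function names, the use of cosine similarity, and the sample vectors are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id, vectors, top_k=3):
    """Rank all other documents by similarity to the vector of doc_id.

    `vectors` is a hypothetical in-memory map of document id -> float vector.
    """
    query = vectors[doc_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != doc_id
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up sample data, purely for illustration.
vectors = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.1, 0.9],
}
neighbours = most_similar("doc-a", vectors, top_k=2)
```

This scans every vector for each query, so it only works for small collections; an approximate index is what would replace the linear scan at scale.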

There is a technique for encoding vectors as text, but it is not very accurate:
 * the float values 0.0, 0.1 and 0.8 at one vector position lie at different
distances, |0.0 - 0.1| < |0.1 - 0.8|, but the encoded strings do not reflect
this: 'V1-0.00-0.05' ~ 'V1-0.05-0.10' ~ 'V1-0.80-0.85' are simply three
distinct terms.
We would therefore like to search the whole dense vector in the Lucene index
(using an existing vector index technique, e.g.
https://github.com/spotify/annoy).
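
The loss of distance information can be shown with a small sketch. The encoder below is an assumption about how such a scheme might work: the 0.05 bucket width and the half-open [low, high) boundaries are guesses based on the token format above, so the token produced for 0.1 may differ from the one quoted.

```python
def encode_component(position, value, width_hundredths=5):
    """Encode one vector component as a bucket token such as 'V1-0.00-0.05'.

    Works in integer hundredths to avoid floating-point edge cases.
    Buckets are assumed to be half-open: [low, high).
    """
    hundredths = round(value * 100)
    bucket = hundredths // width_hundredths
    low = bucket * width_hundredths / 100
    high = (bucket + 1) * width_hundredths / 100
    return f"V{position}-{low:.2f}-{high:.2f}"


def token_match(a, b, position=1):
    """Term-level similarity: 1 if both values share a token, else 0."""
    same = encode_component(position, a) == encode_component(position, b)
    return 1 if same else 0


# The numeric distances differ a lot ...
near = abs(0.0 - 0.1)
far = abs(0.1 - 0.8)

# ... but at the term level both pairs are just "different tokens".
near_match = token_match(0.0, 0.1)
far_match = token_match(0.1, 0.8)
```

Both `near_match` and `far_match` are 0, so a term-based index cannot tell the close pair from the distant one.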

My question is whether anybody has tested this functionality before, and
what your opinion is of implementing it. Is it technically possible to
write a plugin supporting this functionality (with a separate distributed
index and a separate scoring function), or is it better to store the
dense vector index outside of Lucene?

Thank you for your insight and time,
Jimmy
