lucene-issues mailing list archives

From "Tomoko Uchida (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search
Date Sat, 23 Nov 2019 16:24:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980772#comment-16980772 ]

Tomoko Uchida edited comment on LUCENE-9004 at 11/23/19 4:23 PM:
-----------------------------------------------------------------

Just a status update:
 [my PoC branch|https://github.com/mocobeta/lucene-solr-mirror/tree/jira/LUCENE-9004-aknn] is still at a pretty early stage and works only on one segment, but it can now index and query arbitrary vectors, as shown in [this example code|https://gist.github.com/mocobeta/5c174ee9fc6408470057a9e7d2020c45]. The newly added KnnGraphQuery is an extension of the Query class, so it should be combinable with other queries, with some limitations, because a kNN query by nature cannot score the entire dataset. Indexing performance is poor for now (it takes a few minutes for hundreds of thousands of vectors with 100 dimensions on a commodity PC), but search doesn't look too bad (~30 msec for the same dataset), thanks to the skip-list-like graph structure.
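As a rough illustration of what combining it with other queries could look like (the KnnGraphQuery constructor shown here is only an assumption, not necessarily what the PoC branch actually exposes):
{code:java}
// Sketch only: KnnGraphQuery and its (field, queryVector, k) constructor are
// assumptions about the PoC branch, not a committed API.
static TopDocs searchVectorsWithFilter(IndexSearcher searcher, float[] queryVector) throws IOException {
  Query knn = new KnnGraphQuery("vector", queryVector, 10);
  // Combine with a conventional filter clause; since the kNN query only scores
  // the documents its graph traversal visits, the combination inherits that limitation.
  Query combined = new BooleanQuery.Builder()
      .add(knn, BooleanClause.Occur.MUST)
      .add(new TermQuery(new Term("category", "news")), BooleanClause.Occur.FILTER)
      .build();
  return searcher.search(combined, 10);
}
{code}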

On my current branch I wrapped {{BinaryDocValues}} to store vector values. However, exposing
random access capability for doc values (or its extensions) can be controversial, so I'd like
to propose a new codec which combines 1. the HNSW graph and 2. the vectors (float arrays).

The new format for each vector field would have three parts (in other words, three files in
a segment). They would look like:
{code:java}
 
 Meta data and index part:
 +--------------------------------------------------+
 | meta data                                        |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 |              ......                              |
 +--------+-----------------------------------------+

 Graph data part:
 +-------------------------+---------------------------+---------+-------------------------+
 | friends list at layer N | friends list at layer N-1 |  ...... | friends list at layer 0 | <- friends lists for doc 0
 +-------------------------+---------------------------+---------+-------------------------+
 | friends list at layer N | friends list at layer N-1 |  ...... | friends list at layer 0 | <- friends lists for doc 1
 +-------------------------+---------------------------+---------+-------------------------+
 |                                          ......                                         | <- and so on
 +-----------------------------------------------------------------------------------------+

 Vector data part:
 +----------------------+
 | encoded vector value | <- vector value for doc 0
 +----------------------+
 | encoded vector value | <- vector value for doc 1
 +----------------------+
 |   ......             | <- and so on
 +----------------------+

{code}
 - "meta data" includes: number of dimensions, distance function for similarity calculation,
and other field level meta data
 - "doc id" is: doc ids having a vector value on this field
 - "friends list at layer N" is: a delta encoded target doc id list where each target doc
is connected to the doc at Nth layer
 - "encoded vector value" is: a fixed length byte array. the offset of the value can be calculated
on the fly. (limitations: each document can have only one vector value for each vector field)

The graph data (friends lists) is relatively small, so we could keep all of it on the Java heap for fast retrieval (though some off-heap strategy might be required for very large graphs). The vector data (vector values) is large and only a small fraction of it is needed when searching, so it should be kept on disk and accessed on demand.
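Since the values are fixed-length, the lookup into the vector data part is just arithmetic on the document's dense ordinal. A minimal sketch, assuming an {{IndexInput}} over the vector data file and a plain fixed-length float encoding (the method and encoding details are illustrative, not the actual codec):
{code:java}
// Each vector value occupies numDims * Float.BYTES bytes, so the i-th value
// (by dense ordinal within the field) starts at i * numDims * Float.BYTES.
static float[] readVector(IndexInput vectorData, int ord, int numDims) throws IOException {
  long offset = (long) ord * numDims * Float.BYTES;   // computed on the fly, no per-value index needed
  vectorData.seek(offset);
  byte[] bytes = new byte[numDims * Float.BYTES];
  vectorData.readBytes(bytes, 0, bytes.length);
  float[] vector = new float[numDims];
  ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);  // decode the fixed-length encoding
  return vector;
}
{code}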

Feedback is welcome.

And I have a question about introducing new formats: is there a way to inject an XXXFormat into the indexing chain so that we can add this feature without any changes to {{lucene-core}}?
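For reference, a minimal sketch of the per-field extension point that already exists (wiring a custom {{DocValuesFormat}} through the codec), not a new indexing-chain hook; {{MyVectorDocValuesFormat}} and the field-name convention are hypothetical, and whether this is enough for a genuinely new format without touching {{lucene-core}} is exactly the open question above:
{code:java}
// Route selected fields to a custom DocValuesFormat by overriding the default codec.
// The custom format also needs an SPI entry
// (META-INF/services/org.apache.lucene.codecs.DocValuesFormat) so it can be
// looked up by name at read time.
public class VectorCodec extends Lucene80Codec {
  private final DocValuesFormat vectorFormat = new MyVectorDocValuesFormat(); // hypothetical

  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    return field.endsWith("_vector") ? vectorFormat : super.getDocValuesFormatForField(field);
  }
}

// usage:
// IndexWriterConfig iwc = new IndexWriterConfig(analyzer).setCodec(new VectorCodec());
{code}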



> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>         Attachments: hnsw_layered_graph.png
>
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries
and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring
various approaches to this, including providing vector-based scoring functions. This is a
spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers have found that an approach based on navigating a graph that partially encodes the nearest-neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest-neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest-neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time, where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, i.e. both ways) to the new document, then the graph that emerges will have the desired properties.
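> A rough sketch of the traversal just described ({{HnswGraph}}, {{VectorReader}}, and {{distance()}} are stand-ins, and the single-best-candidate loop is a simplification; the real algorithm keeps a beam of candidates per layer and repeats this from the top layer down):
{code:java}
// Greedy descent at one layer: keep moving to the closest friend until no
// neighbor is closer than the current node, then use the result as the
// entry point for the layer below.
static int greedySearchLayer(int entryPoint, float[] query,
                             HnswGraph graph, VectorReader vectors) throws IOException {
  int current = entryPoint;
  float currentDist = distance(query, vectors.get(current));
  boolean improved = true;
  while (improved) {
    improved = false;
    for (int neighbor : graph.neighbors(current)) {      // the node's friends list at this layer
      float d = distance(query, vectors.get(neighbor));
      if (d < currentDist) {
        current = neighbor;
        currentDist = d;
        improved = true;
      }
    }
  }
  return current;
}
{code}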
> The implementation I propose for Lucene is as follows. We need two new data structures
to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}}
(we also want to encode the vector dimension and have efficient conversion from bytes to floats).
For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids
of the related documents. Encoding the interdocument relations using docids directly will
make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph per segment, since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However, graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
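> A minimal sketch of reading a document's friends list back from that {{SortedNumericDocValues}} encoding (the field name "knn-graph" is illustrative):
{code:java}
// Per-segment adjacency lookup: the values stored for a doc are the docids
// of its neighbors within the same segment.
static int[] neighborsOf(LeafReader reader, int docId) throws IOException {
  SortedNumericDocValues graph = reader.getSortedNumericDocValues("knn-graph");
  if (graph == null || !graph.advanceExact(docId)) {
    return new int[0];                        // no graph entry for this doc in this segment
  }
  int[] neighbors = new int[graph.docValueCount()];
  for (int i = 0; i < neighbors.length; i++) {
    neighbors[i] = (int) graph.nextValue();
  }
  return neighbors;
}
{code}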
> At index time, however, merging graphs is somewhat challenging. While indexing we build
a graph incrementally, performing searches to construct links among neighbors. When merging
segments we must construct a new graph containing elements of all the merged segments. Ideally
we would somehow preserve the work done when building the initial graphs, but at least as
a start I'd propose we construct a new graph from scratch when merging. The process is going
to be limited, at least initially, to graphs that can fit in RAM since we require random
access to the entire graph while constructing it: In order to add links bidirectionally we
must continually update existing documents.
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins the vectors and the graph together in one field type. Mostly it just looks like a vector-valued field, but with this graph attached to it.
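> A sketch of what that could look like at indexing time ({{KnnGraphField}} and its constructor here are hypothetical, not a committed API):
{code:java}
// One joint field carrying the vector; the graph would be built and maintained on the codec side.
Document doc = new Document();
doc.add(new StringField("id", "doc-1", Field.Store.YES));
doc.add(new KnnGraphField("vector", new float[] {0.12f, 0.45f, 0.31f /* ... */}));
writer.addDocument(doc);
{code}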
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, the basic design is not really settled, and there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with an exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
> [1] https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164



