lucene-solr-user mailing list archives

From Mathias Lux <m...@itec.uni-klu.ac.at>
Subject Re: Is there a way to store binary data (byte[]) in DocValues?
Date Mon, 12 Aug 2013 16:25:01 GMT
Hi Robert,

I'm basically "mis-using" Solr for content-based image search. I
have indexed fields (hashes) for candidate selection, i.e. 1,500
candidate results are retrieved with the IndexSearcher by hashes, which I
then have to re-rank based on numeric vectors I'm storing in byte[]
arrays. I had an implementation based on the binary field, but
reading from an index with a lot of small stored fields is not a good
idea with the current compression approach (I've already discussed
this in the Lucene user group :). BINARY is the thing for me to go
for; as you said, there's nothing else in it, just the values.
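
A minimal sketch of how a numeric vector can be turned into a byte[]
and back, so that it fits into a BINARY value. The class name
VectorCodec and the fixed layout (4 big-endian bytes per float) are
assumptions made for illustration only; this is not necessarily the
byte layout LIRE uses.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: converts a float vector to a byte[] and back.
public class VectorCodec {

    // Writes each float as 4 bytes (ByteBuffer defaults to big-endian).
    public static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Reads the floats back in the same order they were written.
    public static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buf.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {0.5f, 1.25f, -3.0f};
        byte[] stored = encode(original);
        System.out.println("bytes stored: " + stored.length);
        System.out.println("round trip ok: "
                + Arrays.equals(original, decode(stored)));
    }
}
```

The byte[] produced this way is what would be handed to the BINARY
value at index time and decoded again during re-ranking.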

Another reason for not using the SORTED_SET and SORTED
implementations is that Solr currently works with Strings there, and
I want to have a small memory footprint for millions of images ...
which does not go well with immutable objects.

However, I already have a solution, which I was just about to post
here when I saw your answer. Basically I copied the source of
BinaryField and changed it into a BinaryDocValuesField (see line 68 at
http://pastebin.com/dscPTwhr). This works out well for indexing when
you adapt the schema to use this class:

[...]
<!-- ColorLayout -->
<field name="cl_ha" type="text_ws" indexed="true" stored="false"
required="false"/>
<field name="cl_hi" type="binaryDV"  indexed="false" stored="true"
required="false"/>
[...]
<fieldtype name="binaryDV"
class="net.semanticmetadata.lire.solr.BinaryDocValuesField"/>
[...]

I then have a custom request handler that does the search for me:
first based on the hashes (field cl_ha, treated as whitespace-
delimited terms), and then by re-ranking the first 1,500 results based
on the DocValues.
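
The re-ranking step can be sketched roughly as follows. This is a
plain-Java illustration, not the actual request handler: the class
ReRanker, the Map standing in for the per-document DocValues lookup,
and the use of Euclidean distance are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: re-rank hash-based candidates by vector distance.
public class ReRanker {

    // A candidate document together with its distance to the query.
    public static final class Scored {
        public final int docId;
        public final double distance;

        public Scored(int docId, double distance) {
            this.docId = docId;
            this.distance = distance;
        }
    }

    // Euclidean distance (assumed metric; vectors must have equal length).
    public static double l2(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // 'vectors' stands in for reading the stored vector of each document.
    public static List<Scored> rerank(float[] query, List<Integer> candidates,
                                      Map<Integer, float[]> vectors, int topN) {
        List<Scored> scored = new ArrayList<>();
        for (int docId : candidates) {
            float[] v = vectors.get(docId);
            if (v == null) {
                continue; // document has no vector stored
            }
            scored.add(new Scored(docId, l2(query, v)));
        }
        // Smallest distance first.
        scored.sort(Comparator.comparingDouble((Scored s) -> s.distance));
        return new ArrayList<>(scored.subList(0, Math.min(topN, scored.size())));
    }

    public static void main(String[] args) {
        Map<Integer, float[]> vectors = new HashMap<>();
        vectors.put(1, new float[]{0f, 0f});
        vectors.put(2, new float[]{3f, 4f});
        vectors.put(3, new float[]{1f, 0f});
        List<Scored> top = rerank(new float[]{0f, 0f},
                Arrays.asList(2, 3, 1), vectors, 2);
        for (Scored s : top) {
            System.out.println(s.docId + " " + s.distance);
        }
    }
}
```

In the real setup the candidate list would be the first 1,500 hits of
the hash query, and the vectors would be decoded from the byte[]
values instead of coming from a Map.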

Now it works rather fast; a demo with 1M images is available at
http://demo-itec.uni-klu.ac.at/liredemo/ . The hash-based search time
is still not optimal, but that's an issue of the distribution of
terms, which is not optimal for this kind of index (you can find the
runtime, split into search and re-rank, at the end of the page).

I'll put the whole (open, GPL-ed) source online at the end of
September (as a module of LIRE), after some stress tests,
documentation and further bug fixing.

cheers,
  Mathias

On Mon, Aug 12, 2013 at 4:51 PM, Robert Muir <rcmuir@gmail.com> wrote:
> On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux <mlux@itec.uni-klu.ac.at> wrote:
>> Hi!
>>
>> I'm basically searching for a method to put byte[] data into Lucene
>> DocValues of type BINARY (see [1]). Currently only primitives and
>> Strings are supported according to [1].
>>
>> I know that this can be done with a custom update handler, but I'd
>> like to avoid that.
>>
>
> Can you describe a little bit what kind of operations you want to do with it?
> I don't really know how BinaryField is typically used, but maybe it
> could support this option. On the other hand adding it to BinaryField
> might not "buy" you much without some additional stuff depending upon
> what you need to do. Like if you really want to do sort/facet on the
> thing, SORTED(SET) would probably be a better implementation: it
> doesn't care that the values are binary.
>
> BINARY, SORTED, and SORTED_SET actually all take byte[]: the difference is:
> * SORTED: deduplicates/compresses the unique byte[]'s and gives each
> document an ordinal number that reflects sort order (for
> sorting/faceting/grouping/etc)
> * SORTED_SET: similar, except each document has a "set" (which can be
> empty), of ordinal numbers (e.g. for faceting multivalued fields)
> * BINARY: just stores the byte[] for each document (no deduplication,
> no compression, no ordinals, nothing).
>
> So for sorting/faceting: BINARY is generally not very efficient unless
> there is something custom going on: for example lucene's faceting
> package stores the "values" elsewhere in a separate taxonomy index, so
> it uses this type just to encode a delta-compressed ordinal list for
> each document.
>
> For scoring factors/function queries: encoding the values inside
> NUMERIC(s) [up to 64 bits each] might still be best on average: the
> compression applied here is surprisingly efficient.
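
To illustrate the suggestion quoted above of encoding values inside
NUMERIC (up to 64 bits each): a minimal sketch that packs four
unsigned 16-bit values into one long and unpacks them again. The class
name LongPacking and the 4 x 16-bit layout are assumptions chosen for
the example; other splits of the 64 bits work the same way.

```java
import java.util.Arrays;

// Hypothetical sketch: pack four unsigned 16-bit values into one long.
public class LongPacking {

    // values[0] ends up in the highest 16 bits, values[3] in the lowest.
    public static long pack(int[] values) {
        if (values.length != 4) {
            throw new IllegalArgumentException("need exactly 4 values");
        }
        long packed = 0L;
        for (int v : values) {
            if (v < 0 || v > 0xFFFF) {
                throw new IllegalArgumentException("value out of range: " + v);
            }
            packed = (packed << 16) | v;
        }
        return packed;
    }

    // Reverses pack(): extracts the four 16-bit values in original order.
    public static int[] unpack(long packed) {
        int[] values = new int[4];
        for (int i = 3; i >= 0; i--) {
            values[i] = (int) (packed & 0xFFFF);
            packed >>>= 16;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] original = {1, 2, 3, 65535};
        long packed = pack(original);
        System.out.println("round trip ok: "
                + Arrays.equals(original, unpack(packed)));
    }
}
```

This only fits very short vectors; longer feature vectors would need
several NUMERIC fields or the BINARY approach discussed in this thread.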



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec
