lucene-java-user mailing list archives

From Mathias Lux <m...@itec.uni-klu.ac.at>
Subject Re: Stored fields: decompression slows down in my scenario ... any idea for a workaround?
Date Mon, 24 Jun 2013 07:05:52 GMT
Hi!

Thanks!! I'll definitely try DocValues, and of course the smaller
chunk size. Just to add the number of bytes stored: it's, for
instance, 72 bytes for CEDD, ~96 for JCD, and 64 bytes for
OpponentHistogram, etc., and there are 0<n<10 fields per image (aka
document).
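
To put the chunk size suggestion into numbers, here is a rough back-of-the-envelope sketch. It assumes a document holds exactly the three features named above (72 + 96 + 64 = 232 bytes) and ignores per-field headers and how a particular Lucene version fills or reads a chunk, so treat the results as estimates only. The class and method names are made up for illustration.

```java
public final class ChunkEstimate {

    /** Rough number of whole documents that fit into one stored-fields chunk (at least 1). */
    static int docsPerChunk(int chunkSizeBytes, int bytesPerDoc) {
        if (chunkSizeBytes <= 0 || bytesPerDoc <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return Math.max(1, chunkSizeBytes / bytesPerDoc);
    }

    public static void main(String[] args) {
        // Assumption: CEDD + JCD + OpponentHistogram per document, no per-field overhead.
        int bytesPerDoc = 72 + 96 + 64;
        for (int shift = 10; shift <= 14; shift++) {
            int chunkSize = 1 << shift;
            System.out.printf("chunk=%5d bytes -> ~%d docs per chunk%n",
                    chunkSize, docsPerChunk(chunkSize, bytesPerDoc));
        }
    }
}
```

The fewer documents share a chunk, the less unrelated data has to be decompressed to reach one document, at the cost of a worse compression ratio.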

cheers,
  Mathias

On Sun, Jun 23, 2013 at 9:08 PM, Savia Beson <eksdev@googlemail.com> wrote:
> Uwe,
> I think Mathias was talking about the case with many smallish fields that all get read
> per document. The DV approach would mean seeking N times, while stored fields seek only once? Or
> did you mean he should encode all his fields into a single byte[]?
>
> Or did I get it all wrong about stored vs DV :)
>
> What helped a lot in a similar case was to write our own codec and reduce the chunk size to something
> smallish, depending on your average document size… there is a sweet spot somewhere between compression and speed.
>
> Simply make your own Codec and delegate to:
>
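> import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
> import org.apache.lucene.codecs.compressing.CompressionMode;
>
> public final class MySmallishChunkStoredFieldFormat extends CompressingStoredFieldsFormat {
>
>   /** Sole constructor. */
>   public MySmallishChunkStoredFieldFormat() {
>     // 1 << 12 is a 4 KB chunk size.
>     // TODO: try different chunk sizes, maybe 1-2 KB (1 << 10 or 1 << 11)?
>     super("YourFormatName", CompressionMode.FAST, 1 << 12);
>   }
>
> }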
>
>
> On Jun 23, 2013, at 7:40 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>
>> Hi,
>>
>> To do this type of processing, use the new DocValues field type. They are like FieldCache
>> but persisted to disk. Different data types exist and can be used to get random access based
>> on document number. They are organized as column-stride fields, which means each column is a separate
>> data structure with random access, like a big array (persisted on disk).
>>
>> Stored Fields should *only* ever be used to display search results!
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: mathias.lux@gmail.com [mailto:mathias.lux@gmail.com] On Behalf Of
>>> Mathias Lux
>>> Sent: Sunday, June 23, 2013 7:27 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Stored fields: decompression slows down in my scenario ... any idea
>>> for a workaround?
>>>
>>> Hi!
>>>
>>> I'm managing the development of LIRE
>>> (https://code.google.com/p/lire/), an image search toolbox based on Lucene.
>>> While optimizing different search routines for global image features I got
>>> around to taking a look at the CPU usage, i.e. to see if my new distance
>>> function is faster than the old one :)
>>>
>>> Unfortunately I found out that the decompression routine for stored fields
>>> accounted for nearly 60% of the search time. (see
>>> http://www.semanticmetadata.net/?p=1092)
>>>
>>> So what I basically do is open each document in an index sequentially,
>>> compute its distance to a query feature and maintain my result list. The
>>> image features are in stored fields, as byte[] arrays. I optimized quite a lot to
>>> get them really small and fast to parse and store.
>>>
>>> I know that this is not the way Lucene is intended to be used; I've been working with
>>> Lucene for years now :) And just to assure you: approximate indexing and
>>> local feature search are based on terms, ... and fast.
>>> But linear search makes up an important part of LIRE, so I'd be glad to get
>>> some suggestions on how either to disable compression, or how to sneak in
>>> byte[] data with some textual data that is "fast as hell" to read.
>>>
>>> cheers,
>>>  Mathias
>>>
>>> ps. I know that it'd be possible to write it to a data file, put it into memory
>>> and gain a lot of speed. But of course I'd prefer to maintain "just one" index
>>> and not two of them :)
>>>
>>> --
>>> Dr. Mathias Lux
>>> Assistant Professor, Klagenfurt University, Austria
>>> http://tinyurl.com/mlux-itec
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
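
On the question in the quoted thread about encoding all fields into a single byte[]: a minimal sketch of one possible layout, assuming every feature is at most 255 bytes long (true for the sizes mentioned in this thread) and that there are at most 255 features per image. The layout and the names FeaturePacker, pack and unpack are hypothetical and not part of Lucene or LIRE; it only shows how several small features could travel as one binary value per document.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class FeaturePacker {

    /** Packs features as [count][len0][bytes0][len1][bytes1]... with one byte per length. */
    static byte[] pack(List<byte[]> features) {
        if (features.size() > 255) {
            throw new IllegalArgumentException("more than 255 features");
        }
        int total = 1; // one byte for the feature count
        for (byte[] f : features) {
            if (f.length > 255) {
                throw new IllegalArgumentException("feature longer than 255 bytes");
            }
            total += 1 + f.length; // one length byte plus the payload
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.put((byte) features.size());
        for (byte[] f : features) {
            buf.put((byte) f.length);
            buf.put(f);
        }
        return buf.array();
    }

    /** Reverses pack(): reads the count, then each length-prefixed feature. */
    static List<byte[]> unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int count = buf.get() & 0xFF; // mask to read the byte as unsigned
        List<byte[]> features = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            byte[] f = new byte[buf.get() & 0xFF];
            buf.get(f);
            features.add(f);
        }
        return features;
    }

    public static void main(String[] args) {
        // Dummy payloads with the sizes quoted in this thread (CEDD, JCD, OpponentHistogram).
        byte[] cedd = new byte[72];
        byte[] jcd = new byte[96];
        byte[] opp = new byte[64];
        Arrays.fill(cedd, (byte) 1);
        Arrays.fill(jcd, (byte) 2);
        Arrays.fill(opp, (byte) 3);

        byte[] packed = pack(Arrays.asList(cedd, jcd, opp));
        List<byte[]> restored = unpack(packed);
        System.out.println("packed bytes: " + packed.length + ", features: " + restored.size());
    }
}
```

With this layout the overhead is one byte per feature plus one byte for the count, so the three features above take 236 bytes in total, and a reader needs only one lookup per document instead of one per field.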



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec
