lucene-dev mailing list archives

From eksdev <>
Subject Re: StoredField
Date Sun, 17 Mar 2013 16:56:13 GMT
Hi Adrien,
I cannot tell whether such a thing would make it less or more robust; I am just thinking aloud :)

I am thinking of it as a way to postpone the byte[] -> type conversion until the moment
it is really needed. Simply put, keep the byte[] around as long as possible.
*Theoretically*, this should reduce GC pressure and the memory footprint for some types of downstream
processing. It all depends on how easy something like that would be to do.
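
A minimal sketch of the idea above, in plain Java and independent of Lucene. The class name LazyStringField and all of its methods are hypothetical, invented only for illustration; nothing like this is claimed to exist in the Lucene API. It assumes the stored value is a UTF-8 encoded string and shows the bytes being kept as-is, with the String created only when a caller actually asks for it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical holder (not a Lucene class): keeps the serialized bytes and decodes on demand. */
final class LazyStringField {
    private final byte[] utf8; // serialized form, exactly as read from storage
    private String decoded;    // null until stringValue() is first called

    LazyStringField(byte[] utf8) {
        this.utf8 = utf8;
    }

    /** Returns the raw bytes; no type conversion is performed. */
    byte[] rawBytes() {
        return utf8;
    }

    /** Decodes the UTF-8 bytes on the first call only and caches the result. */
    String stringValue() {
        if (decoded == null) {
            decoded = new String(utf8, StandardCharsets.UTF_8);
        }
        return decoded;
    }

    /** True once the String has been materialized. */
    boolean isDecoded() {
        return decoded != null;
    }

    /** Identity check on the serialized form; never allocates a String. */
    boolean sameBytes(LazyStringField other) {
        return Arrays.equals(utf8, other.utf8);
    }
}
```

With a holder like this, downstream code that only compares or forwards values would never pay for the conversion, while code that needs the String still gets it through stringValue(). Whether this would be a net win inside Lucene is exactly the open question of this thread.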

There is already a way to achieve this by using the binary field type, …  hmm, maybe some
hack to make Lucene treat every field as binary would be simple and robust enough?
e.g. Visitor.transportOnlySerializedValuesWithoutTypeConversion()


By the way, the TimSort trick in Sorter worked great. For 1.1 million short documents, the
time to sort an unsorted index on a handful of stored fields went from 490 seconds to 380.
Congrats and thanks for it! It also improved compression by 12% (with a very small, 4k chunk size).

On Mar 17, 2013, at 5:26 PM, Adrien Grand <> wrote:

> Hi,
> On Sun, Mar 17, 2013 at 2:58 PM, eksdev <> wrote:
>> Sure, there is a way to turn anything into byte[] ;)
>> It looks like this byte[] -> type conversion is done deep down, and the
>> visitor user API already receives the converted types …
>> Maybe an idea would be to delay the byte[] -> type conversion until field access
>> time; I do not know what mines would be on the road to doing that.
>> Use cases that require identity checks, or non-locale-specific sorting and
>> the like, would benefit from having raw, serialized representations without type
>> conversion…. Anyhow, I could switch over to byte[] fields completely to do
>> it…
> I understand that it is frustrating to perform a String -> byte[]
> conversion if Lucene just did the opposite. But because it needs to
> perform one random seek per document (on a file which is often large),
> the stored fields API is much slower than a String -> UTF-8 bytes
> conversion, so I think we should keep the API robust rather than
> allowing for these kinds of optimizations?
> -- 
> Adrien
