lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Toke Eskildsen (JIRA)" <>
Subject [jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util
Date Wed, 20 Jan 2010 16:44:54 GMT


Toke Eskildsen commented on LUCENE-1990:

Introducing yet another level of indirection and making a byte/short/int/long-prvider detached
from the implementation of the packed values it tempting.

You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.

No, I mean the backing array of ints or longs. For memory, the obvious choice is int[] or
long[], but designing for flexibility calls for an API, which coincidentally is identical
to Reader. So, a 4-bit Aligned could be backed by a DirectInt (which will contain an int[])
or a Persistent (doing mmap or some other persistent-oriented lookup).

...I should show this in code instead. I'll try and find the time.

Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1?

Not in the specific scenario, no.

The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence.
We'll be in a situation where the index will be optimized for the architecture used for building,
not the one used for searching. Leaving the option of a future mmap open means that it is
not possible to do a conversion when retrieving the bits, so I have no solution for this (other
than doing memory-only).

I'm confused - a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?

Also, what do you mean by optimized for building not searching?

When we're using Aligned, the choice of using int or long for the backing array dictates how
the persistent bitstream will be. If the index builder is a 64 bit machine and 7 bits/value
is used, the result will be longs, each containing 9 values.

When the structure is read into memory by the searcher, the backing array will again be long.
But if the searcher happens to be a 32 bit machine, the performance will be less than if ints
were used for the backing array.

One way to handle this is to do a conversion, when loading into memory. If the searcher is
64 bit, it will always convert into longs. If the searcher is 32 bit, it will convert into
int, if the bits/value is <= 32. The conversion is cheap, so this is no problem in itself.

However, if we're planning for the future (or for flexibility, depending on point of view),
we would very much like the persistent format to be directly usable, so that we don't need
to load the whole structure into memory. This rules out conversion and sets us back to step
1: The index will be optimized for either 32bit or 64 bit searchers.

Oh well, we could always just ignore it and say that Aligned is 64 bit based. As it is more
memory-efficient than Aligned on 32 bit machines, maybe the slightly smaller number of backing
longs will compensate for the overhead of retrieving longs on a 64 bit machine.

Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 values?

My guess is that the number of values vs. the available level 2 cache will play a big role
here: For a relatively small number of values, the added logic will be costly. For a larger
number of values, the cache-misses will dwarf that.

> Add unsigned packed int impls in oal.util
> -----------------------------------------
>                 Key: LUCENE-1990
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990.patch,
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl.  EG the terms dict index in
> the standard codec in LUCENE-1458 could subsantially reduce it's RAM
> usage.  FieldCache.StringIndex could as well.  And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs  {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting.  If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
>   PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is an good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message