[ https://issues.apache.org/jira/browse/LUCENE4062?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Adrien Grand updated LUCENE4062:

Attachment: LUCENE4062.patch
Oh right, cases are swapped, I uploaded the wrong patch. Here is a correct version.
I guess there are two options:
1. always writing the most compact form to disk and deserializing it in a fast {{PackedInts.Mutable}}
implementation (depending on acceptableOverheadRatio),
2. writing a possibly aligned version on disk (depending on acceptableOverheadRatio) and
then selecting the fastest PackedInts.Mutable implementation that exactly matches bitsPerValue
(with no padding bits).
A third option could be to write padding bits ({{Packed64SingleBlock}} subclasses may have
such padding bits) as well, but I really dislike the fact that the ondisk format is implementationdependent.
Option 1 is likely to make deserialization slower since some decoding might occur (the copyData
method) but on the other hand, option 2 would prevent us from using implementations that add
padding bits (such as Packed64SingleBlock21 which has one padding bit every 3 21bits integers
or Packed64SingleBlock12 which has 4 padding bits every 5 12bits integers but not Packed64SingleBlock{1,2,4}
since 64%{1,2,4}=0).
I initially chose option 1 because I think it is nice to have Packed64SingleBlock21 when bitsPerValue
is close to 21, since it might be significantly faster than Packed64.
In order to know how slower deserialization is (option 1), I ran a microbenchmark which:
 loads a PackedInts.Reader from a ByteArrayDataInput,
 performs n random reads on the reader.
Here is the code for the microbenchmark in case you would like to run it.
{code}
int valueCount = 10000000;
int bitsPerValue = 21;
int[] offsets = new int[valueCount];
Random random = new Random();
for (int i = 0; i < valueCount; ++i) {
offsets[i] = random.nextInt(valueCount);
}
byte[] bytes = new byte[valueCount * 4];
DataOutput out = new ByteArrayDataOutput(bytes);
PackedInts.Writer writer = PackedInts.getWriter(out, valueCount, bitsPerValue);
for (int i = 0; i < valueCount; ++i) {
writer.add(random.nextInt(1 << bitsPerValue));
}
writer.finish();
for (int i = 0; i < 50; ++i) {
long start = System.nanoTime();
DataInput in = new ByteArrayDataInput(bytes);
PackedInts.Reader reader = PackedInts.getReader(in, 0f); // Packed64
// PackedInts.Reader reader = PackedInts.getReader(in, 0.1f); // Packed64SingleBlock
for (int j = 0, n = valueCount; j < n; ++j) {
reader.get(offsets[j]);
}
long end = System.nanoTime();
System.out.println(end  start);
}
{code}
I ran this microbenchmark for bitsPerValue=21 and valueCount in (1 000 000, 10 000 000). The
loading time (n = 0) is 2x to 3x slower with Packed64SingleBlock21. However, as soon as you
perform valueCount/4 or more random reads (n >= valueCount/4), the total time is better
for Packed64SingleBlock21. When n=valueCount, the total time is even 2x better for Packed64SingleBlock21.
The loading overhead doesn't seem too bad.
However I guess that some people might still be more interested in the loading time than in
the query time (for writeintensive applications), but in this case they could still request
a reader in its compact form (Packed 64 or Direct* when bitsPerValue is 8, 16, 32 or 64).
If they want to be sure to have a fast reader too, they could also make sure that they used
8, 16, 32 or 64 as the bitsPerValue parameter of {{PackedInts.getWriter}}.
Mike, what do you think?
> More finegrained control over the packed integer implementation that is chosen
> 
>
> Key: LUCENE4062
> URL: https://issues.apache.org/jira/browse/LUCENE4062
> Project: Lucene  Java
> Issue Type: Improvement
> Components: core/other
> Reporter: Adrien Grand
> Assignee: Michael McCandless
> Priority: Minor
> Labels: performance
> Fix For: 4.1
>
> Attachments: LUCENE4062.patch, LUCENE4062.patch, LUCENE4062.patch, LUCENE4062.patch,
LUCENE4062.patch, LUCENE4062.patch
>
>
> In order to save space, Lucene has two main PackedInts.Mutable implentations, one that
is very fast and is based on a byte/short/integer/long array (Direct*) and another one which
packs bits in a memoryefficient manner (Packed*).
> The packed implementation tends to be much slower than the direct one, which discourages
some Lucene components to use it. On the other hand, if you store 21 bits integers in a Direct32,
this is a space loss of (3221)/32=35%.
> If you accept to trade some space for speed, you could store 3 of these 21 bits integers
in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is
that you never need to read more than one block to read or write a value, so this can be significantly
faster than Packed32 and Packed64 which always need to read/write two blocks in order to avoid
costly branches.
> I ran some tests, and for 10000000 21 bits values, this implementation takes less than
2% more space and has 44% faster writes and 30% faster reads. The 12 bits version (5 values
per block) has the same performance improvement and a 6% memory overhead compared to the packed
implementation.
> In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount,
bitsPerValue, acceptableOverheadPerValue)}} method. This method select the fastest implementation
that has less than {{acceptableOverheadPerValue}} wasted bits per value. For example, if you
accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is
pretty reasonable, here is what implementations would be selected:
> * 1: Packed64SingleBlock1
> * 2: Packed64SingleBlock2
> * 3: Packed64SingleBlock3
> * 4: Packed64SingleBlock4
> * 5: Packed64SingleBlock5
> * 6: Packed64SingleBlock6
> * 7: Direct8
> * 8: Direct8
> * 9: Packed64SingleBlock9
> * 10: Packed64SingleBlock10
> * 11: Packed64SingleBlock12
> * 12: Packed64SingleBlock12
> * 13: Packed64
> * 14: Direct16
> * 15: Direct16
> * 16: Direct16
> * 17: Packed64
> * 18: Packed64SingleBlock21
> * 19: Packed64SingleBlock21
> * 20: Packed64SingleBlock21
> * 21: Packed64SingleBlock21
> * 22: Packed64
> * 23: Packed64
> * 24: Packed64
> * 25: Packed64
> * 26: Packed64
> * 27: Direct32
> * 28: Direct32
> * 29: Direct32
> * 30: Direct32
> * 31: Direct32
> * 32: Direct32
> * 33: Packed64
> * 34: Packed64
> * 35: Packed64
> * 36: Packed64
> * 37: Packed64
> * 38: Packed64
> * 39: Packed64
> * 40: Packed64
> * 41: Packed64
> * 42: Packed64
> * 43: Packed64
> * 44: Packed64
> * 45: Packed64
> * 46: Packed64
> * 47: Packed64
> * 48: Packed64
> * 49: Packed64
> * 50: Packed64
> * 51: Packed64
> * 52: Packed64
> * 53: Packed64
> * 54: Direct64
> * 55: Direct64
> * 56: Direct64
> * 57: Direct64
> * 58: Direct64
> * 59: Direct64
> * 60: Direct64
> * 61: Direct64
> * 62: Direct64
> Under 32 bits per value, only 13, 17 and 2226 bits per value would still choose the
slower Packed64 implementation. Allowing a 50% overhead would prevent the packed implementation
to be selected for bits per value under 32. Allowing an overhead of 32 bits per value would
make sure that a Direct* implementation is always selected.
> Next steps would be to:
> * make lucene components use this {{getMutable}} method and let users decide what tradeoff
better suits them,
> * write a Packed32SingleBlock implementation if necessary (I didn't do it because I
have no 32bits computer to test the performance improvements).
> I think this would allow more finegrained control over the speed/space tradeoff, what
do you think?

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, email: devunsubscribe@lucene.apache.org
For additional commands, email: devhelp@lucene.apache.org
