[ https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722388#comment-13722388 ]

Boaz Leskes edited comment on LUCENE-5145 at 7/29/13 12:25 PM:

While making the above changes I did some measurements which I feel are also useful to share.
PackedInts trades CPU for better CPU cache & memory usage. PackedInts gives you an acceptableOverheadRatio
parameter to control the trade-off, but it is not exposed in the AbstractAppendingLongBuffer family,
which is based on those classes. This is especially important when you do not rely on AbstractAppendingLongBuffer.iterator()
to extract your data. Here are some experiments I ran on my laptop, using BenchmarkAppendLongBufferRead,
which is included in the patch. The program allows you to play with different read strategies
and data sizes and measure read times.
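To illustrate what acceptableOverheadRatio controls, here is a hedged, self-contained sketch of the selection idea, not Lucene's actual code; the helper name effectiveBitsPerValue is made up for illustration. The ratio bounds how many extra bits per value you will accept in exchange for faster, hardware-aligned storage:

```java
// Hedged sketch of how an acceptableOverheadRatio could pick a storage width.
// Not Lucene's actual implementation; effectiveBitsPerValue is hypothetical.
final class OverheadDemo {
    // Returns the bits actually spent per value: the smallest hardware-aligned
    // width (8/16/32/64) that fits within the accepted overhead, or the exact
    // packed width if none does.
    static int effectiveBitsPerValue(int bitsPerValue, float acceptableOverheadRatio) {
        int maxBits = bitsPerValue + (int) (acceptableOverheadRatio * bitsPerValue);
        for (int aligned : new int[] {8, 16, 32, 64}) {
            if (bitsPerValue <= aligned && aligned <= maxBits) {
                return aligned; // byte/short/int/long-aligned "Direct" storage
            }
        }
        return bitsPerValue; // tightly packed storage (Packed64-style)
    }

    public static void main(String[] args) {
        // ratio 0.00 keeps the exact 3 bits; ratio 7.00 allows padding to a byte
        System.out.println(effectiveBitsPerValue(3, 0f)); // 3
        System.out.println(effectiveBitsPerValue(3, 7f)); // 8
    }
}
```

This matches the "ratio 0.00 (i.e., 3 bits)" and "ratio 7.00 (i.e., 8 bits)" lines in the benchmark output below.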
This is the result of using AppendingDeltaPackedLongBuffer (previously called AppendingLongBuffer)
to sequentially read an array of 500,000 elements, using its get method. The data was uniformly
distributed numbers between 0 & 7. The program measures 10,000 such reads; the total time
is the time it took to perform all of them. The output also shows the number of bits
used to store the elements and the storage class used.
 Storage: DELTA_PACKED, Read: SEQUENTIAL, Read size: 1
SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 22.18s avg: 2.22ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 223.76kb)
SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 19.14s avg: 1.91ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 521.13kb)
As you can see, when retrieving elements one by one, the byte-based implementation is slightly
faster. For comparison, here is the new AppendingPackedLongBuffer with the same setup:
 Storage: PACKED, Read: SEQUENTIAL, Read size: 1
SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 16.69s avg: 1.67ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 13.47s avg: 1.35ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
Besides being faster, it shows the same behavior.
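The gap between the two storage classes comes down to how a single get is served: tightly packed storage must locate and shift bits out of long blocks (and sometimes stitch together a value that straddles two blocks), while byte-aligned storage does one array load. Here is an illustrative, self-contained sketch of both read paths; it is not the real Packed64/Direct8 code, just the same idea:

```java
// Illustrative sketch of the two read paths being benchmarked above.
// Not the actual Packed64/Direct8 implementations.
final class ReadPathDemo {
    // Packed64-style: values occupy exactly bitsPerValue bits inside long blocks.
    static long getPacked(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);   // which 64-bit word
        int shift = (int) (bitPos & 63);    // offset inside that word
        long mask = (1L << bitsPerValue) - 1;
        long value = (blocks[block] >>> shift) & mask;
        int bitsInFirst = 64 - shift;
        if (bitsInFirst < bitsPerValue) {   // value straddles two words
            value |= (blocks[block + 1] << bitsInFirst) & mask;
        }
        return value;
    }

    // Direct8-style: one value per byte; a single aligned load, no shifting.
    static long getDirect(byte[] bytes, int index) {
        return bytes[index] & 0xFFL;
    }

    public static void main(String[] args) {
        long[] blocks = {989L}; // the values 5, 3, 7, 1 packed at 3 bits each
        System.out.println(getPacked(blocks, 3, 2)); // 7
        System.out.println(getDirect(new byte[] {5, 3, 7, 1}, 2)); // 7
    }
}
```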
For random reads, the classes display similar behavior:
 Storage: DELTA_PACKED, Read: RANDOM, Read size: 1
SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 23.13s avg: 2.31ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 223.76kb)
SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 19.38s avg: 1.94ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 521.13kb)
 Storage: PACKED, Read: RANDOM, Read size: 1
SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 19.23s avg: 1.92ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 15.95s avg: 1.60ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
Next I looked at the effect of exposing the bulk reads offered by the PackedInts structures
in the AppendingLongBuffer family. Here are some results from the new packed implementation,
this time reading 4 & 16 consecutive elements in a single read.
 Storage: PACKED, Read: SEQUENTIAL, Read size: 4
SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 11.16s avg: 1.12ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
BULK GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 24.22s avg: 2.42ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 8.35s avg: 0.84ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
BULK GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 8.44s avg: 0.84ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
 Storage: PACKED, Read: SEQUENTIAL, Read size: 16
SINGLE GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 9.63s avg: 0.96ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
BULK GET: 3 bits ratio 0.00 (i.e., 3 bits) total time: 12.52s avg: 1.25ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 7.46s avg: 0.75ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
BULK GET: 3 bits ratio 7.00 (i.e., 8 bits) total time: 3.22s avg: 0.32ms, total
read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
As you can see, the bulk read API for the Direct8 class starts to pay off when reading 4 elements
or more. For Packed64 it is still slower when reading 16 elements at a time; on my MacBook
Air, it only started paying off at around 32 elements. This may be different on systems with
larger CPU caches.
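To make the bulk-read trade-off concrete: over byte-aligned storage a bulk get degenerates into a plain widening copy with no per-element bit arithmetic, which is why Direct8 benefits first. A hedged, self-contained sketch of such a bulk read (not the patch's actual API; the signature here is made up for illustration):

```java
// Illustrative bulk read over byte-aligned (Direct8-style) storage.
// The loop body is a plain widening copy that a JIT can unroll, with none of
// the per-element shift/mask work that packed storage must still perform.
final class BulkReadDemo {
    // Copies up to len elements starting at index into dst[off..], returning
    // the number of elements actually read.
    static int bulkGet(byte[] bytes, int index, long[] dst, int off, int len) {
        int n = Math.min(len, bytes.length - index); // clamp to available data
        for (int i = 0; i < n; i++) {
            dst[off + i] = bytes[index + i] & 0xFFL;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] dst = new long[4];
        int n = bulkGet(new byte[] {1, 2, 3, 4}, 1, dst, 0, 2);
        System.out.println(n + " " + dst[0] + " " + dst[1]); // 2 2 3
    }
}
```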
Those tests (and many more I played with) are of course very synthetic. I also ran the new
classes under our implementation of faceted search; the results were similar.
> Added AppendingPackedLongBuffer & extended AbstractAppendingLongBuffer family (customizable compression ratio + bulk retrieval)
> 
>
> Key: LUCENE-5145
> URL: https://issues.apache.org/jira/browse/LUCENE-5145
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Boaz Leskes
> Attachments: LUCENE-5145.patch
>
>
> Made acceptableOverheadRatio configurable
> Added bulk get to AbstractAppendingLongBuffer classes, for faster retrieval.
> Introduced a new variant, AppendingPackedLongBuffer, which solely relies on PackedInts as a backend. This new class is useful where people have non-negative numbers with a fairly uniform distribution over a fixed (limited) range. Ex. facets ordinals.
> To distinguish it from AppendingPackedLongBuffer, the delta-based AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer
> Fixed an issue with NullReader where it didn't respect its valueCount in bulk gets.

