lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Elschot (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1410) PFOR implementation
Date Fri, 03 Oct 2008 20:23:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636730#action_12636730
] 

Paul Elschot commented on LUCENE-1410:
--------------------------------------

The number of bits is not really informative, it would be better to have a distribution of
the result of getNumFrameBits(). Then we can see for which nrs of bits loop unrolling might
be tried.

As for decompression speed, please remember that the patching code (that decodes the exceptions
into the output) has not yet been optimized at all.

The lucene .frq file contains the docids as deltas and the frequencies but with a special
encoding in case the frequency is one. I'd rather try compressing the real delta docids and
the real frequencies than the encoded versions.

The .prx file should be useable as it is, and from the results reported in the articles I
would expect a real performance advantage for PFor for that.
Michael, could you post a distribution of the number of frame bits for the .prx file? I'd
like to know a reasonable maximum for unrolling the corresponding loops.

>: maybe I'm doing something wrong
I don't think so, the code is still young. Try running TestPFor and have a look at the output
near the end for the case of 6 frame bits. That should show the unrolled decoding speed for
the case without exceptions. Then compare that to the cases with lower and higher nrs of frame
bits.

>: Stepping back a bit: I wonder what %tg of the time in a "typical"
Lucene search is spent decoding vInts? That would bound how much
improvement we could ever expect from this optimization during
searching.

A: there is also the influence of the reduction of data to be fetched (via various caches)
from the index. As reported in the articles, PFor strikes a nice optimum between decompression
speed and fetching speed from (fast) disks.

>: I was thinking local CPU's native asm.
A: I'd try a C version first. Iirc gcc has a decent optimizer for bit ops, but it's been a
while for me that I used C.

For the record, to show the variation in decompression speeds for different numbers of frame
bits without exceptions, here is my current output from TestPFor:
{noformat}
test901PerfFor1Decompress starting
Compressed 128 bytes into 8, ratio 0.0625
test901PerfFor1Decompress 0 numFrameBits 1 decompressed 104857600 in 1021 msecs, 102700 ints/msec,
(25 iters).
test901PerfFor1Decompress 1 numFrameBits 1 decompressed 171966464 in 1020 msecs, 168594 ints/msec,
(41 iters).
test901PerfFor1Decompress 2 numFrameBits 1 decompressed 171966464 in 1012 msecs, 169927 ints/msec,
(41 iters).

test902PerfFor2Decompress starting
Compressed 128 bytes into 12, ratio 0.09375
test902PerfFor2Decompress 0 numFrameBits 2 decompressed 130023424 in 1017 msecs, 127849 ints/msec,
(31 iters).
test902PerfFor2Decompress 1 numFrameBits 2 decompressed 155189248 in 1000 msecs, 155189 ints/msec,
(37 iters).
test902PerfFor2Decompress 2 numFrameBits 2 decompressed 159383552 in 1022 msecs, 155952 ints/msec,
(38 iters).

test903PerfFor3Decompress starting
Compressed 128 bytes into 16, ratio 0.125
test903PerfFor3Decompress 0 numFrameBits 3 decompressed 188743680 in 1016 msecs, 185771 ints/msec,
(45 iters).
test903PerfFor3Decompress 1 numFrameBits 3 decompressed 205520896 in 1018 msecs, 201886 ints/msec,
(49 iters).
test903PerfFor3Decompress 2 numFrameBits 3 decompressed 201326592 in 1026 msecs, 196224 ints/msec,
(48 iters).

test904PerfFor4Decompress starting
Compressed 128 bytes into 20, ratio 0.15625
test904PerfFor4Decompress 0 numFrameBits 4 decompressed 146800640 in 1014 msecs, 144773 ints/msec,
(35 iters).
test904PerfFor4Decompress 1 numFrameBits 4 decompressed 159383552 in 1007 msecs, 158275 ints/msec,
(38 iters).
test904PerfFor4Decompress 2 numFrameBits 4 decompressed 159383552 in 1011 msecs, 157649 ints/msec,
(38 iters).

test905PerfFor5Decompress starting
Compressed 128 bytes into 24, ratio 0.1875
test905PerfFor5Decompress 0 numFrameBits 5 decompressed 159383552 in 1010 msecs, 157805 ints/msec,
(38 iters).
test905PerfFor5Decompress 1 numFrameBits 5 decompressed 176160768 in 1009 msecs, 174589 ints/msec,
(42 iters).
test905PerfFor5Decompress 2 numFrameBits 5 decompressed 176160768 in 1004 msecs, 175458 ints/msec,
(42 iters).

test906PerfFor6Decompress starting
Compressed 128 bytes into 28, ratio 0.21875
test906PerfFor6Decompress 0 numFrameBits 6 decompressed 117440512 in 1001 msecs, 117323 ints/msec,
(28 iters).
test906PerfFor6Decompress 1 numFrameBits 6 decompressed 125829120 in 1002 msecs, 125577 ints/msec,
(30 iters).
test906PerfFor6Decompress 2 numFrameBits 6 decompressed 125829120 in 1004 msecs, 125327 ints/msec,
(30 iters).

test907PerfFor7Decompress starting
Compressed 128 bytes into 32, ratio 0.25
test907PerfFor7Decompress 0 numFrameBits 7 decompressed 125829120 in 1016 msecs, 123847 ints/msec,
(30 iters).
test907PerfFor7Decompress 1 numFrameBits 7 decompressed 150994944 in 1015 msecs, 148763 ints/msec,
(36 iters).
test907PerfFor7Decompress 2 numFrameBits 7 decompressed 150994944 in 1012 msecs, 149204 ints/msec,
(36 iters).

test908PerfFor8Decompress starting
Compressed 124 bytes into 35, ratio 0.28225806
test908PerfFor8Decompress 0 numFrameBits 8 decompressed 85327872 in 1015 msecs, 84066 ints/msec,
(21 iters).
test908PerfFor8Decompress 1 numFrameBits 8 decompressed 121896960 in 1021 msecs, 119389 ints/msec,
(30 iters).
test908PerfFor8Decompress 2 numFrameBits 8 decompressed 121896960 in 1022 msecs, 119272 ints/msec,
(30 iters).

test909PerfFor9Decompress starting
Compressed 124 bytes into 39, ratio 0.31451613
test909PerfFor9Decompress 0 numFrameBits 9 decompressed 52822016 in 1016 msecs, 51990 ints/msec,
(13 iters).
test909PerfFor9Decompress 1 numFrameBits 9 decompressed 56885248 in 1028 msecs, 55335 ints/msec,
(14 iters).
test909PerfFor9Decompress 2 numFrameBits 9 decompressed 56885248 in 1029 msecs, 55282 ints/msec,
(14 iters).
{noformat}
The loop for 9 bits is not unrolled, unrolling makes it a tiny little bit (55->54 Mint/sec)
slower.

> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-1410b.patch, TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message