lucene-dev mailing list archives

From "Toke Eskildsen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen
Date Sun, 08 Jul 2012 20:07:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409056#comment-13409056 ]

Toke Eskildsen commented on LUCENE-4062:
----------------------------------------

One solution later and the mystery is still there: tricking the JVM into reading the old memory
value before writing the new one in Direct64 does increase its speed to the expected level
on the i7 server mars:
{code:java}
  // Deliberately non-final: if MASK were a compile-time constant, the JIT
  // could fold the AND to zero and eliminate the read of values[index].
  public static long MASK = 0L;

  public void set(final int index, final long value) {
    // The read of values[index] touches the memory location just before the
    // write; since MASK == 0, the stored result is still simply `value`.
    values[index] = values[index] & MASK | value;
  }
{code}
I am guessing that the read either triggers a pre-fetch of the relevant 4 KB page in main memory
or marks the cache line as hotter, thus delaying its eviction. Alas, I am not a hardware guy
and this is a bit beyond me.

Anyway, the solution is not usable as it slows execution on other machines. I can get access
to another i7 server of the same generation, but it is identical to mars so I doubt the results
would be different. Unless someone steps up with another current-generation i7 server to test
on, I think we should file this under pixies.

Also unresolved, but more relevant, is the question of auto-selecting the Mutable implementation.
Based on the data so far, the current auto-selector will make sub-optimal selections on
i7 machines. Without reliable architecture detection, another way of viewing the issue is
whether to optimize the selector towards older machines (padded) or newer ones (contiguous).

I've looked around for a way to determine the CPU and cache size, but so far it seems that
the only fairly reliable way is to determine the OS, then call some platform-specific native
code and parse the output. Besides the ugliness of this solution, I guess it is disqualified
by the extra permissions it would need from the security manager.
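As a concrete illustration of why that route is unattractive, here is a hedged sketch of such probing (names are assumptions, not an existing Lucene API; Linux-only, reading /proc/cpuinfo, where macOS or Windows would need sysctl, WMI or similar). This is exactly the kind of file access a security manager would have to grant:

```java
// Sketch of platform-specific CPU detection: check the OS, then parse a
// platform file/tool. Returns null whenever detection is not possible,
// e.g. on non-Linux platforms or when the security manager denies the read.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CpuProbe {
    /** Returns the CPU model name, or null when it cannot be determined. */
    public static String cpuModel() {
        String os = System.getProperty("os.name", "");
        if (!os.toLowerCase().contains("linux")) {
            return null; // no portable fallback in this sketch
        }
        try (BufferedReader in = new BufferedReader(new FileReader("/proc/cpuinfo"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("model name")) {
                    return line.substring(line.indexOf(':') + 1).trim();
                }
            }
        } catch (IOException | SecurityException e) {
            // file missing or read denied: fall through to "unknown"
        }
        return null;
    }
}
```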
                
> More fine-grained control over the packed integer implementation that is chosen
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-4062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4062
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/other
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>              Labels: performance
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch,
LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, Packed64SingleBlock.java,
Packed64Strategy.java, Packed64calc.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java,
PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedZero.java, measurements_te_graphs.pdf,
measurements_te_i7.txt, measurements_te_p4.txt, measurements_te_xeon.txt
>
>
> In order to save space, Lucene has two main PackedInts.Mutable implementations: one that
is very fast and is based on a byte/short/integer/long array (Direct*), and another one which
packs bits in a memory-efficient manner (Packed*).
> The packed implementation tends to be much slower than the direct one, which discourages
some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32,
the space loss is (32-21)/32 ≈ 34%.
> If you are willing to trade some space for speed, you could store 3 of these 21-bit integers
in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is
that you never need more than one block to read or write a value, so this can be significantly
faster than Packed32 and Packed64, which always read/write two blocks in order to avoid
costly branches.
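> A minimal sketch (illustrative only, not the actual Lucene class) of the scheme described above: three 21-bit values share one long, so a get or set touches exactly one block, and only the top bit of each block is wasted:

```java
// Three 21-bit values per 64-bit block: 3 * 21 = 63 bits used, 1 bit wasted
// per block, i.e. 1/3 bit of overhead per value. Every get/set reads or
// writes a single long, unlike the two-block packed implementations.
public class SingleBlock21 {
    private static final int BITS = 21;
    private static final int VALUES_PER_BLOCK = 3;
    private static final long MASK = (1L << BITS) - 1; // lowest 21 bits

    private final long[] blocks;

    public SingleBlock21(int valueCount) {
        blocks = new long[(valueCount + VALUES_PER_BLOCK - 1) / VALUES_PER_BLOCK];
    }

    public long get(int index) {
        int shift = (index % VALUES_PER_BLOCK) * BITS;
        return (blocks[index / VALUES_PER_BLOCK] >>> shift) & MASK;
    }

    public void set(int index, long value) {
        int block = index / VALUES_PER_BLOCK;
        int shift = (index % VALUES_PER_BLOCK) * BITS;
        // Clear the 21-bit slot, then OR in the new value.
        blocks[block] = (blocks[block] & ~(MASK << shift)) | (value << shift);
    }
}
```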
> I ran some tests, and for 10,000,000 21-bit values, this implementation takes less than
2% more space and has 44% faster writes and 30% faster reads. The 12-bit version (5 values
per block) shows the same performance improvement with a 6% memory overhead compared to the
packed implementation.
> In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount,
bitsPerValue, acceptableOverheadPerValue)}} method. This method selects the fastest implementation
that wastes fewer than {{acceptableOverheadPerValue}} bits per value. For example, if you
accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is
pretty reasonable, here are the implementations that would be selected:
>  * 1: Packed64SingleBlock1
>  * 2: Packed64SingleBlock2
>  * 3: Packed64SingleBlock3
>  * 4: Packed64SingleBlock4
>  * 5: Packed64SingleBlock5
>  * 6: Packed64SingleBlock6
>  * 7: Direct8
>  * 8: Direct8
>  * 9: Packed64SingleBlock9
>  * 10: Packed64SingleBlock10
>  * 11: Packed64SingleBlock12
>  * 12: Packed64SingleBlock12
>  * 13: Packed64
>  * 14: Direct16
>  * 15: Direct16
>  * 16: Direct16
>  * 17: Packed64
>  * 18: Packed64SingleBlock21
>  * 19: Packed64SingleBlock21
>  * 20: Packed64SingleBlock21
>  * 21: Packed64SingleBlock21
>  * 22: Packed64
>  * 23: Packed64
>  * 24: Packed64
>  * 25: Packed64
>  * 26: Packed64
>  * 27: Direct32
>  * 28: Direct32
>  * 29: Direct32
>  * 30: Direct32
>  * 31: Direct32
>  * 32: Direct32
>  * 33: Packed64
>  * 34: Packed64
>  * 35: Packed64
>  * 36: Packed64
>  * 37: Packed64
>  * 38: Packed64
>  * 39: Packed64
>  * 40: Packed64
>  * 41: Packed64
>  * 42: Packed64
>  * 43: Packed64
>  * 44: Packed64
>  * 45: Packed64
>  * 46: Packed64
>  * 47: Packed64
>  * 48: Packed64
>  * 49: Packed64
>  * 50: Packed64
>  * 51: Packed64
>  * 52: Packed64
>  * 53: Packed64
>  * 54: Direct64
>  * 55: Direct64
>  * 56: Direct64
>  * 57: Direct64
>  * 58: Direct64
>  * 59: Direct64
>  * 60: Direct64
>  * 61: Direct64
>  * 62: Direct64
> Under 32 bits per value, only 13, 17 and 22-26 bits per value would still choose the
slower Packed64 implementation. Allowing a 50% overhead would prevent the packed implementation
from being selected under 32 bits per value. Allowing an overhead of 32 bits per value would
make sure that a Direct* implementation is always selected.
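> The selection rule above can be sketched as follows (a simplification with assumed names, not the actual Lucene code; among other things it ignores that bitsPerValue is rounded up to the nearest supported Packed64SingleBlock size, e.g. 11 to 12 or 18 to 21 in the table above):

```java
// Pick the fastest implementation whose wasted bits per value stay within
// the acceptable overhead: Direct* first, then single-block, then Packed64.
public class ImplementationSelector {
    public static String selectImplementation(int bitsPerValue, float acceptableOverheadPerValue) {
        // Direct* is fastest; it wastes (storageBits - bitsPerValue) bits per value.
        for (int storageBits : new int[] {8, 16, 32, 64}) {
            if (storageBits >= bitsPerValue
                    && storageBits - bitsPerValue <= acceptableOverheadPerValue) {
                return "Direct" + storageBits;
            }
        }
        // Packed64SingleBlockN: the (64 % N) unused bits of each long are
        // spread over the (64 / N) values sharing that block.
        int valuesPerBlock = 64 / bitsPerValue;
        float singleBlockOverhead = (64 % bitsPerValue) / (float) valuesPerBlock;
        if (singleBlockOverhead <= acceptableOverheadPerValue) {
            return "Packed64SingleBlock" + bitsPerValue;
        }
        // Fall back to the fully packed, slower two-block implementation.
        return "Packed64";
    }
}
```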
> Next steps would be to:
>  * make Lucene components use this {{getMutable}} method and let users decide which trade-off
best suits them,
>  * write a Packed32SingleBlock implementation if necessary (I didn't do it because I
have no 32-bit computer to test the performance improvements on).
> I think this would allow more fine-grained control over the speed/space trade-off. What
do you think?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

