lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: svn commit: r1141510 - /lucene/dev/trunk/modules/facet/src/java/org/apache/lucene/util/UnsafeByteArrayOutputStream.java
Date Fri, 01 Jul 2011 06:33:39 GMT
Hi,

I don't understand the whole discussion here, so please compare these two implementations
and tell me which one is faster. Please don't hurt me, if you don't want to see src.jar code
from OpenJDK Java6 - just delete this mail if you don’t want to (the code here is licensed
under GPL):

Arrays.java:
    /**
     * Copies the specified array, truncating or padding with zeros (if necessary)
     * so the copy has the specified length.  For all indices that are
     * valid in both the original array and the copy, the two arrays will
     * contain identical values.  For any indices that are valid in the
     * copy but not the original, the copy will contain <tt>(byte)0</tt>.
     * Such indices will exist if and only if the specified length
     * is greater than that of the original array.
     *
     * @param original the array to be copied
     * @param newLength the length of the copy to be returned
     * @return a copy of the original array, truncated or padded with zeros
     *     to obtain the specified length
     * @throws NegativeArraySizeException if <tt>newLength</tt> is negative
     * @throws NullPointerException if <tt>original</tt> is null
     * @since 1.6
     */
    public static byte[] copyOf(byte[] original, int newLength) {
        byte[] copy = new byte[newLength];
        System.arraycopy(original, 0, copy, 0,
                         Math.min(original.length, newLength));
        return copy;
    }


This is our implementation, simon replaced and Robert reverted (UnsafeByteArrayOutputStream):

  private void grow(int newLength) {
    // It actually should be: (Java 1.7, when its intrinsic on all machines)
    // buffer = Arrays.copyOf(buffer, newLength);
    byte[] newBuffer = new byte[newLength];
    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
    buffer = newBuffer;
  }

So please look at the code, where is a difference that could slow down, except the Math.min()
call which is an intrinsic in almost every JDK on earth?

The problem we are talking about here is only about the generic Object[] copyOf method and
also affects e.g. *all* Collection.toArray() methods - they all use this code, so whenever
you use ArrayList.toArray() or similar, the slow code is executed. This is why we replaced
Collections.sort() by CollectionUtil.sort, that does no array copy. Simon & me were not
willing to replace the reallocations in FST code (Mike you remember, we reverted that on your
GIT repo when we did perf tests) and other parts in Lucene (there are only few of them). The
idea was only to replace primitive type code to make it easier readable. And with later JDK
code it could even get faster (not slower), if Oracle starts to add intrinsics for those new
methods (and that’s Dawid and mine reason to change to copyTo for primitive types). In general,
if you use Java SDK methods, that are as fast as ours, they always have a chance to get faster
in later JDKs. So we should always prefer Java SDK methods, unless they are slower because
their default impl is too generic or has too much safety checks or uses reflection.


To come back to UnsafeByteArrayOutputStream:

I would change the whole code, as I don’t like the allocation strategy in it (it's exponential,
on every grow it doubles its size). We should change that to use ArrayUtils.grow() and ArrayUtils.oversize(),
to have a similar allocation strategy like in trunk. Then we can discuss about this problem
again when Simon & me wants to change ArrayUtils.grow methods to use Arrays.copyOf...
*g* [just joking, I will never ask again, because this discussion here is endless and does
not bring us forward].

The other thing I don’t like in the new faceting module is duplication of vint code. Why
don’t we change it to use DataInput/DataOutput and use Dawid's new In/OutStream wrapper
for DataOutput everywhere. This would be much more streamlined with all the code we currently
have. Then we can encode the payloads (or later docvalues) using the new UnsafeByteArrayOutputStream,
wrapped with a OutputStreamDataOutput wrapper? Or maybe add a ByteArrayDataOutput class.

Uwe
(getting crazy)

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
> Sent: Friday, July 01, 2011 7:47 AM
> To: Michael McCandless
> Cc: dev@lucene.apache.org; Dawid Weiss
> Subject: Re: svn commit: r1141510 -
> /lucene/dev/trunk/modules/facet/src/java/org/apache/lucene/util/Unsafe
> ByteArrayOutputStream.java
> 
> On Fri, Jul 1, 2011 at 12:19 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
> > On Thu, Jun 30, 2011 at 4:45 PM, Simon Willnauer
> > <simon.willnauer@googlemail.com> wrote:
> >> On Thu, Jun 30, 2011 at 8:50 PM, Dawid Weiss
> >> <dawid.weiss@cs.put.poznan.pl> wrote:
> >>>> I don't seen any evidence that this is any slower though.
> >>>
> >>> You need to run with -client (if the machine is a beast this is
> >>> tricky because x64 will pick -server regardless of the command-line
> >>> setting) and you need to be copying generic arrays. I think this can
> >>> be shown
> >>> -- a caliper benchmark would be perfect to demonstrate this in
> >>> isolation; I may write one if I find a spare moment.
> >>
> >> this is what I want to see. I don't want to discuss based on some bug
> >> reported for a non-primitive version of copyOf thats all.
> >> its pointless to discuss if there is no evidence which I don't see. I
> >> am happy with arraycopy I would just have appreciated a discussion
> >> before backing the change out.
> >
> > I think the burden of proof here is on Arrays.copyOf.
> >
> > Ie, until we can prove (through benchmarking in different envs) that
> > it can be trusted, we should just stick with System.arraycopy to
> > reduce the risk.
> 
> Mike I think we should do that but the real issue here is what if somebody
> comes up with any arbitrary method in the future claiming its slow we back
> out and use the "we think safe way" what if it is actually the other way
> around and copyOf is optimized by new VMs and the copyarray is slightly
> slower. If robert would not have reverted this commit we would have not
> discussed that her at all. I just want to prevent the "we should not do this" it
> might be a problem in the big picture while the microbenchmark doesn't
> show a difference. At some point we have to rely on the JVM.
> Even if we benchmark on all OS with various JVMs we can't prevent jvm bug
> based perf hits. While there was no bug reported for primitives here we
> don't have to be afraid of it either. I don't think its saver to use arraycopy at
> all its just a level of indirection safer but makes development more of a pain
> IMO.
> 
> Just my $0.05
> >
> > Mike
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message