Mailing-List: contact dev-help@hc.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "HttpComponents Project" <dev@hc.apache.org>
Date: Sat, 15 Feb 2014 16:23:20 +0000 (UTC)
From: "Sebastiano Vigna (JIRA)" <jira@apache.org>
To: dev@hc.apache.org
Message-ID: <JIRA.12695366.1392481326412.49184.1392481400007@arcas>
In-Reply-To: <JIRA.12695366.1392481326412@arcas>
References: <JIRA.12695366.1392481326412@arcas>
Subject: [jira] [Created] (HTTPCLIENT-1461) GZIP decoding is very slow
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Sebastiano Vigna created HTTPCLIENT-1461:
--------------------------------------------

             Summary: GZIP decoding is very slow
                 Key: HTTPCLIENT-1461
                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1461
             Project: HttpComponents HttpClient
          Issue Type: Bug
          Components: HttpClient
    Affects Versions: 4.3.2
            Reporter: Sebastiano Vigna
            Priority: Critical


LazyDecompressingInputStream was introduced in 4.3.1. However, it subclasses InputStream without overriding the multi-byte read(byte[],int,int) method, so the inherited implementation reads one byte at a time.

This is a trace showing what happens:

       java.util.zip.Inflater.inflateBytes(Inflater.java:Unknown line)
       java.util.zip.Inflater.inflate(Inflater.java:259)
       java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152)
       java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116)
       java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
       org.apache.http.client.entity.LazyDecompressingInputStream.read(LazyDecompressingInputStream.java:56)
       java.io.InputStream.read(InputStream.java:179)
       it.unimi.di.law.warc.util.InspectableCachedHttpEntity.copyContent(InspectableCachedHttpEntity.java:67)

copyContent() tries to fill a buffer with read(byte[],int,int), but since LazyDecompressingInputStream doesn't override that method, the call lands in the implementation inherited from InputStream, which reads byte by byte. For each byte it calls the one-byte read() of LazyDecompressingInputStream, which invokes the one-byte read() of InflaterInputStream, which does a multi-byte read of length one on GZIPInputStream, which makes a similar call on InflaterInputStream, which finally inflates that single byte through the native inflateBytes() method.

Thus, for each byte there is a native-method call. The result is a 10x-50x increase in CPU usage, which turns into a 10x-50x decrease in speed if, as in our case, you have 7000 threads downloading in parallel.
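
The per-byte fallback can be reproduced in isolation. The following is a minimal sketch, not HttpClient code: ByteByByteDemo and CountingStream are hypothetical names, and the stream only counts how often its one-byte read() is invoked when a caller requests a single bulk read.

```java
import java.io.IOException;
import java.io.InputStream;

class ByteByByteDemo {

    // Hypothetical stream that, like LazyDecompressingInputStream in 4.3.2,
    // implements only the one-byte read() and inherits the bulk read.
    static class CountingStream extends InputStream {
        int singleByteReads = 0;
        private int remaining;

        CountingStream(int size) {
            this.remaining = size;
        }

        @Override
        public int read() throws IOException {
            singleByteReads++;
            if (remaining == 0) {
                return -1; // end of stream
            }
            remaining--;
            return 0;
        }
    }

    // Performs one bulk read and returns how many one-byte reads it caused.
    static int singleByteReadsForOneBulkRead(int bufferSize) throws IOException {
        CountingStream in = new CountingStream(bufferSize);
        byte[] buffer = new byte[bufferSize];
        // Resolves to the inherited InputStream.read(byte[], int, int).
        in.read(buffer, 0, buffer.length);
        return in.singleByteReads;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("one-byte reads for a 4096-byte bulk read: "
                + singleByteReadsForOneBulkRead(4096));
    }
}
```

In the real stack each of those one-byte reads travels down to the native inflater, which is where the CPU time goes.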

Overriding read(byte[],int,int) in LazyDecompressingInputStream will solve the problem:

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        initWrapper();
        return wrapperStream.read(b, off, len);
    }
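
For context, here is a self-contained sketch of how the override sits in a lazily initialized wrapper. It is an illustration under assumptions, not the HttpClient source: LazyGzipDemo, LazyGzipInputStream, gzip() and readAll() are hypothetical names, and the wrapper is hard-wired to GZIPInputStream for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class LazyGzipDemo {

    // Hypothetical stand-in for LazyDecompressingInputStream: the GZIP stream
    // is created on first use, and both read variants delegate to it.
    static class LazyGzipInputStream extends InputStream {
        private final InputStream compressed;
        private InputStream wrapperStream;

        LazyGzipInputStream(InputStream compressed) {
            this.compressed = compressed;
        }

        private void initWrapper() throws IOException {
            if (wrapperStream == null) {
                wrapperStream = new GZIPInputStream(compressed);
            }
        }

        @Override
        public int read() throws IOException {
            initWrapper();
            return wrapperStream.read();
        }

        // The override proposed above: bulk reads go straight to the
        // decompressing stream instead of looping over read().
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            initWrapper();
            return wrapperStream.read(b, off, len);
        }

        @Override
        public void close() throws IOException {
            if (wrapperStream != null) {
                wrapperStream.close();
            } else {
                compressed.close();
            }
        }
    }

    // Compresses data in memory so the demo needs no network or files.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    // Drains a stream using bulk reads, as copyContent() would like to do.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "some compressible text ".repeat(1000).getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new LazyGzipInputStream(new ByteArrayInputStream(gzip(original)))) {
            System.out.println("round trip ok: " + Arrays.equals(original, readAll(in)));
        }
    }
}
```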

