From: GitBox
To: notifications@libcloud.apache.org
Reply-To: dev@libcloud.apache.org
Subject: [GitHub] [libcloud] pquentin opened a new pull request #1353: Reuse TCP connections when uploading files
Message-ID: <157018982455.8493.3125406669558884939.gitbox@gitbox.apache.org>
Date: Fri, 04 Oct 2019 11:50:24 -0000

pquentin opened a new pull request #1353: Reuse TCP connections when uploading files
URL: https://github.com/apache/libcloud/pull/1353

## Reuse TCP connections when uploading files

### Description

It's easy to break connection reuse when using the requests API: just pass `stream=True` and never read the response. The connection used to make the request will then never be reused, and will be dropped when urllib3's connection pool is full.

It turns out that uploading objects using the S3 API goes through `prepared_request`, which incorrectly sets `stream` to the value of `raw`, which is `True` in our case.
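
The effect described above can be reproduced locally. The following is a minimal sketch, not libcloud code: `count_connections` and the throwaway local HTTP server are hypothetical helpers made up for illustration, and it assumes the `requests` package is installed. It counts how many TCP connections the server accepts for a few small uploads sent through a single `requests.Session`.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


def count_connections(stream, uploads=3):
    """Return how many TCP connections a local server accepts for `uploads` PUTs."""
    accepted = []

    class Handler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"  # allow keep-alive on the server side

        def setup(self):
            # setup() runs once per accepted TCP connection, not once per request.
            accepted.append(self.client_address)
            super().setup()

        def do_PUT(self):
            # Consume the uploaded bytes, then answer with a small body.
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            body = b"stored"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the demo quiet

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/object" % server.server_address[1]

    session = requests.Session()
    session.trust_env = False  # ignore proxy settings from the environment
    responses = []
    try:
        for _ in range(uploads):
            # With stream=True the response body is left unread, so the
            # connection is not handed back to the pool before the next PUT.
            responses.append(session.put(url, data=b"tiny", stream=stream))
    finally:
        for response in responses:
            response.close()
        session.close()
        server.shutdown()
        server.server_close()
    return len(accepted)


if __name__ == "__main__":
    print("stream=True: ", count_connections(stream=True))
    print("stream=False:", count_connections(stream=False))
```

With `stream=True` the count should match the number of uploads, while with `stream=False` it should stay at one, because requests reads the body and returns the connection to the pool before the next upload starts.
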
Since we never read the response data, the connections are never reused, and each upload requires its own connection. This is particularly wasteful when uploading many small objects, which can easily happen with JSON or Parquet files generated by Apache Spark. In that case, setting up the connection takes significant time compared to uploading a few bytes.

Setting `stream=stream` in the `prepared_request` method matches the code in the `request` method and fixes the bug.

### Status

- work in progress

### Checklist (tick everything that applies)

- [x] [Code linting](http://libcloud.readthedocs.org/en/latest/development.html#code-style-guide) (required, can be done after the PR checks)
- [x] Documentation
- [x] [Tests](http://libcloud.readthedocs.org/en/latest/testing.html)
- [x] [ICLA](http://libcloud.readthedocs.org/en/latest/development.html#contributing-bigger-changes) (required for bigger changes)

cc @Kami @tonybaloney

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services