From: cloud-fan
To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #19577: [SPARK-22355][SQL] Dataset.collect is not threads...
Date: Wed, 25 Oct 2017 23:09:47 +0000 (UTC)

GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19577

[SPARK-22355][SQL] Dataset.collect is not threadsafe

## What changes were proposed in this pull request?

Users may create a `Dataset` and call `collect` on it from many threads at the same time.
Currently `Dataset#collect` just calls `encoder.fromRow` to convert Spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row, and concurrent calls on the same `Dataset` share that row.

This PR fixes the problem by creating a new projection on each call to `Dataset#collect`, so that there is one re-usable row per method call instead of one per `Dataset`.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark encoder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19577.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19577

----
commit cecea8cdb36f3c5e65abd08643bd0d181d72008d
Author: Wenchen Fan
Date:   2017-10-25T23:02:27Z

    Dataset.collect is not threadsafe

----
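
The hazard described in this pull request can be illustrated with a small, single-threaded model. This is only a sketch under stated assumptions: `ReusableRowSketch`, `Projection`, and `valueSeenByA` are hypothetical names, not Spark classes, and the interleaving of two concurrent `collect` calls is simulated by hand so that the outcome is deterministic. It shows why a projection shared per dataset lets one caller overwrite another caller's row, and why one projection per call avoids that.

```java
/** Simplified model of the hazard; none of these classes are Spark APIs. */
public class ReusableRowSketch {

    /** Hypothetical stand-in for a projection that writes into one re-usable row. */
    static final class Projection {
        private final int[] reusableRow = new int[1];

        int[] apply(int input) {
            reusableRow[0] = input * 10; // overwrite the shared buffer in place
            return reusableRow;          // every call returns the same array
        }
    }

    /**
     * Simulates the interleaving "caller A converts, caller B converts,
     * caller A reads its row" and returns the value that A ends up reading.
     */
    static int valueSeenByA(Projection forA, Projection forB) {
        int[] rowA = forA.apply(1); // A converts input 1 and expects to read 10
        forB.apply(2);              // B converts input 2 before A has read its row
        return rowA[0];
    }

    /** One projection per dataset: both callers share a single re-usable row. */
    static int sharedProjection() {
        Projection perDataset = new Projection();
        return valueSeenByA(perDataset, perDataset);
    }

    /** One projection per call: each caller owns its re-usable row. */
    static int projectionPerCall() {
        return valueSeenByA(new Projection(), new Projection());
    }

    public static void main(String[] args) {
        System.out.println("shared projection:   A reads " + sharedProjection());
        System.out.println("projection per call: A reads " + projectionPerCall());
    }
}
```

With the shared projection, caller A reads 20 (B's value) instead of 10. With a projection per call, A reads 10. The re-usable buffer is kept in both cases; only its owner changes from the dataset to the individual call, which is the same trade-off the description above makes.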