From: HyukjinKwon
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Date: Thu, 12 Oct 2017 17:06:31 +0000 (UTC)

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19459#discussion_r144339301

    --- Diff: python/pyspark/sql/session.py ---
    @@ -510,9 +511,43 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
             except Exception:
                 has_pandas = False
             if has_pandas and isinstance(data, pandas.DataFrame):
    -            if schema is None:
    -                schema = [str(x) for x in data.columns]
    -            data = [r.tolist() for r in data.to_records(index=False)]
    +            if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
    +                    and len(data) > 0:
    +                from pyspark.serializers import ArrowSerializer
    +                from pyspark.sql.types import from_arrow_schema
    +                import pyarrow as pa
    +
    +                # Slice the DataFrame into batches
    +                split = -(-len(data) // self.sparkContext.defaultParallelism)  # round int up
    +                slices = (data[i:i + split] for i in xrange(0, len(data), split))
    --- End diff --

    How about `split` -> `size` (or `length`) and `i` -> `offset` (or `start`)?
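
For readers following the thread, here is a minimal standalone sketch of the slicing logic quoted in the diff, with the renames suggested above applied (`size` for the slice length, `offset` for the loop index). The helper name `slice_dataframe` and the parameter `num_partitions` (standing in for `self.sparkContext.defaultParallelism`) are hypothetical, and `range` replaces the Python 2 `xrange` used in the diff:

```python
import pandas as pd

def slice_dataframe(data, num_partitions):
    # Hypothetical helper illustrating the batching in the quoted diff.
    # Ceiling division: -(-a // b) rounds a / b up, so the last (possibly
    # shorter) slice still picks up the leftover rows.
    size = -(-len(data) // num_partitions)
    # Lazily yield positional row slices [offset, offset + size).
    return (data[offset:offset + size] for offset in range(0, len(data), size))

# 10 rows over 3 partitions -> slices of 4, 4, and 2 rows.
df = pd.DataFrame({"x": range(10)})
print([len(piece) for piece in slice_dataframe(df, 3)])  # [4, 4, 2]
```

The double negation `-(-a // b)` computes ceiling division with floor division alone, avoiding `math.ceil` and float conversion; it guarantees no rows are dropped when `len(data)` is not a multiple of the partition count.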