From: HyukjinKwon
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Date: Thu, 12 Oct 2017 17:06:31 +0000 (UTC)

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19459#discussion_r144339301

    --- Diff: python/pyspark/sql/session.py ---
    @@ -510,9 +511,43 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
             except Exception:
                 has_pandas = False
             if has_pandas and isinstance(data, pandas.DataFrame):
    -            if schema is None:
    -                schema = [str(x) for x in data.columns]
    -            data = [r.tolist() for r in data.to_records(index=False)]
    +            if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
    +                    and len(data) > 0:
    +                from pyspark.serializers import ArrowSerializer
    +                from pyspark.sql.types import from_arrow_schema
    +                import pyarrow as pa
    +
    +                # Slice the DataFrame into batches
    +                split = -(-len(data) // self.sparkContext.defaultParallelism)  # round int up
    +                slices = (data[i:i + split] for i in xrange(0, len(data), split))
    --- End diff --

    How about `split` -> `size` (or `length`) and `i` -> `offset` (or `start`)?
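
For readers following the thread, here is a minimal standalone sketch of the slicing logic quoted in the diff, with the renames suggested above applied (`size` for the slice length, `offset` for the loop index). The helper name `slice_dataframe` and the parameter `num_partitions` (standing in for `self.sparkContext.defaultParallelism`) are hypothetical, and `range` replaces the Python 2 `xrange` used in the diff:

```python
import pandas as pd

def slice_dataframe(data, num_partitions):
    # Hypothetical helper illustrating the batching in the quoted diff.
    # Ceiling division: -(-a // b) rounds a / b up, so the last (possibly
    # shorter) slice still picks up the leftover rows.
    size = -(-len(data) // num_partitions)
    # Lazily yield positional row slices [offset, offset + size).
    return (data[offset:offset + size] for offset in range(0, len(data), size))

# 10 rows over 3 partitions -> slices of 4, 4, and 2 rows.
df = pd.DataFrame({"x": range(10)})
print([len(piece) for piece in slice_dataframe(df, 3)])  # [4, 4, 2]
```

The double negation `-(-a // b)` computes ceiling division with floor division alone, avoiding `math.ceil` and float conversion; it guarantees no rows are dropped when `len(data)` is not a multiple of the partition count.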