arrow-commits mailing list archives

From w...@apache.org
Subject [arrow] branch master updated: ARROW-1455 [Python] Add Dockerfile for validating Dask integration
Date Wed, 01 Nov 2017 02:54:14 GMT
This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 142e6ee  ARROW-1455 [Python] Add Dockerfile for validating Dask integration
142e6ee is described below

commit 142e6ee69bd6a4dc316d00d9efd6d86d119df075
Author: Heimir Sverrisson <heimir.sverrisson@gmail.com>
AuthorDate: Tue Oct 31 22:54:09 2017 -0400

    ARROW-1455 [Python] Add Dockerfile for validating Dask integration
    
    A Docker container is created with all the dependencies needed to pull
    down the Dask code from GitHub and install it locally, together with
    Arrow, to run an integration test.
    
    Author: Heimir Sverrisson <heimir.sverrisson@gmail.com>
    
    Closes #1249 from heimir-sverrisson/hs/dockerize_dask and squashes the following commits:
    
    d146185b [Heimir Sverrisson] ARROW-1455 [Python] Add Dockerfile for validating Dask integration
---
 dev/{docker-compose.yml => dask_integration.sh}    | 19 ++---
 dev/dask_integration/Dockerfile                    | 88 ++++++++++++++++++++++
 dev/dask_integration/dask_integration.sh           | 49 ++++++++++++
 dev/docker-compose.yml                             |  5 ++
 dev/run_docker_compose.sh                          |  2 +-
 python/testing/README.md                           | 24 +++++-
 python/testing/dask_tests/test_dask_integration.py | 51 +++++++++++++
 7 files changed, 222 insertions(+), 16 deletions(-)
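
For context, with Docker and docker-compose available on the host and the
repository checked out as `arrow`, the Dockerized test added here is driven
through the wrapper script introduced below; a minimal sketch of starting it
from the `arrow` root looks roughly like this:

```shell
# Kick off the Dockerized Dask integration test using the wrapper script
# added in this commit (assumes Docker and docker-compose are installed).
bash dev/dask_integration.sh
```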

diff --git a/dev/docker-compose.yml b/dev/dask_integration.sh
old mode 100644
new mode 100755
similarity index 77%
copy from dev/docker-compose.yml
copy to dev/dask_integration.sh
index 7bd2cd4..d344328
--- a/dev/docker-compose.yml
+++ b/dev/dask_integration.sh
@@ -1,3 +1,5 @@
+#!/usr/bin/env bash
+#
 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with
 # this work for additional information regarding copyright ownership.
@@ -14,17 +16,6 @@
 # limitations under the License.
 #
 
-version: '3'
-services:
-  gen_apidocs:
-    build: 
-      context: gen_apidocs
-    volumes:
-     - ../..:/apache-arrow
-  run_site:
-    build:
-      context: run_site
-    ports:
-    - "4000:4000"
-    volumes:
-     - ../..:/apache-arrow
+# Pass the service name to run_docker_compose.sh,
+# which validates the environment and runs the service.
+exec "$(dirname ${BASH_SOURCE})"/run_docker_compose.sh dask_integration
diff --git a/dev/dask_integration/Dockerfile b/dev/dask_integration/Dockerfile
new file mode 100644
index 0000000..f72ef8c
--- /dev/null
+++ b/dev/dask_integration/Dockerfile
@@ -0,0 +1,88 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM ubuntu:14.04
+ADD . /apache-arrow
+WORKDIR /apache-arrow
+# Basic OS utilities
+RUN apt-get update && apt-get install -y \
+        wget \
+        git \
+        gcc \
+        g++
+# This will install conda in /home/ubuntu/miniconda
+RUN wget -O /tmp/miniconda.sh \
+    https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
+    bash /tmp/miniconda.sh -b -p /home/ubuntu/miniconda && \
+    rm /tmp/miniconda.sh
+# Create Conda environment
+ENV PATH="/home/ubuntu/miniconda/bin:${PATH}"
+RUN conda create -y -q -n test-environment \
+    python=3.6
+# Install dependencies
+RUN conda install -c conda-forge \
+    numpy \
+    pandas \
+    bcolz \
+    blosc \
+    bokeh \
+    boto3 \
+    chest \
+    cloudpickle \
+    coverage \
+    cytoolz \
+    distributed \
+    graphviz \
+    h5py \
+    ipython \
+    partd \
+    psutil \
+    "pytest<=3.1.1" \
+    scikit-image \
+    scikit-learn \
+    scipy \
+    sqlalchemy \
+    toolz
+# Install pytables from the defaults channel for now
+RUN conda install pytables
+
+RUN pip install -q git+https://github.com/dask/partd --upgrade --no-deps
+RUN pip install -q git+https://github.com/dask/zict --upgrade --no-deps
+RUN pip install -q git+https://github.com/dask/distributed --upgrade --no-deps
+RUN pip install -q git+https://github.com/mrocklin/sparse --upgrade --no-deps
+RUN pip install -q git+https://github.com/dask/s3fs --upgrade --no-deps
+
+RUN conda install -q -c conda-forge numba cython
+RUN pip install -q git+https://github.com/dask/fastparquet
+
+RUN pip install -q \
+    cachey \
+    graphviz \
+    moto \
+    pyarrow \
+    --upgrade --no-deps
+
+RUN pip install -q \
+    cityhash \
+    flake8 \
+    mmh3 \
+    pandas_datareader \
+    pytest-xdist \
+    xxhash \
+    pycodestyle
+
+CMD arrow/dev/dask_integration/dask_integration.sh
+
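
For interactive debugging, the image defined above can also be built and
entered through the `dask_integration` compose service added further below;
a hypothetical workflow, assuming an `arrow` checkout:

```shell
# Build the image via the dask_integration service from dev/docker-compose.yml,
# then start a shell in the container instead of the default CMD.
docker-compose -f dev/docker-compose.yml build dask_integration
docker-compose -f dev/docker-compose.yml run --rm dask_integration /bin/bash
```
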
diff --git a/dev/dask_integration/dask_integration.sh b/dev/dask_integration/dask_integration.sh
new file mode 100755
index 0000000..f5a24e4
--- /dev/null
+++ b/dev/dask_integration/dask_integration.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Set up environment and working directory
+cd /apache-arrow
+
+export ARROW_BUILD_TYPE=release
+export ARROW_HOME=$(pwd)/dist
+export PARQUET_HOME=$(pwd)/dist
+CONDA_BASE=/home/ubuntu/miniconda
+export LD_LIBRARY_PATH=$(pwd)/dist/lib:${CONDA_BASE}/lib:${LD_LIBRARY_PATH}
+
+# Allow for --user Python installation inside Docker
+export HOME=$(pwd)
+
+# Clean up and get the dask master branch from GitHub
+rm -rf dask .local
+export GIT_COMMITTER_NAME="Nobody"
+export GIT_COMMITTER_EMAIL="nobody@nowhere.com"
+git clone https://github.com/dask/dask.git
+pushd dask
+pip install --user -e .[complete]
+# Sanity-check the installed Dask dataframe code by running its tests
+py.test dask/dataframe/tests/test_dataframe.py
+popd
+
+# Run the integration test
+pushd arrow/python/testing
+py.test dask_tests
+popd
+
+pushd dask/dask/dataframe/io
+py.test tests/test_parquet.py
+popd
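
When iterating on the Arrow side only, the relevant step of the script above
can be re-run inside the container by itself; a sketch, assuming the
environment the script sets up:

```shell
# Re-run just the Arrow/Dask integration test with the paths used above.
cd /apache-arrow/arrow/python/testing
py.test dask_tests
```
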
diff --git a/dev/docker-compose.yml b/dev/docker-compose.yml
index 7bd2cd4..4b90148 100644
--- a/dev/docker-compose.yml
+++ b/dev/docker-compose.yml
@@ -28,3 +28,8 @@ services:
     - "4000:4000"
     volumes:
      - ../..:/apache-arrow
+  dask_integration:
+    build: 
+      context: dask_integration
+    volumes:
+     - ../..:/apache-arrow
diff --git a/dev/run_docker_compose.sh b/dev/run_docker_compose.sh
index f46879e..681a3a7 100755
--- a/dev/run_docker_compose.sh
+++ b/dev/run_docker_compose.sh
@@ -37,4 +37,4 @@ fi
 
 GID=$(id -g ${USERNAME})
 docker-compose -f arrow/dev/docker-compose.yml run \
-               -u "${UID}:${GID}" "${1}"
+               --rm -u "${UID}:${GID}" "${1}"
diff --git a/python/testing/README.md b/python/testing/README.md
index 07970a2..0ebeec4 100644
--- a/python/testing/README.md
+++ b/python/testing/README.md
@@ -23,4 +23,26 @@
 
 ```shell
 ./test_hdfs.sh
-```
\ No newline at end of file
+```
+
+## Testing Dask integration
+
+Initial integration testing with Dask has been Dockerized.
+To invoke the test, run the following command from the `arrow`
+root directory:
+
+```shell
+bash dev/dask_integration.sh
+```
+
+This script creates a `dask` directory at the same level as
+`arrow`, clones the Dask project from GitHub into it, and does a
+Python `--user` install. The Docker container uses the parent
+directory of `arrow` as `$HOME`, so Python installs `dask` into a
+`.local` directory there.
+
+The output of the Docker session contains the results of the Dask
+dataframe tests, followed by the single integration test that
+currently exists for Arrow. That test creates a set of CSV files and
+then reads them in parallel into a Dask dataframe. The code for this
+test resides in the `dask_tests` directory.
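
A quick way to confirm the side effects described above is to look next to the
`arrow` checkout after a run (a sketch; the exact contents of `.local` depend
on the Dask install):

```shell
# The Dask clone and the Python --user install land in the parent directory
# of `arrow`, which the container uses as $HOME.
ls ..   # expect dask/ and .local/ alongside arrow/
```
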
diff --git a/python/testing/dask_tests/test_dask_integration.py b/python/testing/dask_tests/test_dask_integration.py
new file mode 100644
index 0000000..e678348
--- /dev/null
+++ b/python/testing/dask_tests/test_dask_integration.py
@@ -0,0 +1,51 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from datetime import date, timedelta
+import csv
+from random import randint
+import dask.dataframe as dd
+import pyarrow as pa
+
+def make_datafiles(tmpdir, prefix='data', num_files=20):
+    rowcount = 5000
+    fieldnames = ['date', 'temperature', 'dewpoint']
+    start_date = date(1900, 1, 1)
+    for i in range(num_files):
+        filename = '{0}/{1}-{2}.csv'.format(tmpdir, prefix, i)
+        with open(filename, 'w') as outcsv:
+            writer = csv.DictWriter(outcsv, fieldnames)
+            writer.writeheader()
+            the_date = start_date
+            for _ in range(rowcount):
+                temperature = randint(-10, 35)
+                dewpoint = temperature - randint(0, 10)
+                writer.writerow({'date': the_date, 'temperature': temperature,
+                                 'dewpoint': dewpoint})
+                the_date += timedelta(days=1)
+
+def test_dask_file_read(tmpdir):
+    prefix = 'data'
+    make_datafiles(tmpdir, prefix)
+    # Read all datafiles in parallel
+    datafiles = '{0}/{1}-*.csv'.format(tmpdir, prefix)
+    dask_df = dd.read_csv(datafiles)
+    # Convert Dask dataframe to Arrow table
+    table = pa.Table.from_pandas(dask_df.compute())
+    # Second column (1) is temperature
+    dask_temp = int(1000 * dask_df['temperature'].mean().compute())
+    arrow_temp = int(1000 * table[1].to_pandas().mean())
+    assert dask_temp == arrow_temp

-- 
To stop receiving notification emails like this one, please contact
['"commits@arrow.apache.org" <commits@arrow.apache.org>'].
