arrow-dev mailing list archives

From Colin Nichols <co...@bam-x.com>
Subject Re: Implementing (ARROW-1119) [Python] Enable reading Parquet data sets from Amazon S3
Date Thu, 22 Jun 2017 13:01:42 GMT
I am using pa.PythonFile() to wrap the file-like object provided by the s3fs package, and I am able to write Parquet files directly to S3 this way. I am not reading with pyarrow (I'm reading gzipped CSVs with plain Python), but I imagine reading would work much the same.
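
Roughly, the write path looks like this (bucket/key names are placeholders and credentials come from the usual AWS environment):

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # picks up credentials from the environment

    table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['col'])

    # s3fs hands back a file-like object; pa.PythonFile wraps it so
    # pyarrow can write through it
    with fs.open('my-bucket/path/data.parquet', 'wb') as f:
        pq.write_table(table, pa.PythonFile(f))

Reading should presumably be the mirror image (open in 'rb' and hand the wrapped file to pq.read_table), but I haven't exercised that path with pyarrow myself.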

-- sent from my phone --

> On Jun 22, 2017, at 00:54, Kevin Moore <kevin@quiltdata.io> wrote:
> 
> Has anyone started looking into how to read data sets from S3? I started
> looking into it and wondered if anyone has a design in mind.
> 
> We could implement an S3FileSystem class in pyarrow/filesystem.py. The
> filesystem components could probably be written against the AWS Python SDK.
> 
> The HDFS file system and file classes, however, are implemented at least
> partially in Cython & C++. Is there an advantage to doing that for S3 too?
> 
> Thanks,
> 
> Kevin
> 
> ----
> Kevin Moore
> CEO, Quilt Data, Inc.
> kevin@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
> (415) 497-7895
> 
> 
> Data packages for fast, reproducible data science
> quiltdata.com
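
As for the S3FileSystem class you mention, a rough sketch of a pure-Python shim on top of boto3 (the AWS Python SDK) might look something like the below; the class and method names are only illustrative and aren't tied to pyarrow's actual filesystem interface:

    import boto3

    class S3FileSystem:
        """Illustrative sketch only, not pyarrow's real interface."""

        def __init__(self, bucket):
            self.bucket = bucket
            self.client = boto3.client('s3')

        def ls(self, prefix):
            # List object keys under a prefix (single page, for brevity)
            resp = self.client.list_objects_v2(Bucket=self.bucket,
                                               Prefix=prefix)
            return [obj['Key'] for obj in resp.get('Contents', [])]

        def open(self, key):
            # Returns a non-seekable streaming body; a real implementation
            # would need a seekable, file-like wrapper, since the Parquet
            # reader has to seek to the footer
            return self.client.get_object(Bucket=self.bucket, Key=key)['Body']

The awkward part is open(): the Parquet footer lives at the end of the file, so the reader needs seek support, which means buffering or ranged GETs. That is a big part of why s3fs is convenient here.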
