arrow-dev mailing list archives

From jdesjardins <jdesjard...@cloudera.com>
Subject Re: Implementing (ARROW-1119) [Python] Enable reading Parquet data sets from Amazon S3
Date Fri, 23 Jun 2017 17:52:24 GMT




Sent from my Verizon, Samsung Galaxy smartphone
-------- Original message --------
From: Wes McKinney <wesmckinn@gmail.com>
Date: 6/23/17 12:44 PM (GMT-05:00)
To: dev@arrow.apache.org
Subject: Re: Implementing (ARROW-1119) [Python] Enable reading Parquet data sets from Amazon S3
I started a conversation with the DMLC developers who have C++11
implementations of both S3 and Azure FS that they are maintaining

https://github.com/dmlc/dmlc-core/issues/273

On Thu, Jun 22, 2017 at 9:18 AM, Wes McKinney <wesmckinn@gmail.com> wrote:

> If you want to use pure Python, you should probably just use the s3fs
> package. We should be able to get better throughput using C++ (using
> multithreading to make multiple requests for larger reads) -- the AWS
> C++ SDK probably has everything we need to make a really strong
> implementation.
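The multithreaded-read idea Wes describes can be sketched roughly as follows. A local file stands in for an S3 object here so the sketch runs anywhere; in a real implementation each worker would issue an HTTP range request through the AWS SDK instead of a seek-and-read. The function names and chunk size are illustrative only.

```python
import concurrent.futures
import os

def read_range(path, offset, length):
    # With a real S3 backend this would be a GET with a Range header.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def parallel_read(path, chunk_size=4):
    # Split the object into fixed-size ranges and fetch them concurrently,
    # then reassemble the pieces in order.
    size = os.path.getsize(path)
    ranges = [(off, min(chunk_size, size - off))
              for off in range(0, size, chunk_size)]
    with concurrent.futures.ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: read_range(path, *r), ranges)
    return b"".join(parts)
```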
>
> Dato/Turi created an S3 file source implementation in C++
> https://github.com/turi-code/SFrame/blob/master/oss_src/fileio/s3_fstream.hpp,
> that is BSD licensed and does not depend on
> the (quite large) AWS C++ SDK, so that might not be a bad place to start.
>
> On Thu, Jun 22, 2017 at 9:01 AM, Colin Nichols <colin@bam-x.com> wrote:
>
>> I am using a pa.PythonFile() wrapping the file-like object provided by
>> the s3fs package. I am able to write Parquet files directly to S3 this
>> way. I am not reading with pyarrow (I read gzipped CSVs with plain
>> Python), but I imagine reading would work much the same.
>>
>> -- sent from my phone --
>>
>> > On Jun 22, 2017, at 00:54, Kevin Moore <kevin@quiltdata.io> wrote:
>> >
>> > Has anyone started looking into how to read data sets from S3? I started
>> > looking into it and wondered if anyone has a design in mind.
>> >
>> > We could implement an S3FileSystem class in pyarrow/filesystem.py. The
>> > filesystem components could probably be written against the AWS Python
>> SDK.
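The shape Kevin proposes might look something like the sketch below. The client argument follows boto3's S3 client calls (list_objects_v2 / get_object) but is injected so the class can be exercised without AWS; the ls/open method names mirror the existing pyarrow filesystem interface, and everything else is hypothetical.

```python
import io

class S3FileSystem:
    def __init__(self, client, bucket):
        # client is a boto3-style S3 client; injected for testability.
        self.client = client
        self.bucket = bucket

    def ls(self, prefix):
        # List keys under a prefix (a single page only, for brevity).
        resp = self.client.list_objects_v2(Bucket=self.bucket, Prefix=prefix)
        return [obj["Key"] for obj in resp.get("Contents", [])]

    def open(self, key):
        # Simplest possible read path: fetch the whole object into memory
        # and hand back a file-like object.
        body = self.client.get_object(Bucket=self.bucket, Key=key)["Body"]
        return io.BytesIO(body.read())
```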
>> >
>> > The HDFS file system and file classes, however, are implemented at least
>> > partially in Cython & C++. Is there an advantage to doing that for S3
>> too?
>> >
>> > Thanks,
>> >
>> > Kevin
>> >
>> > ----
>> > Kevin Moore
>> > CEO, Quilt Data, Inc.
>> > kevin@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
>> > (415) 497-7895
>> >
>> >
>> > Data packages for fast, reproducible data science
>> > quiltdata.com
>>
>
>