arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: parquet file in S3, is there a way to read a subset of all the columns in python
Date Sun, 14 Oct 2018 19:29:23 GMT
You should be able to use s3fs, both via the file handles it creates and as a
filesystem for reading multi-file datasets:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L1441
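
For example, a minimal sketch (the bucket, key, and column names here are placeholders,
not from this thread):

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# Single file: open a file handle and read only the listed columns
with fs.open('mybucketfoo/foo.parquet', 'rb') as f:
    table = pq.read_table(f, columns=['a', 'b'])

# Multi-file dataset: pass the s3fs instance as the filesystem
dataset = pq.ParquetDataset('mybucketfoo/dataset/', filesystem=fs)
df = dataset.read(columns=['a', 'b']).to_pandas()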
On Fri, Oct 12, 2018 at 12:03 PM Luke <virtualluke@gmail.com> wrote:
>
> It looks like https://github.com/dask/s3fs implements these methods. Would there need
> to be a wrapper over this for Arrow, or is it compatible as is?
>
> -Luke
>
> On Fri, Oct 12, 2018 at 9:13 AM Uwe L. Korn <uwelk@xhochy.com> wrote:
>>
>> That looks nice. Once you have wrapped that in a class that implements read and seek
>> like a Python file object, you should be able to pass it to `pyarrow.parquet.read_table`.
>> When you then set the columns argument on that function, only the respective byte ranges
>> are requested from S3. pyarrow.parquet requests exactly the ranges it needs, but that can
>> sometimes be too fine-grained for object stores like S3, where you often want to trade
>> requesting a few extra bytes for a smaller number of requests. So to minimise the number
>> of requests, I would suggest implementing the S3 file to fetch exactly the ranges it is
>> asked for, but wrapping it in an io.BufferedReader when you use it with pyarrow.
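>>
>> As a rough sketch of that combination (S3File here stands for a hypothetical seekable
>> wrapper like the one sketched further down the thread; the buffer size is illustrative):
>>
>> import io
>> import pyarrow.parquet as pq
>>
>> # The buffer coalesces pyarrow's many small reads into fewer, larger S3 requests
>> raw = S3File('mybucketfoo', 'foo.parquet')
>> buffered = io.BufferedReader(raw, buffer_size=4 * 1024 * 1024)
>> table = pq.read_table(buffered, columns=['a', 'b'])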
>>
>> Uwe
>>
>>
>> On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
>>
>> This works in boto3:
>>
>> import boto3
>>
>> # Fetch only bytes 10-100 of the object via an HTTP Range request
>> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
>> stream = obj.get(Range='bytes=10-100')['Body']
>> print(stream.read())
>>
>>
>> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <uwelk@xhochy.com> wrote:
>>
>>
>> Hello Luke,
>>
>> this is only partly implemented. You can do this, and I have done so myself, but it is
>> sadly not in a perfect state.
>>
>> boto3 itself seems to be lacking a proper file-like class. You can get the contents of
>> a file in S3 as a StreamingBody (https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody),
>> but that sadly seems to be missing a seek method.
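>>
>> A minimal sketch of such a seekable wrapper on top of boto3 (untested; the class and
>> attribute names are illustrative):
>>
>> import io
>> import boto3
>>
>> class S3File(io.RawIOBase):
>>     """A read-only, seekable file over an S3 object, using Range requests."""
>>
>>     def __init__(self, bucket, key):
>>         self.obj = boto3.resource('s3').Object(bucket, key)
>>         self.size = self.obj.content_length
>>         self.pos = 0
>>
>>     def readable(self):
>>         return True
>>
>>     def seekable(self):
>>         return True
>>
>>     def tell(self):
>>         return self.pos
>>
>>     def seek(self, offset, whence=io.SEEK_SET):
>>         if whence == io.SEEK_SET:
>>             self.pos = offset
>>         elif whence == io.SEEK_CUR:
>>             self.pos += offset
>>         elif whence == io.SEEK_END:
>>             self.pos = self.size + offset
>>         return self.pos
>>
>>     def read(self, size=-1):
>>         # Translate the current position and size into an HTTP Range header
>>         end = self.size - 1 if size < 0 else min(self.pos + size, self.size) - 1
>>         if self.pos > end:
>>             return b''
>>         data = self.obj.get(Range='bytes=%d-%d' % (self.pos, end))['Body'].read()
>>         self.pos += len(data)
>>         return data
>>
>>     def readinto(self, b):
>>         # io.BufferedReader reads through readinto()
>>         data = self.read(len(b))
>>         b[:len(data)] = data
>>         return len(data)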
>>
>> In my case I accessed Parquet files on S3 with per-column access using the simplekv
>> project. There, a small file-like class is implemented on top of boto (though not boto3):
>> https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
>> That is what you are looking for, just in the wrong boto package. Also, as far as I know,
>> this implementation sadly leaks HTTP connections, so when you access too many files (even
>> serially), your network will suffer.
>>
>> Cheers
>> Uwe
>>
>>
>> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>>
>> I have Parquet files (each self-contained) in S3, and I want to read certain columns
>> into a pandas dataframe without reading the entire object out of S3.
>>
>> Is this implemented? boto3 supports reading from byte offsets in an S3 object, but I
>> wasn't sure whether anyone has made that work with the byte ranges of a Parquet file
>> that correspond to certain columns.
>>
>> thanks,
>> Luke
>>
>>
>>
