arrow-user mailing list archives

From "Uwe L. Korn" <>
Subject Re: parquet file in S3, is there a way to read a subset of all the columns in python
Date Fri, 12 Oct 2018 13:13:35 GMT
That looks nice. Once you have wrapped that in a class that implements
read and seek like a Python file object, you should be able to pass it
to `pyarrow.parquet.read_table`. When you then set the `columns` argument
on that function, only the respective byte ranges are requested from S3.
To minimise the number of requests, I would suggest implementing the S3
file with the exact ranges provided from the outside, but wrapping your
S3 file in an `io.BufferedReader` when using pyarrow. pyarrow.parquet
requests exactly the ranges it needs, but that can sometimes be too
fine-grained for object stores like S3. There you often want to trade
requesting a few more bytes for a smaller number of requests.
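A minimal sketch of what Uwe describes: a seekable, read-only file object backed by ranged reads, wrapped in an `io.BufferedReader` to coalesce pyarrow's many small requests. The `RangedFile` class and the `fetch_range` callable are hypothetical names, not part of any library; the demo below uses an in-memory byte string in place of S3, but with boto3 you could back `fetch_range` with `obj.get(Range=f'bytes={start}-{stop-1}')['Body'].read()` and `size` with `obj.content_length`.

```python
import io


class RangedFile(io.RawIOBase):
    """Seekable read-only file over ranged GETs (illustrative sketch).

    size        -- total object length in bytes
    fetch_range -- callable (start, stop) -> bytes for that half-open range;
                   with boto3 this would issue an S3 GET with a Range header.
    """

    def __init__(self, size, fetch_range):
        self._size = size
        self._fetch = fetch_range
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def readinto(self, b):
        # RawIOBase derives read()/readall() from this method.
        stop = min(self._pos + len(b), self._size)
        if stop <= self._pos:
            return 0  # EOF
        data = self._fetch(self._pos, stop)
        n = len(data)
        b[:n] = data
        self._pos += n
        return n


# Demo: in-memory stand-in for an S3 object, recording each range request.
blob = bytes(range(256)) * 4
calls = []

def fetch(start, stop):
    calls.append((start, stop))
    return blob[start:stop]

raw = RangedFile(len(blob), fetch)
# BufferedReader turns many tiny reads into fewer, larger range requests.
f = io.BufferedReader(raw, buffer_size=64)
f.seek(100)
chunk = f.read(8)  # one backend fetch covers at least these 8 bytes
```

You could then hand `f` to `pyarrow.parquet.read_table(f, columns=[...])`, which accepts file-like objects; tuning `buffer_size` is the knob for the bytes-versus-requests tradeoff Uwe mentions.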

On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
> This works in boto3:
> import boto3
> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
> stream = obj.get(Range='bytes=10-100')['Body']
> print(stream.read())
> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <> wrote:
>> Hello Luke,
>> this is only partly implemented. You can do this, and I already have,
>> but it is sadly not in a perfect state.
>>
>> boto3 itself seems to be lacking a proper file-like class. You can
>> get the contents of a file in S3 as a stream, but this sadly seems to
>> be missing a seek method.
>>
>> In my case I accessed parquet files on S3 with per-column access
>> using the simplekv project. There a small file-like class is
>> implemented on top of boto (but not boto3). This is what you are
>> looking for, just in the wrong boto package; also, as far as I know,
>> that implementation sadly leaks HTTP connections, so when you access
>> too many files (even serially), your network will suffer.
>> Cheers
>> Uwe
>> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>>> I have parquet files (each self-contained) in S3 and I want to read
>>> certain columns into a pandas DataFrame without reading the entire
>>> object out of S3.
>>>
>>> Is this implemented? boto3 in Python supports reading from offsets
>>> in an S3 object, but I wasn't sure whether anyone has made that work
>>> to read only the byte ranges of a parquet file corresponding to
>>> certain columns.
>>> thanks,
>>> Luke
