arrow-user mailing list archives

From Luke <>
Subject Re: parquet file in S3, is there a way to read a subset of all the columns in python
Date Thu, 11 Oct 2018 21:27:22 GMT
This works in boto3:

import boto3

obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
# Fetch only bytes 10..100 (inclusive) of the object instead of the whole body
stream = obj.get(Range='bytes=10-100')['Body']
print(stream.read())
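
Building on the ranged read above, here is a minimal sketch of a read-only, seekable file-like wrapper, the kind of object a Parquet reader needs in order to jump to the footer and then to individual column chunks. The names RangeFile, fetch and open_s3 are hypothetical and not part of boto3 or pyarrow; the only boto3 pieces assumed are Object.get(Range=...) and the Object.content_length attribute. There is no caching or retry logic, so every read() turns into one HTTP request.

```python
import io


class RangeFile(io.RawIOBase):
    """Read-only, seekable file object backed by ranged reads (hypothetical helper)."""

    def __init__(self, fetch, size):
        super().__init__()
        # fetch(start, end) must return the bytes start..end, both inclusive.
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def read(self, n=-1):
        # n < 0 (or None) means "read up to the end of the object".
        if n is None or n < 0:
            n = self._size - self._pos
        end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end - 1)
        self._pos += len(data)
        return data

    def readinto(self, b):
        # Needed when the object is wrapped in a buffered reader.
        data = self.read(len(b))
        b[:len(data)] = data
        return len(data)


def open_s3(bucket, key):
    """Assumed adapter: one S3 GET with a Range header per read()."""
    import boto3  # imported lazily so RangeFile itself has no hard dependency

    obj = boto3.resource("s3").Object(bucket, key)

    def fetch(start, end):
        return obj.get(Range="bytes=%d-%d" % (start, end))["Body"].read()

    return RangeFile(fetch, obj.content_length)
```

If this behaves as intended, a call such as pyarrow.parquet.read_table(open_s3('mybucketfoo', 'foo'), columns=['a']) should only request the footer and the chunks of the selected columns. The column name 'a' is made up, and the sketch is untested against a real bucket.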

On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <> wrote:

> Hello Luke,
> this is only partly implemented. You can do this, and I have already done
> it, but sadly it is not in a perfect state.
> boto3 itself seems to be lacking a proper file-like class. You can get the
> contents of a file in S3 as
> This sadly seems to be missing a seek method.
> In my case, I accessed Parquet files on S3 with per-column access using
> the simplekv project. There, a small file-like class is implemented on top
> of boto (but not boto3):
> .
> This is what you are looking for, just built on the wrong boto package.
> I also know that this implementation sadly leaks HTTP connections, so
> when you access too many files at once (even serially), your network will
> suffer.
> Cheers
> Uwe
> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
> > I have parquet files (each self contained) in S3 and I want to read
> > certain columns into a pandas dataframe without reading the entire object
> > out of S3.
> > Is this implemented? boto3 in python supports reading from offsets in an
> > S3 object but I wasn't sure anyone has made that work with a parquet file
> > corresponding to certain columns?
> > thanks,
> > Luke
