arrow-user mailing list archives

From "Uwe L. Korn" <uw...@xhochy.com>
Subject Re: parquet file in S3, is there a way to read a subset of all the columns in python
Date Thu, 11 Oct 2018 18:22:50 GMT
Hello Luke,

This is only partly implemented. It can be done, and I have done it
myself, but sadly it is not in a perfect state.
boto3 itself seems to be lacking a proper file-like class. You can get
the contents of an S3 object as a botocore StreamingBody:
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
Sadly, that class seems to be missing a seek method.
In my case I accessed Parquet files on S3 with per-column access using
the simplekv project. There, a small file-like class is implemented on
top of boto (but not boto3):
https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
This is what you are looking for, except that it uses the wrong boto
package. I also know that this implementation sadly leaks HTTP
connections, so when you access too many files at once (even serially),
your network will suffer.
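
For illustration, here is a minimal sketch of what such a seekable
file-like class could look like on top of boto3, assuming each read is
served by one ranged GET request. This is not the simplekv code; the
names RangeReader and s3_fetcher are made up for this example, and the
sketch does no caching or buffering, so every read costs one request.

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object backed by ranged reads.

    fetch_range(start, end) is an assumed callable that returns the bytes
    of the inclusive range [start, end] of the underlying object.
    """

    def __init__(self, fetch_range, size):
        super().__init__()
        self._fetch_range = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, n=-1):
        # Nothing left to read at or past the end of the object.
        if self._pos >= self._size:
            return b""
        if n is None or n < 0:
            n = self._size - self._pos
        if n == 0:
            return b""
        end = min(self._pos + n, self._size)
        # HTTP byte ranges are inclusive, hence end - 1.
        data = self._fetch_range(self._pos, end - 1)
        self._pos += len(data)
        return data


def s3_fetcher(client, bucket, key):
    """Hypothetical helper: build (fetch_range, size) for one S3 object.

    `client` is expected to behave like boto3.client("s3").
    """
    size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def fetch_range(start, end):
        resp = client.get_object(
            Bucket=bucket, Key=key, Range="bytes=%d-%d" % (start, end)
        )
        # Reading the StreamingBody to the end consumes the whole response.
        return resp["Body"].read()

    return fetch_range, size
```

With a real client, an object built as RangeReader(*s3_fetcher(client,
bucket, key)) could then be passed to pyarrow.parquet.read_table together
with its columns argument, so that only the footer and the requested
column chunks are fetched. I have not measured how many requests that
results in.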
Cheers
Uwe


On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
> I have parquet files (each self contained) in S3 and I want to read
> certain columns into a pandas dataframe without reading the entire
> object out of S3.
>
> Is this implemented?  boto3 in python supports reading from offsets in
> an S3 object but I wasn't sure anyone has made that work with a
> parquet file corresponding to certain columns?
>
> thanks,
> Luke

