arrow-user mailing list archives

From Luke <virtuall...@gmail.com>
Subject Re: parquet file in S3, is there a way to read a subset of all the columns in python
Date Thu, 11 Oct 2018 21:27:22 GMT
This works in boto3:

import boto3

obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
# Fetch only bytes 10..100 (inclusive) via an HTTP Range request
stream = obj.get(Range='bytes=10-100')['Body']
print(stream.read())
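
The ranged read above can be wrapped in a small seekable file-like object, which is the piece the quoted reply says boto3 lacks. The following is only a minimal sketch under stated assumptions: `RangeReader` and `fetch_range` are hypothetical names, not part of boto3 or pyarrow, and there is no buffering or connection handling, so every `read` call turns into one ranged request.

```python
import io


class RangeReader(io.RawIOBase):
    """Hypothetical seekable, read-only file object backed by ranged reads."""

    def __init__(self, fetch_range, size):
        # fetch_range(start, end) is an assumed callable returning the bytes
        # start..end inclusive, mirroring "Range: bytes=start-end".
        self._fetch = fetch_range
        self._size = size  # total object size in bytes
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def read(self, n=-1):
        # A negative or missing n means "read to the end of the object".
        if n is None or n < 0:
            n = self._size - self._pos
        end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""  # at or past the end: nothing to fetch
        data = self._fetch(self._pos, end - 1)  # inclusive end offset
        self._pos += len(data)
        return data
```

With boto3, `fetch_range` could plausibly be `lambda s, e: obj.get(Range=f'bytes={s}-{e}')['Body'].read()` and `size` could be `obj.content_length`. The resulting object could then be handed to `pyarrow.parquet.read_table(reader, columns=[...])`, so that only the footer and the requested column chunks are fetched. I have not tested that combination against S3, so treat it as an assumption.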


On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <uwelk@xhochy.com> wrote:

> Hello Luke,
>
> this is only partly implemented. It can be done, and I have done it myself,
> but the tooling is sadly not in a perfect state.
>
> boto3 itself seems to be lacking a proper file-like class. You can get the
> contents of a file in S3 as a StreamingBody:
> https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
> This sadly seems to be missing a seek method.
>
> In my case I accessed Parquet files on S3 with per-column access using
> the simplekv project. There, a small file-like class is implemented on top
> of boto (but not boto3):
> https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93 .
> This is what you are looking for, just for the wrong boto package. I also
> know that this implementation sadly leaks HTTP connections, so when you
> access too many files at once (even serially), your network will suffer.
>
> Cheers
> Uwe
>
>
> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>
> I have parquet files (each self contained) in S3 and I want to read
> certain columns into a pandas dataframe without reading the entire object
> out of S3.
>
> Is this implemented? boto3 in Python supports reading from offsets in an
> S3 object, but I wasn't sure whether anyone has made that work for reading
> only certain columns of a Parquet file.
>
> thanks,
> Luke
>
>
>
