From: "Uwe L. Korn" <uwelk@xhochy.com>
To: user@arrow.apache.org
Subject: Re: parquet file in S3, is there a way to read a subset of all the columns in python
Date: Fri, 12 Oct 2018 15:13:35 +0200
That looks nice. Once you have wrapped that in a class that implements read and seek like a Python file object, you should be able to pass it to `pyarrow.parquet.read_table`. When you then set the columns argument on that function, only the respective byte ranges are requested from S3. To minimise the number of requests, I would suggest implementing the S3 file with the exact ranges provided from the outside, but when using pyarrow, you should wrap your S3 file in an io.BufferedReader. pyarrow.parquet requests exactly the ranges it needs, but for object stores like S3 those requests can be too small and too numerous. There you often want to trade reading a few more bytes for a smaller number of requests.
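[Editor's note: the advice above can be sketched as follows. This is a minimal illustration, not code from the thread: `RangeFile` and its `fetch` callback are hypothetical names, and `fetch(start, end)` stands in for an S3 ranged GET returning bytes `[start, end)`. Wrapping it in `io.BufferedReader` coalesces pyarrow's many small reads into fewer, larger range requests.]

```python
import io


class RangeFile(io.RawIOBase):
    """Read-only, seekable file over a byte-range fetcher.

    fetch(start, end) must return the bytes in [start, end), e.g. an
    S3 GET with Range='bytes=<start>-<end - 1>'. Both the class and
    the callback are illustrative, not part of any library API.
    """

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def readinto(self, b):
        # RawIOBase builds read()/readall() on top of this method.
        n = min(len(b), max(self._size - self._pos, 0))
        if n == 0:
            return 0
        data = self._fetch(self._pos, self._pos + n)
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)
```

With an S3-backed `fetch`, `io.BufferedReader(RangeFile(fetch, size), buffer_size=...)` would then be passed to `pyarrow.parquet.read_table`; the buffer size controls the bytes-versus-requests tradeoff described above.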

Uwe


On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
This works in boto3:
import boto3

obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
stream = obj.get(Range='bytes=10-100')['Body']
print(stream.read())
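[Editor's note: one detail of the snippet above worth spelling out: HTTP byte ranges are inclusive on both ends, so `Range='bytes=10-100'` returns 91 bytes (offsets 10 through 100). A small helper, with an illustrative name, makes the off-by-one explicit:]

```python
def range_header(start, length):
    """Build an HTTP Range header value for `length` bytes at `start`.

    HTTP ranges are inclusive on both ends, so the end offset is
    start + length - 1, not start + length.
    """
    return "bytes=%d-%d" % (start, start + length - 1)
```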

On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <uwelk@xhochy.com> wrote:

Hello Luke,

this is only partly implemented. You can do this, and I already have, but it is sadly not in a perfect state.

boto3 itself seems to be lacking a proper file-like class. You can get the contents of a file in S3 as a botocore StreamingBody: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody . This sadly seems to be missing a seek method.

In my case I accessed parquet files on S3 with per-column access using the simplekv project. There a small file-like class is implemented on top of boto (but not boto3): https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93 . This is what you are looking for, just with the wrong boto package; also, as far as I know, this implementation sadly leaks HTTP connections, so when you access too many files (even serially), your network will suffer.

Cheers
Uwe


On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
I have parquet files (each self contained) in S3 and I want to read certain columns into a pandas dataframe without reading the entire object out of S3.  

Is this implemented?  boto3 in Python supports reading from offsets in an S3 object, but I wasn't sure whether anyone has made that work with a parquet file so that only certain columns are read?

thanks,
Luke

