From: Wes McKinney
Date: Sun, 14 Oct 2018 15:29:23 -0400
Subject: Re: parquet file in S3, is there a way to read a subset of all the columns in python
To: user@arrow.apache.org

You should be able to use s3fs, both the file handles it creates as well as a filesystem to read multifile datasets:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L1441

On Fri, Oct 12, 2018 at 12:03 PM Luke wrote:
>
> It looks like https://github.com/dask/s3fs implements these methods. Would there need to be a wrapper over this for arrow or is it compatible as is?
>
> -Luke
>
> On Fri, Oct 12, 2018 at 9:13 AM Uwe L. Korn wrote:
>>
>> That looks nice. Once you have wrapped that in a class that implements read and seek like a Python file object, you should be able to pass this to `pyarrow.parquet.read_table`. When you then set the columns argument on that function, only the respective byte ranges are then requested from S3. To minimise the number of requests, I would suggest you to implement the S3 file with the exact ranges provided from the outside but when using pyarrow, you should wrap your S3 file in an io.BufferedReader. pyarrow.parquet requests exactly the ranges it needs but that can sometimes be too coarse for object stores like S3. There you often like to do the tradeoff of requesting some bytes more for a fewer number of requests.
>>
>> Uwe
>>
>>
>> On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
>>
>> This works in boto3:
>>
>> import boto3
>>
>> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
>> stream = obj.get(Range='bytes=10-100')['Body']
>> print(stream.read())
>>
>>
>> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn wrote:
>>
>>
>> Hello Luke,
>>
>> this is only partly implemented. You can do this and I already did do this but this is sadly not in a perfect state.
>>
>> boto3 itself seems to be lacking a proper file-like class.
>> You can get the contents of a file in S3 as https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody . This sadly seems to be missing a seek method.
>>
>> In my case I did access parquet files on S3 with per-column access using the simplekv project. There a small file-like class is implemented on top of boto (but not boto3): https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93 . This is what you are looking for, just the wrong boto package as well as I know that this implementation is sadly leaking http-connections and thus when you access too many files (even in serial) at once, your network will suffer.
>>
>> Cheers
>> Uwe
>>
>>
>> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>>
>> I have parquet files (each self contained) in S3 and I want to read certain columns into a pandas dataframe without reading the entire object out of S3.
>>
>> Is this implemented? boto3 in python supports reading from offsets in an S3 object but I wasn't sure anyone has made that work with a parquet file corresponding to certain columns?
>>
>> thanks,
>> Luke
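
For reference, the wrapper Uwe describes in the thread (a class with read and seek over ranged requests, wrapped in an io.BufferedReader) could be sketched roughly as below. This is a minimal illustration, not code from simplekv, s3fs or pyarrow: the class name `RangeReader` and the `fetch(start, end)` callable are hypothetical, and the object size is assumed to be known up front. With boto3, `fetch` could presumably be built on the ranged `obj.get(Range=...)` call from Luke's snippet; here an in-memory byte string stands in for the S3 object so the sketch runs without network access.

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a ranged-fetch callable.

    `fetch(start, end)` is a hypothetical callable that returns the bytes
    from offset `start` to offset `end`, both inclusive (the same convention
    as an HTTP Range header). `size` is the total object size in bytes.
    """

    def __init__(self, fetch, size):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence!r}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, buf):
        # Returning 0 signals end of file to the buffering layer.
        if self._pos >= self._size or len(buf) == 0:
            return 0
        # Request exactly the range that was asked for, clipped to the object.
        end = min(self._pos + len(buf), self._size) - 1
        data = self._fetch(self._pos, end)
        n = len(data)
        buf[:n] = data
        self._pos += n
        return n


# Stand-in for an S3 object; every ranged request is recorded.
payload = bytes(range(256)) * 4
requests = []


def fetch(start, end):
    requests.append((start, end))
    return payload[start:end + 1]


# BufferedReader turns many small reads into fewer, larger ranged requests.
f = io.BufferedReader(RangeReader(fetch, len(payload)), buffer_size=64)
f.seek(-8, io.SEEK_END)
tail = f.read(8)
```

Going by Uwe's description, an object like `f` could then be passed to `pyarrow.parquet.read_table` together with its `columns` argument. The `buffer_size` value is the knob for the tradeoff he mentions: a larger buffer fetches some extra bytes in exchange for fewer requests.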