From: Grant Williams
Date: Sun, 27 Jun 2021 08:22:51 -0500
Subject: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?
To: user@arrow.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000b53c3505c5bf45cd"

--000000000000b53c3505c5bf45cd
Content-Type: text/plain; charset="UTF-8"

Hello,

I've been wondering whether there is a use case for using the iter_batches method in Python as an iterator, in a similar style to a server-side cursor in Postgres. Right now you can iterate over record batches, but I wondered whether some sort of Python-native iterator might be worth having. Maybe a .to_pyiter() method that converts it to a lazy, batched iterator of native Python objects?

Here is some example code that produces a similar result:

    from itertools import chain
    from typing import Any, Iterator, Tuple


    def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
        # Read the file lazily, one record batch at a time
        record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)

        # Convert from the columnar format of pyarrow arrays to a row format
        # of Python objects (yields tuples)
        yield from chain.from_iterable(
            zip(*(col.to_pylist() for col in batch.columns))
            for batch in record_batches
        )

(or a gist if you prefer: https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)

I realize Arrow is a columnar format, but I wonder whether buffered row reading as a lazy iterator is a common enough use case, given how common parquet + object storage has become as a database alternative.

Thanks,
Grant

--
Grant Williams
Machine Learning Engineer
https://github.com/grantmwilliams/

--000000000000b53c3505c5bf45cd
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--000000000000b53c3505c5bf45cd--
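
As a rough illustration of the columnar-to-row conversion the message describes, here is a minimal sketch that runs without pyarrow. The names rows_from_column_batches and fake_batches are hypothetical and not part of any library; each "batch" is assumed to be a plain list of column lists, standing in for what col.to_pylist() would return for each column of a record batch.

```python
from itertools import chain
from typing import Any, Iterable, Iterator, Sequence, Tuple


def rows_from_column_batches(
    batches: Iterable[Sequence[Sequence[Any]]],
) -> Iterator[Tuple[Any, ...]]:
    """Lazily yield row tuples from batches given as lists of columns."""
    # zip(*columns) transposes one batch from column-major to row-major;
    # chain.from_iterable stitches the batches together one at a time.
    return chain.from_iterable(zip(*columns) for columns in batches)


def fake_batches() -> Iterator[Sequence[Sequence[Any]]]:
    # Hypothetical stand-in for parquet_file.iter_batches(): two batches,
    # each holding an integer column and a string column.
    yield [[1, 2], ["a", "b"]]
    yield [[3], ["c"]]


rows = rows_from_column_batches(fake_batches())
print(next(rows))  # (1, 'a')
print(list(rows))  # [(2, 'b'), (3, 'c')]
```

Because both chain.from_iterable and the generator expression are lazy, the second batch is only requested once the rows of the first have been consumed, which is the cursor-like behavior the message is asking about.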