From: Ira Saktor
Date: Thu, 25 Mar 2021 12:41:08 +0100
Subject: [Python] How to know what partitions will dataset.write_dataset affect when writing?
To: user@arrow.apache.org

Hello,

I am trying to overwrite partitions when writing a table to HDFS using pyarrow. What is the recommended way to figure out which directories I should clear before writing the dataset?

My current approach is to convert the pyarrow.Table to a pandas DataFrame, group by the partitioning columns, and work out from the groups which directories will be affected. However, I'd like to avoid the conversion to pandas if possible. Since pyarrow determines where to write the data quite quickly, I hope I can somehow reuse the way it detects the target paths.

Thank you!

Best regards,
Ira
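
For illustration, the idea of deriving the affected directories from the distinct combinations of partition values could be sketched roughly as below, without pandas. This is only a sketch under stated assumptions, not how pyarrow resolves paths internally: it assumes hive-style `key=value` directory names, and the function name `affected_partition_dirs`, the example column names, and the base directory are all hypothetical. The input is a plain mapping of column name to values, which for a pyarrow table could be obtained with `table.select(partition_cols).to_pydict()`.

```python
import posixpath
from typing import List, Mapping, Sequence


def affected_partition_dirs(
    columns: Mapping[str, Sequence],
    base_dir: str,
    partition_cols: Sequence[str],
) -> List[str]:
    """Return the sorted hive-style directories the rows would fall into.

    Assumption: directories are named key=value, one level per partition
    column, in the order given by partition_cols. Null values and Arrow's
    own value formatting (e.g. for dates or floats) are not handled here.
    """
    # One tuple of partition values per row, reduced to distinct combinations.
    value_lists = [columns[name] for name in partition_cols]
    distinct = set(zip(*value_lists))

    # Build base_dir/key1=value1/key2=value2 for each distinct combination.
    return sorted(
        posixpath.join(
            base_dir,
            *(f"{key}={value}" for key, value in zip(partition_cols, combo)),
        )
        for combo in distinct
    )


# Hypothetical example data; with pyarrow this mapping could come from
# table.select(["year", "month"]).to_pydict().
example_columns = {
    "year": [2020, 2020, 2021],
    "month": [1, 1, 3],
}
example_dirs = affected_partition_dirs(
    example_columns, "/data/events", ["year", "month"]
)
print(example_dirs)
```

The conversion to Python objects is likely slower than staying inside Arrow, so this is mainly meant to pin down what "affected directories" means here. The resulting strings should be checked against the paths that are actually written before anything is deleted.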