From user-return-405-archive-asf-public=cust-asf.ponee.io@arrow.apache.org Thu Apr 30 07:27:29 2020
Mailing-List: contact user-help@arrow.apache.org; run by ezmlm
Reply-To: user@arrow.apache.org
MIME-Version: 1.0
From: Joris Van den Bossche
Date: Thu, 30 Apr 2020 09:27:06 +0200
Subject: Re: 'Plain' Dataset Python API doesn't memory map?
To: user@arrow.apache.org
Content-Type: multipart/alternative; boundary="0000000000009eaa6d05a47cfe03"

--0000000000009eaa6d05a47cfe03
Content-Type: text/plain; charset="UTF-8"

Hi Dan,

Currently, memory mapping in the Datasets API is controlled by the
filesystem. So to enable memory mapping for Feather files, you can do:

    import pyarrow.dataset as ds
    from pyarrow.fs import LocalFileSystem

    fs = LocalFileSystem(use_mmap=True)
    t = ds.dataset('demo', format='feather', filesystem=fs).to_table()

Can you check whether that works for you?

We should document this better (there is also some discussion about the
best API for this, see https://issues.apache.org/jira/browse/ARROW-8156
and https://issues.apache.org/jira/browse/ARROW-8307).

Joris

On Thu, 30 Apr 2020 at 01:58, Daniel Nugent wrote:

> Hi,
>
> I'm trying to use the 0.17 dataset API to map in an Arrow table in the
> uncompressed Feather format (ultimately hoping to work with data larger
> than memory). It seems like it reads all the constituent files into memory
> before creating the Arrow table object, though.
>
> When I use the FeatherDataset API, it does appear to map the files,
> and the Table is created based off of mapped data.
>
> Any hints at what I'm doing wrong? I didn't see any options relating to
> memory mapping for the general datasets.
>
> Here's the code for the plain dataset API call:
>
>     from pyarrow.dataset import dataset as ds
>     t = ds('demo', format='feather').read_table()
>
> Here's the code for reading using the FeatherDataset API:
>
>     from pyarrow.feather import FeatherDataset as ds
>     from pathlib import Path
>     t = ds(list(Path('demo').iterdir())).read_table()
>
> Thanks!
>
> -Dan Nugent
>

--0000000000009eaa6d05a47cfe03
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--0000000000009eaa6d05a47cfe03--