arrow-user mailing list archives

From Daniel Nugent <nug...@gmail.com>
Subject Re: 'Plain' Dataset Python API doesn't memory map?
Date Sun, 03 May 2020 07:10:22 GMT
Thanks Joris. That did the trick.

-Dan Nugent
On Apr 30, 2020, 10:01 -0400, Wes McKinney <wesmckinn@gmail.com>, wrote:
> For the record, and as I'm fairly sure I've stated elsewhere, I don't
> agree with toggling memory mapping at the filesystem level. If a
> filesystem supports memory mapping, then a consumer of the filesystem
> should IMHO be able to request a memory map.
>
> On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche
> <jorisvandenbossche@gmail.com> wrote:
> >
> > Hi Dan,
> >
> > Currently, the memory mapping in the Datasets API is controlled by the filesystem.
> > So to enable memory mapping for feather, you can do:
> >
> > import pyarrow.dataset as ds
> > from pyarrow.fs import LocalFileSystem
> >
> > fs = LocalFileSystem(use_mmap=True)
> > t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
> >
> > Can you check whether that works for you?
> > We should document this better (there is also some discussion about the best
> > API for this; see https://issues.apache.org/jira/browse/ARROW-8156 and
> > https://issues.apache.org/jira/browse/ARROW-8307).
> >
> > Joris
> >
> > On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <nugend@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > > I'm trying to use the 0.17 dataset API to map in an arrow table in the uncompressed
> > > > feather format (ultimately hoping to work with data larger than memory). It seems like it
> > > > reads all the constituent files into memory before creating the Arrow table object though.
> > >
> > > > When I use the FeatherDataset API, it does appear to memory map the files, and
> > > > the Table is created based off of mapped data.
> > >
> > > > Any hints at what I'm doing wrong? I didn't see any options relating to memory
> > > > mapping for the general datasets.
> > >
> > > Here's the code for the plain dataset api call:
> > >
> > > from pyarrow.dataset import dataset as ds
> > > > t = ds('demo', format='feather').to_table()
> > >
> > > Here's the code for reading using the FeatherDataset api:
> > >
> > > from pyarrow.feather import FeatherDataset as ds
> > > from pathlib import Path
> > > t = ds(list(Path('demo').iterdir())).read_table()
> > >
> > > Thanks!
> > >
> > > -Dan Nugent
