arrow-user mailing list archives

From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: 'Plain' Dataset Python API doesn't memory map?
Date Thu, 30 Apr 2020 07:27:06 GMT
Hi Dan,

Currently, the memory mapping in the Datasets API is controlled by the
filesystem. So to enable memory mapping for feather, you can do:

import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

fs = LocalFileSystem(use_mmap=True)
t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
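
If you want to check whether the mapping actually kicked in, one rough way
(just a sketch; it assumes a local 'demo' directory of uncompressed Feather
files, as in your example) is to look at how much Arrow allocates during the
read. With use_mmap=True, the table's buffers should mostly reference the
mapped files rather than the default memory pool:

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import LocalFileSystem

fs = LocalFileSystem(use_mmap=True)

before = pa.total_allocated_bytes()
t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
after = pa.total_allocated_bytes()

# With memory mapping, the growth here should be much smaller than the
# size of the files on disk; without it, it should be roughly comparable
# to the data size.
print('allocated during read:', after - before)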

Can you check whether that works for you?
We should document this better (and there is indeed some discussion about
the best API for this; see
https://issues.apache.org/jira/browse/ARROW-8156 and
https://issues.apache.org/jira/browse/ARROW-8307).

Joris

On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <nugend@gmail.com> wrote:

> Hi,
>
> I'm trying to use the 0.17 Datasets API to memory-map an Arrow table in
> the uncompressed Feather format (ultimately hoping to work with data
> larger than memory). It seems to read all the constituent files into
> memory before creating the Arrow Table object, though.
>
> When I use the FeatherDataset API, it does appear to memory-map the
> files, and the Table is created from the mapped data.
>
> Any hints as to what I'm doing wrong? I didn't see any options relating
> to memory mapping for the general dataset API.
>
> Here's the code for the plain dataset api call:
>
>     from pyarrow.dataset import dataset as ds
>     t = ds('demo', format='feather').to_table()
>
> Here's the code for reading using the FeatherDataset api:
>
>     from pyarrow.feather import FeatherDataset as ds
>     from pathlib import Path
>     t = ds(list(Path('demo').iterdir())).read_table()
>
> Thanks!
>
> -Dan Nugent
>
