arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: 'Plain' Dataset Python API doesn't memory map?
Date Thu, 30 Apr 2020 14:00:08 GMT
For the record (and I'm fairly sure I've said this elsewhere), I don't
agree with toggling memory mapping at the filesystem level. If a
filesystem supports memory mapping, then a consumer of the filesystem
should IMHO be able to request a memory map.

On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche
<jorisvandenbossche@gmail.com> wrote:
>
> Hi Dan,
>
> Currently, the memory mapping in the Datasets API is controlled by the
> filesystem. So to enable memory mapping for feather, you can do:
>
> import pyarrow.dataset as ds
> from pyarrow.fs import LocalFileSystem
>
> fs = LocalFileSystem(use_mmap=True)
> t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
>
> Can you check whether that works for you?
> We should document this better (and there is also some discussion about
> the best API for this, see https://issues.apache.org/jira/browse/ARROW-8156
> and https://issues.apache.org/jira/browse/ARROW-8307).
>
> Joris
>
> On Thu, 30 Apr 2020 at 01:58, Daniel Nugent <nugend@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to use the 0.17 dataset API to map in an arrow table in the
>> uncompressed feather format (ultimately hoping to work with data larger
>> than memory). It seems like it reads all the constituent files into
>> memory before creating the Arrow table object though.
>>
>> When I use the FeatherDataset API, it does appear to memory-map the
>> files, and the Table is created from the mapped data.
>>
>> Any hints at what I'm doing wrong? I didn't see any options relating to
>> memory mapping for the general datasets.
>>
>> Here's the code for the plain dataset api call:
>>
>>     from pyarrow.dataset import dataset as ds
>>     t = ds('demo', format='feather').to_table()
>>
>> Here's the code for reading using the FeatherDataset api:
>>
>>     from pyarrow.feather import FeatherDataset as ds
>>     from pathlib import Path
>>     t = ds(list(Path('demo').iterdir())).read_table()
>>
>> Thanks!
>>
>> -Dan Nugent
