arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Does Arrow Support Larger-than-Memory Handling?
Date Thu, 22 Oct 2020 20:49:09 GMT
I'm not sure where the conflict in what's written online is, but because
Arrow's data structures do not require memory buffers to be RAM-resident
(i.e. they can reference memory maps), we are well set up to process
larger-than-memory datasets. In C++, at least, we are putting the pieces in
place for efficient query execution on on-disk datasets, and it may already
be possible in Rust with DataFusion.
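As an illustration, a minimal sketch of that kind of on-disk, batch-at-a-time
query execution with pyarrow.dataset (assuming a recent pyarrow; the directory,
column names and filter are hypothetical):

    import pyarrow.dataset as ds

    # The files stay on disk; only metadata is inspected up front.
    dataset = ds.dataset("data/", format="parquet")

    # Column projection and predicate pushdown; batches are produced one at a
    # time, so peak memory stays near the size of a single batch.
    scanner = dataset.scanner(
        columns=["id", "value"],
        filter=ds.field("value") > 0,
    )
    for batch in scanner.to_batches():
        ...  # process each RecordBatch, then let it be garbage collected

Each batch can be processed and discarded, so the full dataset never has to be
resident in memory at once.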

On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger <chris@techascent.com> wrote:
>
> There are ways to handle datasets larger than memory.  mmap'ing one or more arrow files
> and going from there is a pathway forward here:
>
> https://techascent.com/blog/memory-mapping-arrow.html
>
> How this maps to other software ecosystems I don't know but many have mmap support.
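A minimal sketch of that mmap pathway in Python with pyarrow (the file name is
hypothetical and assumes an Arrow IPC file-format file on disk):

    import pyarrow as pa

    source = pa.memory_map("data.arrow", "r")  # maps the file, reads nothing yet
    reader = pa.ipc.open_file(source)          # only the footer/metadata is read
    table = reader.read_all()                  # column buffers reference the mapping

The table's buffers point into the memory map, so the operating system pages
data in on demand and the table can be larger than physical RAM.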
>
> On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.pliszka@gmail.com> wrote:
>>
>> I believe it would be good if you defined your use case.
>>
>> I do handle larger-than-memory datasets with pyarrow using dataset.scan, but
>> my use case is very specific, as I am repartitioning and cleaning somewhat
>> large datasets.
>>
>> BR,
>>
>> Jacek
>>
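A hedged sketch of that kind of streaming repartitioning with the dataset API
(assuming a pyarrow version that provides pyarrow.dataset.write_dataset; the
paths and the partition column are hypothetical):

    import pyarrow.dataset as ds

    src = ds.dataset("raw/", format="parquet")

    # write_dataset consumes the source one batch at a time, so the whole
    # dataset never needs to fit in memory.
    ds.write_dataset(
        src,
        "repartitioned/",
        format="parquet",
        partitioning=["year"],  # hypothetical partition column
    )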
>> On Thu, Oct 22, 2020 at 8:39 PM Jacob Zelko <jacobszelko@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > Very basic question, as I have seen conflicting sources. I come from the Julia
>> > community and was wondering whether Arrow can handle larger-than-memory datasets.
>> > I saw this post by Wes McKinney here discussing that the tooling is being laid down:
>> >
>> > Table columns in Arrow C++ can be chunked, so that appending to a table is a
>> > zero-copy operation, requiring no non-trivial computation or memory allocation.
>> > By designing up front for streaming, chunked tables, appending to existing
>> > in-memory tables is computationally inexpensive relative to pandas now. Designing
>> > for chunked or streaming data is also essential for implementing out-of-core
>> > algorithms, so we are also laying the foundation for processing
>> > larger-than-memory datasets.
>> >
>> > ~ Apache Arrow and the “10 Things I Hate About pandas”
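A small sketch of the chunked-table behaviour that passage describes, using
pyarrow from Python (the column name and values are made up):

    import pyarrow as pa

    t1 = pa.table({"x": [1, 2, 3]})
    t2 = pa.table({"x": [4, 5]})

    # concat_tables is zero-copy: the result's columns are ChunkedArrays that
    # reuse the existing chunks rather than copying the data.
    combined = pa.concat_tables([t1, t2])
    print(combined.column("x").num_chunks)  # -> 2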
>> >
>> > And then in the docs I saw this:
>> >
>> > The pyarrow.dataset module provides functionality to efficiently work with tabular,
>> > potentially larger than memory and multi-file datasets:
>> >
>> > - A unified interface for different sources: supporting different sources and file
>> >   formats (Parquet, Feather files) and different file systems (local, cloud).
>> > - Discovery of sources (crawling directories, handle directory-based partitioned
>> >   datasets, basic schema normalization, ..)
>> > - Optimized reading with predicate pushdown (filtering rows), projection (selecting
>> >   columns), parallel reading or fine-grained managing of tasks.
>> >
>> > Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to
>> > expand this in the future to other file formats and data sources (e.g. database connections).
>> >
>> > ~ Tabular Datasets
>> >
>> > The article from Wes was from 2017 and the snippet on Tabular Datasets is from the
>> > current documentation for pyarrow.
>> >
>> > Could anyone answer this question or at least clear up my confusion for me? Thank you!
>> >
>> > --
>> > Jacob Zelko
>> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
>> > Corning Community College - Engineering Science A.S. '17
>> > Cell Number: (607) 846-8947
