arrow-user mailing list archives

From Chris Nuernberger <ch...@techascent.com>
Subject Re: Does Arrow Support Larger-than-Memory Handling?
Date Thu, 22 Oct 2020 19:11:29 GMT
There are ways to handle datasets larger than memory. mmap'ing one or more
Arrow files and going from there is one path forward:

https://techascent.com/blog/memory-mapping-arrow.html

How this maps to other software ecosystems I don't know, but many of them
have mmap support.
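
A minimal sketch of that route in pyarrow (assuming a local Arrow IPC file
named "data.arrow"; the file name is illustrative):

    import pyarrow as pa

    # Memory-map the file: Arrow buffers reference the mapping directly,
    # so the OS pages data in on demand rather than loading it up front.
    source = pa.memory_map("data.arrow", "r")
    table = pa.ipc.open_file(source).read_all()  # zero-copy read

    print(table.num_rows)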

On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka <jacek.pliszka@gmail.com>
wrote:

> I believe it would be good if you defined your use case.
>
> I do handle larger-than-memory datasets with pyarrow using
> dataset.scan, but my use case is very specific: I am repartitioning
> and doing some light cleaning of large datasets.
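>
> A minimal sketch of that pattern (assuming a directory of Parquet files
> under "input/" and a "year" column to partition on; both are
> illustrative):
>
>     import pyarrow.dataset as ds
>
>     dataset = ds.dataset("input/", format="parquet")
>
>     # Stream record batches instead of materializing the whole table,
>     # so peak memory stays around one batch at a time.
>     for batch in dataset.to_batches():
>         pass  # clean / transform one batch here
>
>     # Repartition by writing the dataset back out under a new scheme.
>     ds.write_dataset(dataset, "output/", format="parquet",
>                      partitioning=ds.partitioning(field_names=["year"]))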
>
> BR,
>
> Jacek
>
> On Thu, Oct 22, 2020 at 8:39 PM Jacob Zelko <jacobszelko@gmail.com> wrote:
> >
> > Hi all,
> >
> > Very basic question, as I have seen conflicting sources. I come from the
> Julia community and was wondering: can Arrow handle larger-than-memory
> datasets? I saw this post by Wes McKinney discussing how the tooling is
> being laid down:
> >
> > Table columns in Arrow C++ can be chunked, so that appending to a table
> is a zero copy operation, requiring no non-trivial computation or memory
> allocation. By designing up front for streaming, chunked tables, appending
> to existing in-memory tables is computationally inexpensive relative to
> pandas now. Designing for chunked or streaming data is also essential for
> implementing out-of-core algorithms, so we are also laying the foundation
> for processing larger-than-memory datasets.
> >
> > ~ Apache Arrow and the “10 Things I Hate About pandas”
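> >
> > A minimal sketch of that chunking behavior in pyarrow (with
> > illustrative data):
> >
> >     import pyarrow as pa
> >
> >     t1 = pa.table({"x": [1, 2, 3]})
> >     t2 = pa.table({"x": [4, 5, 6]})
> >
> >     # Concatenation keeps each input as a chunk of the result, so
> >     # "appending" tables does not copy the underlying buffers.
> >     combined = pa.concat_tables([t1, t2])
> >     print(combined.column("x").num_chunks)  # 2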
> >
> > And then in the docs I saw this:
> >
> > The pyarrow.dataset module provides functionality to efficiently work
> with tabular, potentially larger than memory and multi-file datasets:
> >
> > - A unified interface for different sources: supporting different
> sources and file formats (Parquet, Feather files) and different file
> systems (local, cloud).
> > - Discovery of sources (crawling directories, handling directory-based
> partitioned datasets, basic schema normalization, ...)
> > - Optimized reading with predicate pushdown (filtering rows), projection
> (selecting columns), parallel reading, or fine-grained managing of tasks.
> >
> > Currently, only Parquet and Feather / Arrow IPC files are supported. The
> goal is to expand this in the future to other file formats and data sources
> (e.g. database connections).
> >
> > ~ Tabular Datasets
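> >
> > For context, a minimal sketch of those features in pyarrow (assuming a
> > directory of Parquet files at "data/" with "id" and "value" columns;
> > all names are illustrative):
> >
> >     import pyarrow.dataset as ds
> >
> >     dataset = ds.dataset("data/", format="parquet")
> >
> >     # Projection (selecting columns) and predicate pushdown (filtering
> >     # rows) are applied at scan time, so only the needed data is read
> >     # into memory.
> >     table = dataset.to_table(
> >         columns=["id", "value"],
> >         filter=ds.field("value") > 0,
> >     )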
> >
> > The article from Wes is from 2017, and the snippet on Tabular Datasets
> is from the current pyarrow documentation.
> >
> > Could anyone answer this question or at least clear up my confusion for
> me? Thank you!
> >
> > --
> > Jacob Zelko
> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> > Corning Community College - Engineering Science A.S. '17
> > Cell Number: (607) 846-8947
>
