arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacek Pliszka <jacek.plis...@gmail.com>
Subject Re: Does Arrow Support Larger-than-Memory Handling?
Date Thu, 22 Oct 2020 18:46:47 GMT
I believe it would be good if you define your use case.

I do handle larger than memory datasets with pyarrow with the use of
dataset.scan but my use case is very specific as I am repartitioning
and cleaning a bit large datasets.

BR,

Jacek

czw., 22 paź 2020 o 20:39 Jacob Zelko <jacobszelko@gmail.com> napisał(a):
>
> Hi all,
>
> Very basic question as I have seen conflicting sources. I come from the Julia community
and was wondering if Arrow can handle larger-than-memory datasets? I saw this post by Wes
McKinney here discussing that the tooling is being laid down:
>
> Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy
operation, requiring no non-trivial computation or memory allocation. By designing up front
for streaming, chunked tables, appending to existing in-memory tabler is computationally inexpensive
relative to pandas now. Designing for chunked or streaming data is also essential for implementing
out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory
datasets.
>
> ~ Apache Arrow and the “10 Things I Hate About pandas”
>
> And then in the docs I saw this:
>
> The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially
larger than memory and multi-file datasets:
>
> A unified interface for different sources: supporting different sources and file formats
(Parquet, Feather files) and different file systems (local, cloud).
> Discovery of sources (crawling directories, handle directory-based partitioned datasets,
basic schema normalization, ..)
> Optimized reading with predicate pushdown (filtering rows), projection (selecting columns),
parallel reading or fine-grained managing of tasks.
>
> Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand
this in the future to other file formats and data sources (e.g. database connections).
>
> ~ Tabular Datasets
>
> The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current
documentation for pyarrow.
>
> Could anyone answer this question or at least clear up my confusion for me? Thank you!
>
> --
> Jacob Zelko
> Georgia Institute of Technology - Biomedical Engineering B.S. '20
> Corning Community College - Engineering Science A.S. '17
> Cell Number: (607) 846-8947

Mime
View raw message