arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacob Zelko <jacobsze...@gmail.com>
Subject Does Arrow Support Larger-than-Memory Handling?
Date Thu, 22 Oct 2020 18:38:55 GMT
Hi all,

Very basic question as I have seen conflicting sources. I come from the
Julia community and was wondering if Arrow can handle larger-than-memory
datasets? I saw this post by Wes McKinney here discussing that the tooling
is being laid down:

Table columns in Arrow C++ can be chunked, so that appending to a table is
a zero copy operation, requiring no non-trivial computation or memory
allocation. By designing up front for streaming, chunked tables, appending
to existing in-memory tabler is computationally inexpensive relative to
pandas now. Designing for chunked or streaming data is also essential for
implementing out-of-core algorithms, so we are also laying the foundation
for processing larger-than-memory datasets.

~ *Apache Arrow and the “10 Things I Hate About pandas”*
<https://wesmckinney.com/blog/apache-arrow-pandas-internals/>

And then in the docs I saw this:

The pyarrow.dataset module provides functionality to efficiently work with
tabular, potentially larger than memory and multi-file datasets:

   - A unified interface for different sources: supporting different
   sources and file formats (Parquet, Feather files) and different file
   systems (local, cloud).
   - Discovery of sources (crawling directories, handle directory-based
   partitioned datasets, basic schema normalization, ..)
   - Optimized reading with predicate pushdown (filtering rows), projection
   (selecting columns), parallel reading or fine-grained managing of tasks.

Currently, only Parquet and Feather / Arrow IPC files are supported. The
goal is to expand this in the future to other file formats and data sources
(e.g. database connections).

~ *Tabular Datasets* <https://arrow.apache.org/docs/python/dataset.html>

The article from Wes was from 2017 and the snippet on Tabular Datasets is
from the current documentation for pyarrow.

Could anyone answer this question or at least clear up my confusion for me?
Thank you!
-- 
Jacob Zelko
Georgia Institute of Technology - Biomedical Engineering B.S. '20
Corning Community College - Engineering Science A.S. '17
Cell Number: (607) 846-8947

Mime
View raw message