Very basic question, as I have seen conflicting sources. I come from the Julia community and was wondering: can Arrow handle larger-than-memory datasets? I saw this post by Wes McKinney here discussing how the tooling is being laid down:
Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tables is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
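To make the chunking point concrete, here is a minimal Python sketch (the table contents are made up for illustration) of what I understand he means: concatenating tables produces chunked columns that reference the existing buffers rather than copying the data:

```python
import pyarrow as pa

# Two small tables standing in for successive batches of data.
t1 = pa.table({"x": [1, 2, 3]})
t2 = pa.table({"x": [4, 5]})

# Concatenation is zero-copy: the result's "x" column is a ChunkedArray
# holding references to the original buffers, not a new allocation.
combined = pa.concat_tables([t1, t2])
print(combined.column("x").num_chunks)  # -> 2
```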
And then in the docs I saw this:
The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
- A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization, ..)
- Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
~ Tabular Datasets
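Based on those docs, my understanding is something like the sketch below (the directory path and column names are placeholders I made up): the dataset is scanned lazily and streamed back in batches, so the whole thing never has to fit in memory at once:

```python
import pyarrow.dataset as ds

# Point at a directory of Parquet files; this only discovers the files
# and reads metadata, no row data is loaded yet.
dataset = ds.dataset("path/to/parquet_dir", format="parquet")

# Projection (selecting columns) and predicate pushdown (filtering rows)
# are applied at scan time, and results come back one RecordBatch at a
# time rather than as a single in-memory table.
scanner = dataset.scanner(columns=["a", "b"], filter=ds.field("a") > 0)
for batch in scanner.to_batches():
    print(batch.num_rows)  # process one RecordBatch at a time
```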
The article from Wes is from 2017, and the snippet on Tabular Datasets is from the current pyarrow documentation.
Could anyone answer this question or at least clear up my confusion for me? Thank you!

--
Jacob Zelko
Georgia Institute of Technology - Biomedical Engineering B.S. '20
Corning Community College - Engineering Science A.S. '17
Cell Number: (607) 846-8947