arrow-user mailing list archives

From Daniel Nugent <nug...@gmail.com>
Subject Re: Reading large csv file with pyarrow
Date Tue, 18 Feb 2020 15:35:34 GMT
Exposing streaming CSV reads would be useful for ETL processes, independent of the Datasets API.
On Feb 18, 2020, 03:25 -0500, Wes McKinney <wesmckinn@gmail.com>, wrote:
> Yes, that looks right. There will need to be corresponding work in
> Python to make this available (probably through the datasets API)
>
> On Mon, Feb 17, 2020 at 12:35 PM Daniel Nugent <nugend@gmail.com> wrote:
> >
> > ARROW-3410 maybe?
> > On Feb 17, 2020, 07:47 -0500, Wes McKinney <wesmckinn@gmail.com>, wrote:
> >
> > I seem to recall discussions about reading CSV files one chunk at a
> > time. Such an API is not yet available in Python. This is also
> > required for the C++ Datasets API. If there are not already one or more
> > JIRA issues about this, I suggest that we open some to capture the use cases.
> >
> > On Fri, Feb 14, 2020 at 3:16 PM filippo medri <filippo.medri@gmail.com> wrote:
> >
> >
> > Hi,
> > While experimenting with Arrow's read_csv function to convert a CSV file to Parquet, I
> > found that it reads all of the data into memory.
> > On the other hand, the ReadOptions class lets you specify a block_size parameter to limit
> > how many bytes are processed at a time, but judging by the memory usage, my understanding is
> > that the underlying Table is still filled with all of the data.
> > Is there a way to at least specify a parameter that limits the read to a batch of rows?
> > I see that I can skip rows from the beginning, but I cannot find a way to limit how many
> > rows are read.
> > What is the intended way to read a CSV file that does not fit into memory?
> > Thanks in advance,
> > Filippo Medri
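
The batch-of-rows read asked about above can be sketched without pyarrow at all. This is only a rough illustration using Python's standard csv module, not the Arrow API discussed in this thread; the function name iter_row_batches and the batch_size parameter are made up for the example, and it assumes the file has a header line.

```python
import csv
from itertools import islice
from typing import Iterator, List


def iter_row_batches(path: str, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` rows, keyed by the header line.

    Hypothetical helper: only one batch is held in memory at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls the next `batch_size` rows from the reader
            # without consuming the rest of the file.
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch
```

Each yielded batch could then be converted and appended to an output file before the next one is read, so peak memory depends on batch_size rather than on the size of the input file. All values arrive as strings here; type inference would have to be added separately.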
