arrow-user mailing list archives

From Jacek Pliszka <jacek.plis...@gmail.com>
Subject Re: why does it take so long to read a Parquet file with 300 000 columns
Date Mon, 01 Mar 2021 10:31:38 GMT
Others will probably give you better hints, but you do not need to
convert to Pandas: read the file with Arrow and convert to NumPy
directly, if NumPy is what you want.
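
For example, something along these lines should work; this is just a minimal
sketch, assuming your columns are plain 0/1 integer columns (the file name is a
placeholder for your dataset path):

import numpy
import pyarrow.parquet as pq

# Read the whole Parquet file into an Arrow table (multi-threaded by default).
table = pq.read_table("presence_absence.parquet")  # placeholder path

# Convert column by column and stack into a (rows, columns) boolean matrix,
# skipping the pandas DataFrame step entirely.
data = numpy.column_stack(
    [col.to_numpy() for col in table.itercolumns()]
).astype(numpy.bool_)

Building a 300 000-column pandas DataFrame is likely where most of the time
goes, so skipping that step should already help.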

BR,

Jacek

Mon, 1 Mar 2021 at 11:24, jonathan mercier <jonathan.mercier@cnrgh.fr>
wrote:
>
> Dear all,
>
> I am trying to study 300 000 samples of SARS-CoV-2 with Parquet/PyArrow,
> so I have a table with 300 000 columns and around 45 000 rows of
> presence/absence values (0/1). The file is about 150 MB.
>
> I read this file like this:
>
> import numpy
> import pyarrow.parquet as pq
>
> data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)
>
> And this statement takes 1 hour …
> Is there a trick to speed up loading these data into memory?
> Is it possible to distribute the loading with a library such as Ray?
>
> Thanks
>
> Best regards
>
>
> --
> Jonathan MERCIER, PhD
> Researcher, computational biology
> Bioinformatics (LBI)
> 2, rue Gaston Crémieux
> 91057 Evry Cedex
>
> Tel: (+33)1 60 87 83 44
> Email: jonathan.mercier@cnrgh.fr
>
>
>
