arrow-user mailing list archives

From jonathan mercier <jonathan.merc...@cnrgh.fr>
Subject Why does it take so long to read a Parquet file with 300 000 columns?
Date Mon, 01 Mar 2021 10:23:01 GMT
Dear all,

I am trying to study 300 000 samples of SARS-CoV-2 with Parquet/PyArrow, so
I have a table with 300 000 columns and around 45 000 rows of
presence/absence values (0/1). The file is about 150 MB.

I read this file like this:

import numpy
import pyarrow.parquet as pq

data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)

This statement takes about 1 hour…
So is there a trick to speed up loading these data into memory?
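For instance, I wondered whether read options like these would make any
difference; this is only a guess on my part and I have not measured it
(memory_map/use_threads are read_table options, split_blocks/self_destruct
are to_pandas options):

import numpy
import pyarrow.parquet as pq

# Read with memory mapping and threads, then convert while trying to
# avoid extra copies during the pandas conversion.
table = pq.read_table(dataset_path, memory_map=True, use_threads=True)
data = table.to_pandas(split_blocks=True, self_destruct=True).to_numpy().astype(numpy.bool_)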
Is it possible to distribute the loading with a library such as Ray?
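What I imagine by "distributing the loading" is something along these lines;
it is only a sketch, the chunk size is made up and I have not tried it (each
worker would read a slice of the columns and the pieces would be stacked at
the end):

import numpy
import ray
import pyarrow.parquet as pq

ray.init()

# Read only the schema to get the 300 000 column names cheaply.
all_columns = pq.read_schema(dataset_path).names

@ray.remote
def load_columns(path, columns):
    # Each worker reads only its slice of columns.
    table = pq.read_table(path, columns=columns)
    return table.to_pandas().to_numpy().astype(numpy.bool_)

chunk = 10_000  # arbitrary chunk size, not tuned
futures = [load_columns.remote(dataset_path, all_columns[i:i + chunk])
           for i in range(0, len(all_columns), chunk)]
data = numpy.hstack(ray.get(futures))

Is that kind of approach expected to help here, or does the per-column
overhead dominate anyway?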

Thanks

Best regards


-- 
Jonathan MERCIER, PhD
Researcher, computational biology
Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex
Tel: (+33)1 60 87 83 44
Email: jonathan.mercier@cnrgh.fr

