Hi Jonathan,
This won't directly help your situation, but Parquet generally scales better with fewer columns and more rows, so at least transposing the data would help with load time (I also agree that modelling the data with even fewer columns, as suggested above, would help even more).
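
As an illustration, here is a minimal sketch of writing the data transposed (toy sizes and hypothetical column names; the real matrix would be ~45,000 x ~300,000):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Toy presence/absence matrix: rows = variants/features, columns = samples
matrix = np.random.randint(0, 2, size=(4, 6), dtype=np.uint8)

# Transpose so each sample becomes a row and each feature a column,
# giving far fewer (and much longer) Parquet columns
transposed = matrix.T
table = pa.table({f"feature_{j}": transposed[:, j] for j in range(transposed.shape[1])})
pq.write_table(table, "transposed.parquet")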

What version of pyarrow are you using? In the past there have been some regressions [1] that used O(N^2) algorithms to decode metadata; it is possible one has crept back in.
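
You can check the installed version with:

import pyarrow
print(pyarrow.__version__)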

-Micah

[1] https://issues.apache.org/jira/browse/ARROW-7059



On Mon, Mar 1, 2021 at 8:06 AM Jacek Pliszka <jacek.pliszka@gmail.com> wrote:
It should be - if you need a cast,

t.column(i).cast(...) uses the Arrow cast.
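
For example, a minimal sketch (toy data and a hypothetical column name; assumes your pyarrow version supports the integer-to-boolean cast):

import pyarrow as pa

t = pa.table({"present": pa.array([0, 1, 1, 0], type=pa.int8())})
# ChunkedArray.cast dispatches to Arrow's compute cast kernels
bool_col = t.column(0).cast(pa.bool_())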

BR,

Jacek

On Mon, Mar 1, 2021 at 17:04 Jacek Pliszka <jacek.pliszka@gmail.com> wrote:
>
> Use np.column_stack and list comprehension:
>
> t = pq.read_table('a.pq')
> matrix = np.column_stack([t.column(i) for i in range(t.num_columns)])
>
> If you need a cast - use the pyarrow or numpy one, depending on your case.
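>
> A minimal sketch combining the two (toy file name 'a.pq'; the numpy cast is used here):
>
> import numpy as np
> import pyarrow.parquet as pq
>
> t = pq.read_table('a.pq')
> # Stack the columns into a 2-D numpy array, then cast to boolean with numpy
> matrix = np.column_stack(
>     [t.column(i).to_numpy() for i in range(t.num_columns)]
> ).astype(np.bool_)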
>
> BR,
>
> Jacek
>
> On Mon, Mar 1, 2021 at 14:07 jonathan mercier <jonathan.mercier@cnrgh.fr> wrote:
> >
> > Thanks for the hint.
> > I did not see a to_numpy method on the Table object, so I think I have
> > to do it manually in Python
> >
> > something like:
> >
> > #### python3
> >
> > import pyarrow.parquet as pq
> > import numpy as np
> > data = pq.read_table(dataset_path)
> > matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)
> > for i, col in enumerate(data.columns):
> >     matrix[:, i] = col.to_numpy()
> >
> >
> >
> >
> > On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote:
> > > Other will probably give you better hints but
> > >
> > > You do not need to convert to pandas. Read it in with Arrow and convert to
> > > numpy directly, if numpy is what you want.
> > >
> > > BR,
> > >
> > > Jacek
> > >
> > > On Mon, Mar 1, 2021 at 11:24 jonathan mercier <jonathan.mercier@cnrgh.fr> wrote:
> > > >
> > > > Dear all,
> > > >
> > > > I am trying to study 300,000 samples of SARS-CoV-2 with parquet/pyarrow,
> > > > so I have a table with 300,000 columns and around 45,000 rows of
> > > > presence/absence (0/1). It is a file of ~150 MB.
> > > >
> > > > I read this file like this:
> > > >
> > > > import numpy
> > > > import pyarrow.parquet as pq
> > > > data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)
> > > >
> > > > And this statement takes 1 hour …
> > > > So is there a trick to speed up loading this data into memory?
> > > > Is it possible to distribute the loading with a library such as Ray?
> > > >
> > > > thanks
> > > >
> > > > Best regards
> > > >
> > > >
> > > > --
> > > > Jonathan MERCIER, PhD
> > > > Researcher in computational biology
> > > > Bioinformatics (LBI)
> > > > 2, rue Gaston Crémieux
> > > > 91057 Evry Cedex
> > > > Tel: (+33) 1 60 87 83 44
> > > > Email: jonathan.mercier@cnrgh.fr