arrow-user mailing list archives

From Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
Subject Re: why it takes so long to read a parquet file with 300 000 columns
Date Mon, 01 Mar 2021 13:22:01 GMT
I understand that this does not answer the question, but it may be worth
pointing out regardless: if you control the writing, it may be more
suitable to encode the columns and use a link (edge) list for the problem:
encode each column by a number x and store the data as two columns. For
example:

id, x0, x1, x2, ...
0, 0, 1, 0
1, 1, 1, 0
2, 1, 1, 1

becomes

id, x
0, 1  // id=0,x1=1
1, 0  // id=1,x0=1
1, 1  // id=1,x1=1
2, 0  // id=2,x0=1
2, 1  // id=2,x1=1
2, 2  // id=2,x2=1

This approach is often used in complex (sparse) networks and can
significantly reduce the number of writes and reads. Performance depends on
the problem, so this is just an idea.
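
A rough, untested sketch of the idea with pyarrow (the toy matrix and the
file name are just placeholders):

#### python3

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Toy dense matrix from the example above: rows are ids, columns are x0..x2.
dense = np.array([[0, 1, 0],
                  [1, 1, 0],
                  [1, 1, 1]], dtype=np.bool_)

# Keep only the coordinates of the 1s: one (id, x) row per presence.
ids, xs = np.nonzero(dense)
pq.write_table(pa.table({"id": ids, "x": xs}), "presence_sparse.parquet")

# Reading back: rebuild the dense matrix from the (id, x) pairs.
pairs = pq.read_table("presence_sparse.parquet")
matrix = np.zeros(dense.shape, dtype=np.bool_)
matrix[pairs["id"].to_numpy(), pairs["x"].to_numpy()] = True

Since only the 1s are stored, the file always has two columns, no matter how
many samples there are.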

Best,
Jorge

On Mon, Mar 1, 2021 at 2:08 PM jonathan mercier <jonathan.mercier@cnrgh.fr>
wrote:

> Thanks for the hint.
> I did not see a to_numpy method on the Table object, so I think I have to
> do it manually in Python.
>
> something like:
>
> #### python3
>
> import pyarrow.parquet as pq
> import numpy as np
> data = pq.read_table(dataset_path)
> matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)
> # Copy each column (a ChunkedArray) into the preallocated boolean matrix.
> for i, col in enumerate(data.columns):
>     matrix[:, i] = col.to_numpy()
>
>
>
>
> On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote:
> > Others will probably give you better hints, but
> >
> > You do not need to convert to Pandas. Read it in Arrow and convert to
> > numpy directly, if numpy is what you want.
> >
> > BR,
> >
> > Jacek
> >
> > On Mon, 1 Mar 2021 at 11:24, jonathan mercier <jonathan.mercier@cnrgh.fr>
> > wrote:
> > >
> > > Dear all,
> > >
> > > I am trying to study 300 000 samples of SARS-CoV-2 with parquet/pyarrow,
> > > so I have a table with 300 000 columns and around 45 000 rows of
> > > presence/absence (0/1). It is a file of ~150 MB.
> > >
> > > I read this file like this:
> > >
> > > import numpy
> > > import pyarrow.parquet as pq
> > >
> > > data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)
> > >
> > > And this statement takes 1 hour …
> > > So is there a trick to speed up loading this data into memory?
> > > Is it possible to distribute the loading with a library such as Ray?
> > >
> > > thanks
> > >
> > > Best regards
> > >
> > >
> > > --
> > > Researcher computational biology
> > > PhD, Jonathan MERCIER
> > >
> > > Bioinformatics (LBI)
> > > 2, rue Gaston Crémieux
> > > 91057 Evry Cedex
> > >
> > > Tel: (+33)1 60 87 83 44
> > > Email: jonathan.mercier@cnrgh.fr
>
> --
> Researcher computational biology
> PhD, Jonathan MERCIER
>
> Bioinformatics (LBI)
> 2, rue Gaston Crémieux
> 91057 Evry Cedex
>
> Tel: (+33)1 60 87 83 44
> Email: jonathan.mercier@cnrgh.fr
