arrow-user mailing list archives

From Fernando Herrera <fernando.j.herr...@gmail.com>
Subject Re: why it takes so long to read a Parquet file with 300 000 columns
Date Mon, 01 Mar 2021 15:13:29 GMT
@jonathan. If your file is already a Parquet file you can read it with
pandas using pd.read_parquet. If it isn't, I have found that it is better
to use any other method to read the file and then create the dataframe. Once
you have it, save it as Parquet with DataFrame.to_parquet.
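
For what it's worth, a minimal sketch of that round trip (the file name is
only illustrative):

import pandas as pd

# If the file is already Parquet, read it straight into a DataFrame
df = pd.read_parquet("samples.parquet")

# If it isn't, build the DataFrame by whatever means works, then persist it
# as Parquet so later reads are fast
df.to_parquet("samples.parquet")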

@jorge. Would the structure you suggest be better represented like this?

Id, x
0, [1]
1, [0,1]
2, [0,1,2]

And if you ever need to store values you could use this

Id, x
0, {1:9}
1, {0:8,1:5}
2, {0:6,1:3,2:1}

Which represents

id, x0, x1, x2, ...
0, 0, 9, 0
1, 8, 5, 0
2, 6, 3, 1
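
A rough pyarrow sketch of those two columns (the types and column names here
are just assumptions, not settled choices):

import pyarrow as pa

# list<int64> column: for each id, the indices of the non-zero x columns
x_indices = pa.array([[1], [0, 1], [0, 1, 2]], type=pa.list_(pa.int64()))

# map<int64, int64> column: for each id, index -> value of the non-zero entries
x_values = pa.array(
    [[(1, 9)], [(0, 8), (1, 5)], [(0, 6), (1, 3), (2, 1)]],
    type=pa.map_(pa.int64(), pa.int64()),
)

table = pa.table({"id": [0, 1, 2], "x": x_indices, "x_values": x_values})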

I've been thinking about this because I want to find a way to represent
graphs in Arrow.

Fernando,

On Mon, 1 Mar 2021, 13:22 Jorge Cardoso Leitão, <jorgecarleitao@gmail.com>
wrote:

> I understand that this does not answer the question, but it may be worth
> pointing out regardless: if you control the writing, it may be more
> suitable to encode the columns and use a linked list for the problem: encode
> each column by a number x and store the data as two columns. For example:
>
> id, x0, x1, x2, ...
> 0, 0, 1, 0
> 1, 1, 1, 0
> 2, 1, 1, 1
>
> becomes
>
> id, x
> 0, 1  // id=0,x1=1
> 1, 0  // id=1,x0=1
> 1, 1  // id=1,x1=1
> 2, 0  // id=2,x0=1
> 2, 1  // id=2,x1=1
> 2, 2  // id=2,x2=1
>
> This approach is often used in complex (sparse) networks and can lead to a
> significantly lower number of stores and reads. Performance depends on
> the problem, so this is just an idea.
>
> Best,
> Jorge
>
> On Mon, Mar 1, 2021 at 2:08 PM jonathan mercier <jonathan.mercier@cnrgh.fr>
> wrote:
>
>> Thanks for the hint.
>> I did not see a to_numpy method on the Table object, so I think I have to do
>> it manually in Python.
>>
>> something like:
>>
>> #### python3
>>
>> import pyarrow.parquet as pq
>> import numpy as np
>>
>> # Read the Parquet file as an Arrow Table
>> data = pq.read_table(dataset_path)
>>
>> # Pre-allocate a boolean matrix and fill it one column at a time
>> matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)
>> for i, col in enumerate(data.columns):
>>     matrix[:, i] = col.to_numpy()
>>
>>
>>
>>
>> On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote:
>> > Others will probably give you better hints, but
>> >
>> > You do not need to convert to pandas. Read with Arrow and convert to
>> > NumPy directly, if NumPy is what you want.
>> >
>> > BR,
>> >
>> > Jacek
>> >
>> > Mon, 1 Mar 2021 at 11:24, jonathan mercier <jonathan.mercier@cnrgh.fr>
>> > wrote:
>> > >
>> > > Dear,
>> > >
>> > > I am trying to study 300 000 samples of SARS-CoV-2 with parquet/pyarrow,
>> > > so I have a table with 300 000 columns and around 45 000 rows of
>> > > presence/absence (0/1). It is a file of ~150 MB.
>> > >
>> > > I read this file like this:
>> > >
>> > > import numpy
>> > > import pyarrow.parquet as pq
>> > >
>> > > data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)
>> > >
>> > > And this statement takes 1 hour…
>> > > So is there a trick to speed up loading this data into memory?
>> > > Is it possible to distribute the loading with a library such as Ray?
>> > >
>> > > thanks
>> > >
>> > > Best regards
>> > >
>> > >
>>
>> --
>>                 Researcher computational biology
>>                 PhD, Jonathan MERCIER
>>
>>                 Bioinformatics (LBI)
>>                 2, rue Gaston Crémieux
>>                 91057 Evry Cedex
>>
>>                 Tel: (+33)1 60 87 83 44
>>                 Email: jonathan.mercier@cnrgh.fr
