arrow-user mailing list archives

From Amine Boubezari <boubezari.am...@gmail.com>
Subject How to best get data from pyarrow.Table column of MapType to a 2D numpy array?
Date Wed, 25 Nov 2020 03:41:05 GMT
Hello, I have a question regarding best practices with Apache Arrow. I have a very large dataset
(tens of millions of rows) stored as a partitioned Parquet dataset on disk. I load this dataset
into memory as a pyarrow.Table and drop all columns except one, which is of type MapType,
mapping integers to floats. This column represents sparse feature-vector data to be used in
an ML context. Call the number of rows "num_rows". My job is to transform this column into a
2D numpy array of shape ("num_rows" x "num_cols"), where both dimensions are known beforehand.
If one of my pyarrow.Table rows looks like [(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)]
and "num_cols" = 10, then that row in the numpy array would look like [0, 3.4, 4.4, 0, 5.4,
0, 6.4, 0, 0, 0], where unmapped indices are just 0. My 2D numpy array would be the
collection of all pyarrow.Table rows transformed in this way. What is the best, most
efficient way to accomplish this, considering I have tens of millions of rows? Assume I have
enough memory to fit the entire dataset.
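
For concreteness, here is a naive row-by-row sketch of the transformation I mean (illustrative
only; "col" stands for the MapType column as a pyarrow.ChunkedArray, and I am assuming each map
row converts to a list of (key, value) tuples via to_pylist()):

import numpy as np
import pyarrow as pa

def maps_to_dense_naive(col: pa.ChunkedArray, num_cols: int) -> np.ndarray:
    # Scatter each row's (key, value) pairs into a dense 2D array.
    num_rows = len(col)
    out = np.zeros((num_rows, num_cols), dtype=np.float64)
    for i, row in enumerate(col.to_pylist()):  # row: list of (key, value) tuples
        for key, value in row:
            out[i, key] = value
    return out

This produces the right answer, but the Python-level double loop is far too slow at my scale,
which is why I am asking about the right Arrow-native approach.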

Note that I can use table.to_pandas() to get a pandas DataFrame and then map functions over
the resulting Series, if that would help in the solution. So far, however, I have been stumped;
df.to_numpy() has not been helpful here.
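
One direction I have sketched, though I am not sure it is correct or idiomatic for MapArray,
is to avoid Python-level loops entirely by scattering from each chunk's flattened keys and
items (this assumes null-free, non-sliced chunks, and that a duplicate key within a row may
overwrite an earlier value):

import numpy as np

def maps_to_dense_vectorized(col, num_cols):
    # col: pyarrow.ChunkedArray of MapType(int, float); each chunk is a MapArray.
    out = np.zeros((len(col), num_cols), dtype=np.float64)
    start = 0
    for chunk in col.chunks:
        keys = chunk.keys.to_numpy(zero_copy_only=False)     # all keys, concatenated
        values = chunk.items.to_numpy(zero_copy_only=False)  # all values, concatenated
        lengths = np.diff(chunk.offsets.to_numpy())          # map length per row
        row_idx = start + np.repeat(np.arange(len(chunk)), lengths)
        out[row_idx, keys] = values                          # vectorized scatter
        start += len(chunk)
    return out

Is something along these lines reasonable, or is there a better-supported path?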