Hi Amine,
I don't think there is anything in the core Arrow library that helps with
this at the moment. The most efficient way to do something like this would
probably be custom C/C++ code for the conversion, but I'm not an expert
in numpy.
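That said, if you want to stay in Python, it may be possible to vectorize
the conversion with numpy by working directly on the map column's
offsets/keys/items. A rough, untested sketch, assuming the column is
map<int, float> with no null rows (the column name "features" in the usage
comment is just a placeholder):

import numpy as np

def maps_to_dense(column, num_rows, num_cols):
    # column: a pyarrow ChunkedArray of map<int, float>.
    # Preallocate the dense output; unmapped entries stay 0.
    out = np.zeros((num_rows, num_cols), dtype=np.float32)
    row = 0
    for chunk in column.chunks:
        o = chunk.offsets.to_numpy()  # len(chunk) + 1 offsets
        # keys/items are the flattened child arrays; slice by the
        # offsets in case the chunk is a slice of a larger buffer.
        keys = chunk.keys.to_numpy()[o[0]:o[-1]]
        vals = chunk.items.to_numpy()[o[0]:o[-1]]
        lengths = np.diff(o)  # number of (key, value) pairs per row
        # Row index for every flattened (key, value) pair.
        rows = np.repeat(np.arange(len(chunk)) + row, lengths)
        out[rows, keys] = vals  # the map keys double as column indices
        row += len(chunk)
    return out

# Usage (hypothetical column name):
# dense = maps_to_dense(table.column("features"), table.num_rows, num_cols)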
Micah
On Tue, Nov 24, 2020 at 7:41 PM Amine Boubezari <boubezari.amine@gmail.com>
wrote:
> Hello, I have a question regarding best practices with Apache Arrow. I
> have a very large dataset (tens of millions of rows) stored as a
> partitioned Parquet dataset on disk. I load this dataset into memory as a
> pyarrow.Table and drop all columns except one, which is of type MapType
> mapping integers to floats. This column represents sparse feature vector
> data to be used in an ML context. Call the number of rows "num_rows". My
> job is to transform this column to a 2D numpy array of shape ("num_rows" x
> "num_cols") where both rows and cols are known before hand. If one of my
> pyarrow.Table rows looks like [(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)] and
> "num_cols" = 10, then that row in the numpy array would look like [0, 3.4,
> 4.4, 0, 5.4, 0, 6.4, 0, 0, 0, 0], where unmapped values are just 0. My 2D
> numpy array would just be the collection of rows from the pyarrow.Table
> transformed in such a way. What is the best, most efficient way to
> accomplish this, considering I have tens of millions of rows? Assume I have
> enough memory to fit the entire dataset.
> Note that I can use table.to_pandas() to get a pandas DataFrame, and then
> map functions over the pandas Series, if that would help in the solution.
> So far I have been stumped, however; df.to_numpy() has not been helpful here.
>
