arrow-user mailing list archives

From jonathan mercier <jonathan.merc...@cnrgh.fr>
Subject How to make a parquet dataset from an input file through Random access
Date Wed, 20 Jan 2021 21:50:36 GMT
Dear all,

library: pyarrow
version: current stable

I am trying to read a file that supports random access and convert its
data into a parquet dataset using pyarrow.

To do so, I use a pool executor to process the input data asynchronously.
Each process reads through a stream and returns a pyarrow.lib.Buffer at
the end. How can I merge all those buffers in order to get a single
Table?
I know how to do it from one buffer, but not from a collection of buffers:
 -  RecordBatchStreamReader(source).read_all()


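My current code:


from pyarrow import BufferOutputStream, record_batch
from pyarrow.ipc import RecordBatchStreamWriter

BATCH_SIZE = 70  # rows per batch, so that each write is around 1 MB of data

# The function name is illustrative; the original snippet is a function body.
def to_stream_buffer(iterator, a_schema):
    sink = BufferOutputStream()
    writer = RecordBatchStreamWriter(sink, a_schema)

    # Preallocated column buffers, reused for every batch
    buffer = ([None] * BATCH_SIZE, [None] * BATCH_SIZE)
    buffer_index = 0

    # Assumption: iterating yields records exposing the attributes a and b
    for record in iterator:
        if buffer_index == BATCH_SIZE:
            # Buffers are full: flush them as one record batch
            writer.write(record_batch([buffer[0], buffer[1]], schema=a_schema))
            buffer_index = 0
        buffer[0][buffer_index] = record.a
        buffer[1][buffer_index] = record.b
        buffer_index += 1

    # Flush the remaining rows, if any
    if buffer_index != 0:
        batch = record_batch([buffer[0][:buffer_index],
                              buffer[1][:buffer_index]],
                             schema=a_schema)
        writer.write(batch)
    writer.close()
    return sink.getvalue()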



Thanks for your help.

Best regards,

-- 
Researcher computational biology
PhD, Jonathan MERCIER

Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex

Tel: (33) 1 60 87 83 44
Email: jonathan.mercier@cnrgh.fr

