arrow-user mailing list archives

From jonathan mercier <jonathan.merc...@cnrgh.fr>
Subject Is it more efficient to store strings as fixed-size binary?
Date Sat, 28 Mar 2020 17:12:49 GMT
Dear all,

I am continuing to learn Arrow.

1/ I would like to know whether it is more efficient to store strings as
fixed-size binary rather than as variable-length strings.

example:

---------------------------------------------------------------------
from pyarrow import Table, binary, schema, array

def to_binaries(s: str, size: int = 50) -> bytes:
    # pad the ASCII-encoded string with NUL bytes up to `size` bytes
    nullchar = size - len(s)
    if nullchar < 0:
        raise Exception(f'String has more than {size} characters: {s}')
    return s.encode('ascii') + b'\0' * nullchar


fields = [('ID', binary(50))]
sc = schema(fields)
d = ['test', 'ab', 'bc', 'cd']
db = array([to_binaries(x) for x in d], type=binary(50))
t = Table.from_arrays(arrays=[db], schema=sc)

---------------------------------------------------------------------
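
For comparison, this is roughly how I would store the same data as a plain
(variable-length) string column; I only check `nbytes` on the two arrays to
compare their buffer sizes:

---------------------------------------------------------------------
from pyarrow import array, binary, string

d = ['test', 'ab', 'bc', 'cd']

# variable-length string column (offsets buffer + data buffer)
s_arr = array(d, type=string())

# fixed-size binary column, each value padded to 50 bytes
b_arr = array([to_binaries(x) for x in d], type=binary(50))

print(s_arr.nbytes)  # total buffer size of the string array
print(b_arr.nbytes)  # total buffer size of the fixed-width array
---------------------------------------------------------------------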

2/ I am not sure I understand the "Writing and Reading Streams" section of
the documentation:
https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-streams

If we use the provided writer, is the resulting file in the Parquet format?
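
For context, this is roughly how I write and read the stream at the moment
(a minimal sketch based on that page, reusing the table `t` from above):

---------------------------------------------------------------------
import pyarrow as pa

# write the table with the stream writer shown in the docs
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, t.schema)
writer.write_table(t)
writer.close()

# read the stream back to check the round trip
buf = sink.getvalue()
reader = pa.ipc.open_stream(buf)
t2 = reader.read_all()
---------------------------------------------------------------------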

Thanks

best regards

