arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: How to load custom tabular text file to pyarrow ?
Date Mon, 23 Mar 2020 23:24:57 GMT
hi Jonathan -- generally my approach would be to write some Cython or
C/C++ code to create the file loader. Any time you write a file loader
that deals with individual table cells in pure Python, it's going to
suffer from performance problems.
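
For a simple tab-separated format like this, one option worth trying
before writing any custom code is Arrow's built-in CSV reader
(pyarrow.csv), which tokenizes and converts the column types in C++.
As far as I know it has no option to skip '#' comment lines, so the
sketch below filters them out in Python first; read_bed and the column
names are just illustrative:

import io
from pyarrow import csv

def read_bed(path):
    # Drop comment lines in Python, then hand the remaining bytes to
    # the Arrow C++ CSV reader for tokenizing and type conversion.
    with open(path, 'rb') as f:
        data = b''.join(line for line in f if not line.startswith(b'#'))
    return csv.read_csv(
        io.BytesIO(data),
        read_options=csv.ReadOptions(column_names=['chr', 'start', 'end']),
        parse_options=csv.ParseOptions(delimiter='\t'))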

We've talked about exposing the Arrow C++ incremental builder classes
in Python or Cython -- I didn't find an existing JIRA issue about
this, so I created

https://issues.apache.org/jira/browse/ARROW-8189
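
In the meantime you can approximate incremental building at the Python
level by accumulating rows in fixed-size chunks and flushing each
chunk through a ParquetWriter, so the whole file never sits in Python
lists at once. A rough sketch (bed_to_parquet_chunked and the chunk
size are illustrative, not an established API):

import pyarrow as pa
from pyarrow.parquet import ParquetWriter

bed3_schema = pa.schema([('chr', pa.string()),
                         ('start', pa.int64()),
                         ('end', pa.int64())])

def bed_to_parquet_chunked(bed_path, parquet_path, chunk_size=100000):
    chrs, starts, ends = [], [], []

    def flush(writer):
        # Convert the buffered chunk to Arrow arrays, write it as a
        # row group, and release the Python-side buffers.
        table = pa.Table.from_arrays(
            [pa.array(chrs, pa.string()),
             pa.array(starts, pa.int64()),
             pa.array(ends, pa.int64())],
            schema=bed3_schema)
        writer.write_table(table)
        chrs.clear()
        starts.clear()
        ends.clear()

    with ParquetWriter(parquet_path, bed3_schema) as writer, \
            open(bed_path) as stream:
        for row in stream:
            if row.startswith('#'):
                continue
            chrom, start, end = row.rstrip('\n').split('\t')
            chrs.append(chrom)
            starts.append(int(start))
            ends.append(int(end))
            if len(chrs) >= chunk_size:
                flush(writer)
        if chrs:  # flush the final partial chunk
            flush(writer)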

Hope this helps
Wes

On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier
<jonathan.mercier@cnrgh.fr> wrote:
>
> Hello,
>
> I would like to parse a *.bed file into pyarrow.
>
> A BED file looks like this:
> #This is a comment
> chr1    10000   69091
> chr1    80608   106842
> chr3    70008   207666
> chr14   257666  297968
>
>
> So we can see it is a tab-separated text file with 3 columns. A line
> is a comment if it starts with a #.
>
>
> My way of handling such a file is not efficient, and I would like
> your insight on how to load such data.
>
> Currently I read the file line by line with Python's builtin open();
> if a line does not start with a #, I split it, convert each column to
> its expected type (i.e. str, int, ...), and append each value to its
> column. Finally I create a pyarrow table and write it to parquet.
>
>
>
> import pyarrow as pa
> from pyarrow.parquet import ParquetWriter, write_to_dataset
>
> bed3_schema = pa.schema([('chr', pa.string()),
>                          ('start', pa.int64()),
>                          ('end', pa.int64())])
> # Python types used to cast each cell before building the arrays
> bed3_column_type = [str, int, int]
>
>
> def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
>     columns = [[], [], []]
>     with open(bed_path) as stream:
>         for row in stream:
>             if not row.startswith('#'):  # skip comment lines
>                 cols = row.rstrip('\n').split('\t')
>                 for i, item in enumerate(cols):
>                     casted_value = bed3_column_type[i](item)
>                     columns[i].append(casted_value)
>     arrays = [pa.array(column) for column in columns]
>     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
>     if dataset:
>         # write_to_dataset is a module-level function in
>         # pyarrow.parquet, not a ParquetWriter method
>         write_to_dataset(table, dataset)
>     else:
>         with ParquetWriter(parquet_path, table.schema,
>                            use_dictionary=True, version='2.0') as writer:
>             writer.write_table(table)
>
