arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: How to load custom tabular text file to pyarrow ?
Date Tue, 24 Mar 2020 17:24:13 GMT
I suspect you can reap significant performance benefits without going
to the engineering lengths that we've gone for general purpose CSV
parsing.

On Tue, Mar 24, 2020 at 6:04 AM jonathan mercier
<jonathan.mercier@cnrgh.fr> wrote:
>
> Hi Wes
>
> Thanks for your quick answer. I took a look at the pyarrow CSV reader:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/reader.cc
> and
> https://github.com/apache/arrow/blob/master/python/pyarrow/_csv.pyx
>
> I have a lot of code to understand and write in order to expose a *.bed
> reader in Python.
>
> I will try to do my best
>
> Thanks
>
> Have a nice day
>
>
> On Mon, 23 Mar 2020 at 18:24 -0500, Wes McKinney wrote:
> > hi Jonathan -- generally my approach would be to write some Cython or
> > C/C++ code to create the file loader. Any time you are writing a file
> > loader that deals with individual table cells in pure Python it's
> > going to suffer from some performance problems.
> >
> > We've talked about exposing the Arrow C++ incremental builder classes
> > in Python or Cython -- I didn't find a JIRA issue about this but I
> > created
> >
> > https://issues.apache.org/jira/browse/ARROW-8189
> >
> > Hope this helps
> > Wes
> >
> > On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier
> > <jonathan.mercier@cnrgh.fr> wrote:
> > > Dear all,
> > >
> > > I would like to parse *.bed files into pyarrow.
> > >
> > > A Bed file look like this:
> > > #This is a comment
> > > chr1    10000   69091
> > > chr1    80608   106842
> > > chr3    70008   207666
> > > chr14   257666  297968
> > >
> > >
> > > So we can see it is a tab-separated text file with 3 columns. A line
> > > is a comment if it starts with a #.
> > >
> > >
> > > My way of handling such a file is not efficient, and I would like
> > > your insight on how to load this data.
> > >
> > > My way: I read the file line by line with Python's builtin open; if
> > > the line does not start with a #, I split it on tabs, convert each
> > > column to its expected type (i.e. str, int, …), and append each value
> > > to its column. Finally I create a pyarrow table and write it to
> > > parquet.
> > >
> > >
> > >
> > > import pyarrow as pa
> > > import pyarrow.parquet as pq
> > >
> > > # 'end' is int64 to match bed3_column_type below.
> > > bed3_schema = pa.schema([('chr', pa.string()),
> > >                          ('start', pa.int64()),
> > >                          ('end', pa.int64())])
> > > bed3_column_type = [str, int, int]
> > >
> > >
> > > def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
> > >     columns = [[], [], []]
> > >     with open(bed_path) as stream:
> > >         for row in stream:
> > >             if not row.startswith('#'):
> > >                 cols = row.rstrip('\n').split('\t')
> > >                 for i, item in enumerate(cols):
> > >                     casted_value = bed3_column_type[i](item)
> > >                     columns[i].append(casted_value)
> > >     arrays = [pa.array(column) for column in columns]
> > >     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
> > >     if dataset:
> > >         # write_to_dataset is a module-level function, not a
> > >         # ParquetWriter method.
> > >         pq.write_to_dataset(table, dataset)
> > >     else:
> > >         with pq.ParquetWriter(parquet_path, table.schema,
> > >                               use_dictionary=True,
> > >                               version='2.0') as writer:
> > >             writer.write_table(table)
> > >
>
