Subject: Re: How to load custom tabular text file to pyarrow?
From: jonathan mercier <jonathan.mercier@cnrgh.fr>
To: user@arrow.apache.org
Date: Tue, 24 Mar 2020 12:04:13 +0100

Hi Wes,

Thanks for your quick answer. I took a look at the pyarrow CSV reader:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/reader.cc
https://github.com/apache/arrow/blob/master/python/pyarrow/_csv.pyx

There is a lot of code to understand and write in order to expose a *.bed reader in Python, but I will try to do my best.

Thanks, have a nice day.

On Monday, 23 March 2020 at 18:24 -0500, Wes McKinney wrote:
> hi Jonathan -- generally my approach would be to write some Cython or
> C/C++ code to create the file loader. Any time a file loader deals with
> individual table cells in pure Python, it is going to suffer from
> performance problems.
>
> We've talked about exposing the Arrow C++ incremental builder classes
> in Python or Cython -- I didn't find a JIRA issue about this, so I
> created one:
>
> https://issues.apache.org/jira/browse/ARROW-8189
>
> Hope this helps
> Wes
>
> On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier wrote:
> > Dear all,
> >
> > I would like to parse *.bed files into pyarrow.
> >
> > A BED file looks like this:
> >
> > #This is a comment
> > chr1    10000   69091
> > chr1    80608   106842
> > chr3    70008   207666
> > chr14   257666  297968
> >
> > So it is a tab-separated text file with 3 columns, where a line is a
> > comment if it starts with a #.
> >
> > My way of handling such a file is not efficient, and I would like
> > your insight on how to load this data.
> >
> > I read the file line by line with Python's builtin open(). If a line
> > does not start with a #, I split it on tabs, convert each column to
> > its expected type (i.e. str, int, ...), and append each value to its
> > column. Finally I create a pyarrow table and write it to Parquet:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> > from pyarrow.parquet import ParquetWriter
> >
> > # 'end' is int64 to match the int cast below (it was float64 before,
> > # which made Table.from_arrays fail against the schema)
> > bed3_schema = pa.schema([('chr', pa.string()),
> >                          ('start', pa.int64()),
> >                          ('end', pa.int64())])
> > bed3_column_type = [str, int, int]
> >
> > def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
> >     columns = [[], [], []]
> >     with open(bed_path) as stream:
> >         for row in stream:
> >             if not row.startswith('#'):
> >                 cols = row.rstrip('\n').split('\t')
> >                 for i, item in enumerate(cols):
> >                     columns[i].append(bed3_column_type[i](item))
> >     arrays = [pa.array(column) for column in columns]
> >     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
> >     if dataset:
> >         # write_to_dataset is a module-level function, not a
> >         # ParquetWriter method
> >         pq.write_to_dataset(table, dataset)
> >     else:
> >         with ParquetWriter(parquet_path, table.schema,
> >                            use_dictionary=True,
> >                            version='2.0') as writer:
> >             writer.write_table(table)