Subject: Re: How to load custom tabular text file to pyarrow?
From: jonathan mercier <jonathan.mercier@cnrgh.fr>
To: user@arrow.apache.org
Date: Tue, 24 Mar 2020 12:04:13 +0100

Hi Wes,

Thanks for your quick answer. I took a look at the pyarrow CSV reader:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/reader.cc
https://github.com/apache/arrow/blob/master/python/pyarrow/_csv.pyx

There is a lot of code to understand and write in order to expose a *.bed reader in Python, but I will try to do my best.

Thanks, have a nice day.

On Monday, 23 March 2020 at 18:24 -0500, Wes McKinney wrote:
> hi Jonathan -- generally my approach would be to write some Cython or
> C/C++ code to create the file loader. Any time a file loader deals with
> individual table cells in pure Python, it is going to suffer from
> performance problems.
>
> We've talked about exposing the Arrow C++ incremental builder classes
> in Python or Cython -- I didn't find a JIRA issue about this, so I
> created one:
>
> https://issues.apache.org/jira/browse/ARROW-8189
>
> Hope this helps
> Wes
>
> On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier wrote:
> > Dear all,
> >
> > I would like to parse *.bed files into pyarrow.
> >
> > A BED file looks like this:
> >
> > #This is a comment
> > chr1    10000   69091
> > chr1    80608   106842
> > chr3    70008   207666
> > chr14   257666  297968
> >
> > So it is a tab-separated text file with 3 columns, where a line is a
> > comment if it starts with a #.
> >
> > My way of handling such a file is not efficient, and I would like
> > your insight on how to load this data.
> >
> > I read the file line by line with Python's builtin open(). If a line
> > does not start with a #, I split it on tabs, convert each column to
> > its expected type (i.e. str, int, ...), and append each value to its
> > column. Finally I create a pyarrow table and write it to Parquet:
> >
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> > from pyarrow.parquet import ParquetWriter
> >
> > # 'end' is int64 to match the int cast below (it was float64 before,
> > # which made Table.from_arrays fail against the schema)
> > bed3_schema = pa.schema([('chr', pa.string()),
> >                          ('start', pa.int64()),
> >                          ('end', pa.int64())])
> > bed3_column_type = [str, int, int]
> >
> > def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
> >     columns = [[], [], []]
> >     with open(bed_path) as stream:
> >         for row in stream:
> >             if not row.startswith('#'):
> >                 cols = row.rstrip('\n').split('\t')
> >                 for i, item in enumerate(cols):
> >                     columns[i].append(bed3_column_type[i](item))
> >     arrays = [pa.array(column) for column in columns]
> >     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
> >     if dataset:
> >         # write_to_dataset is a module-level function, not a
> >         # ParquetWriter method
> >         pq.write_to_dataset(table, dataset)
> >     else:
> >         with ParquetWriter(parquet_path, table.schema,
> >                            use_dictionary=True,
> >                            version='2.0') as writer:
> >             writer.write_table(table)