From: Wes McKinney
Date: Mon, 23 Mar 2020 18:24:57 -0500
Subject: Re: How to load custom
 tabular text file to pyarrow ?
To: user@arrow.apache.org

hi Jonathan -- generally my approach would be to write some Cython or
C/C++ code to create the file loader. Any time you are writing a file
loader that deals with individual table cells in pure Python it's
going to suffer from some performance problems.

We've talked about exposing the Arrow C++ incremental builder classes
in Python or Cython -- I didn't find a JIRA issue about this but I
created

https://issues.apache.org/jira/browse/ARROW-8189

Hope this helps
Wes

On Mon, Mar 23, 2020 at 3:10 PM jonathan mercier wrote:
>
> Dear,
>
> I would like to parse a *.bed file into pyarrow.
>
> A BED file looks like this:
> #This is a comment
> chr1 10000 69091
> chr1 80608 106842
> chr3 70008 207666
> chr14 257666 297968
>
>
> So we can see it is a tab-separated text file with 3 columns. A line
> is a comment if it starts with a #.
>
>
> My way of handling such a file is not efficient, and I would like your
> insight on how to load such data.
>
> My way: I read the file line by line with Python's builtin open; if a
> line does not start with a #, I split the line, convert each column to
> the expected column type (i.e. str, int …), and append each value to
> its column. Finally I create a pyarrow table and write it to parquet.
>
>
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> bed3_schema = pa.schema([('chr', pa.string()),
>                          ('start', pa.int64()),
>                          ('end', pa.int64())])  # BED coordinates are integers
> bed3_column_type = [str, int, int]
>
>
> def bed_to_parquet(bed_path: str, parquet_path: str, dataset=None):
>     columns = [[], [], []]
>     with open(bed_path) as stream:
>         for row in stream:
>             # Skip blank lines and comment lines.
>             if row.strip() and not row.startswith('#'):
>                 cols = row.rstrip('\n').split('\t')
>                 for i, item in enumerate(cols):
>                     casted_value = bed3_column_type[i](item)
>                     columns[i].append(casted_value)
>     arrays = [pa.array(column) for column in columns]
>     table = pa.Table.from_arrays(arrays, schema=bed3_schema)
>     if dataset:
>         # write_to_dataset is a function of pyarrow.parquet.
>         pq.write_to_dataset(table, root_path=dataset)
>     else:
>         with pq.ParquetWriter(parquet_path, table.schema,
>                               use_dictionary=True, version='2.0') as writer:
>             writer.write_table(table)
>
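
As a possible stopgap until something like ARROW-8189 exists, here is a minimal sketch of a different technique that nobody in this thread proposed. Python only filters out comment lines, one whole line at a time, and pyarrow's existing C++ CSV reader (pyarrow.csv) does the per-cell parsing. The helper names strip_bed_comments and read_bed3 are hypothetical. The sketch assumes a strictly 3-column, tab-separated file and that the filtered file fits in memory.

```python
import io
from typing import Iterable


def strip_bed_comments(lines: Iterable[bytes]) -> bytes:
    """Drop '#' comment lines and blank lines; return the rest as one buffer."""
    kept = [line for line in lines
            if line.strip() and not line.startswith(b"#")]
    return b"".join(kept)


def read_bed3(bed_path: str):
    """Read a 3-column BED file into a pyarrow Table (hypothetical helper)."""
    # Imported here so strip_bed_comments stays usable without pyarrow.
    import pyarrow as pa
    from pyarrow import csv

    # Python touches whole lines only; no per-cell work happens here.
    with open(bed_path, "rb") as stream:
        data = strip_bed_comments(stream)

    # The file has no header row, so column names are supplied explicitly.
    return csv.read_csv(
        io.BytesIO(data),
        read_options=csv.ReadOptions(column_names=["chr", "start", "end"]),
        parse_options=csv.ParseOptions(delimiter="\t"),
        convert_options=csv.ConvertOptions(
            column_types={"chr": pa.string(),
                          "start": pa.int64(),
                          "end": pa.int64()}
        ),
    )
```

The resulting table could then be passed to pyarrow.parquet.write_table. Whether this is fast enough for large BED files is untested here; a Cython or C++ loader, as suggested above, may still be the better fit.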