From: Wes McKinney
Date: Wed, 4 Nov 2020 17:09:37 -0600
Subject: Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?
To: user@arrow.apache.org

Seems a bit buggy, can you open a Jira issue? Thanks

On Wed, Nov 4, 2020 at 5:05 PM Jason Sachs wrote:
>
> It looks like pyarrow.Table.from_pydict() cuts off binary data after an embedded 00 byte. Is this a known bug?
>
> (py3) C:\>python
> Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
> ...                  b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> >>> t = pa.Table.from_pydict({'data':data})
> >>> t.to_pandas()
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> >>> import pandas as pd
> >>> pd.DataFrame(data)
>                   0
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5      b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> >>>
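
For readers following the report, the two outputs above are consistent with two different ways of reading a fixed-width, NUL-padded bytes field such as NumPy's '|S13'. The sketch below uses only the Python standard library and hypothetical helper names (pad, read_c_string, read_strip_padding). It does not show pyarrow's actual conversion code, which the thread does not describe; it only shows that stopping at the first NUL reproduces the truncated column, while stripping trailing padding reproduces the pandas column.

```python
# Hypothetical illustration: these helpers are not part of NumPy or pyarrow.
WIDTH = 13  # field width, matching dtype='|S13' in the report


def pad(value: bytes, width: int = WIDTH) -> bytes:
    """Store a value as a fixed-width field, right-padded with NUL bytes."""
    if len(value) > width:
        raise ValueError("value does not fit in the field")
    return value.ljust(width, b"\x00")


def read_c_string(field: bytes) -> bytes:
    """Stop at the first NUL byte, as a C-string style reader would."""
    return field.split(b"\x00", 1)[0]


def read_strip_padding(field: bytes) -> bytes:
    """Drop only the trailing NUL padding and keep embedded NUL bytes."""
    return field.rstrip(b"\x00")


# Same values as in the report.
values = [b"", b"", b"", b"Foo!!", b"Bar!!", b"\x00Baz!", b"half\x00baked", b""]
fields = [pad(v) for v in values]

truncated = [read_c_string(f) for f in fields]       # matches the t.to_pandas() column
preserved = [read_strip_padding(f) for f in fields]  # matches the pd.DataFrame(data) column

for original, cut, kept in zip(values, truncated, preserved):
    print(f"{original!r:>18}  first-NUL: {cut!r:<10}  strip-padding: {kept!r}")
```

Only rows 5 and 6 differ between the two readers, which are exactly the rows that differ in the report.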