From: Robin Kåveland Hansen <kaaveland@gmail.com>
Date: Fri, 1 May 2020 04:49:28 -0700
Subject: [Python] Accessing Azure Blob storage using arrow
To: user@arrow.apache.org

Hi!

Hadoop has built-in support for several so-called HDFS-compatible file
systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
and Azure Data Lake gen2. Using these with hdfs commands requires a
little bit of setup in core-site.xml, one of the simplest possible
examples being:

<property>
=C2=A0 <value>YOUR ACCESS KEY&l= t;/value>
</property>

At that point, you ca= n issue commands like:



I currently use Spark to access a bunch of Azure storage accounts, so I
already have core-site.xml set up, and thought to leverage
pyarrow.fs.HadoopFileSystem to interact directly with these file
systems instead of having to put things on local storage first. I'm
working with Hive-partitioned datasets, so there's an annoying amount
of "double work" in downloading only the necessary partitions.

Creating a pyarrow.fs.Hadoo= pFileSystem works fine, but it fails with an
exception like:

IllegalArgumentException: Wrong FS: wasbs://..., expected:
hdfs://localhost:port

whenever given one of the configured = paths that aren't fs.defaultFS.
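
Roughly what I'm running, with placeholder host, container and path:

from pyarrow import fs

# Connecting to the cluster's default (hdfs://) file system works fine:
hdfs = fs.HadoopFileSystem("localhost", port=8020)

# ...but any wasbs:// URI configured in core-site.xml trips the
# "Wrong FS" check, even though hdfs dfs -ls accepts the same URI:
f = hdfs.open_input_file(
    "wasbs://containername@youraccount.blob.core.windows.net/dataset/date=2020-04-30/part-00000.parquet")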

Is there any way of making this work? Looks lik= e this validation is
happening on the java s= ide of the connection, so maybe there's nothing
that can be done in arrow?=C2=A0

The other option I checked out was to extend pyarrow.fs.FileSystem with
a class built on the Azure Storage SDK, but after reading the pyarrow
code, that seems non-trivial, since it's being passed back to C++ under
the hood. I'm also seeing some type checking that seems to indicate
that you're not supposed to extend this API.
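
For reference, a standalone wrapper over the v12 azure-storage-blob SDK
is easy enough to sketch (all names below are illustrative), but
nothing like it plugs into pyarrow, which is exactly the problem:

from azure.storage.blob import BlobServiceClient

class AzureBlobReader:
    """Toy reader; pyarrow has no hook that would accept this class."""

    def __init__(self, account_url, credential):
        self._client = BlobServiceClient(account_url, credential=credential)

    def ls(self, container, prefix):
        # List blob names under a prefix, e.g. one hive partition.
        cc = self._client.get_container_client(container)
        return [b.name for b in cc.list_blobs(name_starts_with=prefix)]

    def read_bytes(self, container, name):
        # Fetch a whole blob into memory.
        bc = self._client.get_blob_client(container, name)
        return bc.download_blob().readall()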

That leaves the option of doing th= is in C++ using some SDK like
lot more involved for me than I was hoping for when I started tumbling
down this particular rabbithole.

--=C2=A0
Kind regards,
Robin K=C3=A5= veland

--00000000000023ede005a494c667--