arrow-user mailing list archives

From Wes McKinney <>
Subject Re: [Python] Accessing Azure Blob storage using arrow
Date Wed, 06 May 2020 14:21:11 GMT
I just commented about this in

Our preferred path forward would almost certainly be to build a C++
implementation of the arrow::fs::FileSystem interface that
handles Azure; that would then be straightforward to hook up
with the Datasets API.
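To make the idea concrete, here is a minimal sketch of the filesystem-interface pattern described above. The class and method names (`FileSystem`, `get_file_info`, `open_input_file`, `InMemoryFileSystem`) are simplified stand-ins echoing the shape of `arrow::fs::FileSystem`, not Arrow's actual C++ or Python API; the in-memory backend is a toy standing where an Azure implementation would call Blob Storage.

```python
# Illustrative sketch only: simplified stand-ins for the shape of
# arrow::fs::FileSystem, not Arrow's actual API.
import abc
import io


class FileSystem(abc.ABC):
    """Minimal filesystem interface in the spirit of arrow::fs::FileSystem."""

    @abc.abstractmethod
    def get_file_info(self, path: str) -> dict:
        """Return basic metadata (type, size) for a path."""

    @abc.abstractmethod
    def open_input_file(self, path: str) -> io.BufferedIOBase:
        """Open a readable, seekable stream for a path."""


class InMemoryFileSystem(FileSystem):
    """Toy backend; an Azure implementation would talk to Blob Storage
    in these same two places."""

    def __init__(self):
        self._blobs = {}

    def put(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def get_file_info(self, path: str) -> dict:
        if path not in self._blobs:
            return {"path": path, "type": "NotFound"}
        return {"path": path, "type": "File", "size": len(self._blobs[path])}

    def open_input_file(self, path: str) -> io.BufferedIOBase:
        return io.BytesIO(self._blobs[path])


fs = InMemoryFileSystem()
fs.put("dataset/part-0.csv", b"a,b\n1,2\n")
info = fs.get_file_info("dataset/part-0.csv")
```

Because dataset discovery only needs listing, metadata, and stream-open operations, any backend implementing this small surface can plug into the same scanning code, which is why a C++ Azure filesystem would compose with the Datasets API with no dataset-specific work.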

On Wed, May 6, 2020 at 2:58 AM Robin Kåveland Hansen
<> wrote:
> Hi,
> You're right, I want dataset functionality. I'm able to read individual
> files into memory and pass them to arrow just fine, like the example
> from the documentation.
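The "read one file into memory, then hand it to a parser" pattern mentioned above can be sketched as follows. `download_blob` here is a hypothetical stand-in for a real client call (for example, azure-storage-blob's `BlobClient.download_blob().readall()`), and the parsing step uses the stdlib csv module purely for illustration.

```python
# Sketch of the per-file download-then-parse pattern. download_blob() is a
# hypothetical stand-in for a real Azure client call, not a real API.
import csv
import io


def download_blob(container: str, name: str) -> bytes:
    # Stand-in: a real implementation would perform an HTTP fetch here.
    fake_store = {("data", "part-0.csv"): b"a,b\n1,2\n3,4\n"}
    return fake_store[(container, name)]


raw = download_blob("data", "part-0.csv")
rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
# One round-trip per file: for a Hive-partitioned dataset this means
# manually enumerating and downloading every partition you need.
```

This works file by file, but it is exactly the "double work" described later in the thread: nothing here understands partition layout, so dataset-level features need a filesystem the Datasets API can traverse itself.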
> On 3 May 2020 at 00:12:48, Micah Kornfield wrote:
> Hi Robin,
> I'm not an expert in this area and there has been a lot of change since I
> looked into this, but there was an old PR that attempted a Python
> implementation [1]; as you noted, it was closed in favor of targeting a C++
> implementation. It sounds like you may want more dataset-like functionality,
> but does the example given for reading from Azure in the documentation work
> for you [2]? I think there are similar APIs for parsing other file formats.
> Hope this helps.
> -Micah
> [1]
> [2]
> On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen <> wrote:
>> Hi!
>> Hadoop has built-in support for several so-called HDFS-compatible file
>> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage,
>> and Azure Data Lake Storage gen2. Using these with hdfs commands requires
>> a little bit of setup in core-site.xml, one of the simplest possible
>> examples being:
>> <property>
>>   <name></name>
>>   <value>YOUR ACCESS KEY</value>
>> </property>
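For context, a fuller version of that core-site.xml fragment might look like the sketch below. The property name shown follows the hadoop-azure (WASB) naming convention; `YOURACCOUNT` is a placeholder for the storage account name, and the exact key to use is an assumption here, not something stated in the thread.

```xml
<!-- Hedged example: property name follows the hadoop-azure (WASB)
     convention; YOURACCOUNT is a placeholder storage account name. -->
<configuration>
  <property>
    <name>fs.azure.account.key.YOURACCOUNT.blob.core.windows.net</name>
    <value>YOUR ACCESS KEY</value>
  </property>
</configuration>
```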
>> At that point, you can issue commands like:
>> hdfs dfs -ls wasbs://
>> I currently use Spark to access a number of Azure storage accounts, so I
>> already have core-site.xml set up and thought to leverage
>> pyarrow.fs.HadoopFileSystem to interact directly with these
>> file systems instead of having to put things on local storage first. I'm
>> working with Hive-partitioned datasets, so there's an annoying amount of
>> "double work" in downloading only the necessary partitions.
>> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
>> exception like:
>> IllegalArgumentException: Wrong FS: wasbs://..., expected:
>> hdfs://localhost:port
>> whenever it is given one of the configured paths that isn't under
>> fs.defaultFS. Is there any way of making this work? It looks like this
>> validation happens on the Java side of the connection, so maybe there's
>> nothing that can be done in arrow?
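The "Wrong FS" error comes from Hadoop's Java client checking a path's URI scheme and authority against the filesystem it was constructed for (fs.defaultFS). A rough stdlib sketch of that validation — names here are illustrative, not Hadoop's actual Java code:

```python
# Rough sketch of the scheme/authority check behind Hadoop's "Wrong FS"
# error; names are illustrative, not Hadoop's actual implementation.
from urllib.parse import urlparse


class WrongFSError(ValueError):
    pass


def check_path(default_fs: str, path: str) -> None:
    expected = urlparse(default_fs)
    actual = urlparse(path)
    if (actual.scheme, actual.netloc) != (expected.scheme, expected.netloc):
        raise WrongFSError(
            f"Wrong FS: {path}, expected: {expected.scheme}://{expected.netloc}"
        )


check_path("hdfs://localhost:8020", "hdfs://localhost:8020/data/x")  # passes
# A wasbs:// path against the same client would raise WrongFSError.
```

Because the client is bound to a single filesystem at construction time, a wasbs:// path can never satisfy an HDFS-bound connection; any fix has to happen when the filesystem is created, not per call.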
>> The other option I checked out was to extend pyarrow.fs.FileSystem and
>> write a class built on the Azure Storage SDK, but after reading the
>> pyarrow code, that seems non-trivial, since the object is passed back to
>> C++ under the hood. I'm also seeing some type checking that seems to
>> indicate that you're not supposed to extend this API.
>> That leaves the option of doing this in C++ using some SDK, which is
>> unfortunately a lot more involved than I was hoping for when I started
>> down this particular rabbit hole.
>> --
>> Kind regards,
>> Robin Kåveland
> --
> Kind regards,
> Robin Kåveland
