arrow-user mailing list archives

From Micah Kornfield
Subject Re: [Python] Accessing Azure Blob storage using arrow
Date Sat, 02 May 2020 22:12:26 GMT
Hi Robin,
I'm not an expert in this area, and a lot has changed since I last looked
into it, but there was an old PR that attempted a Python implementation
[1]. As you noted, it was closed in favor of targeting a C++
implementation. It sounds like you may want more dataset-like
functionality, but does the example for reading from Azure in the
documentation [2] work for you? I think there are similar APIs for
parsing other file types.

Hope this helps.



On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen wrote:

> Hi!
> Hadoop has built-in support for several so-called HDFS-compatible file
> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
> and Azure Data Lake gen2. Using these with hdfs commands requires a
> little bit of setup in core-site.xml. One of the simplest possible
> examples is:
> <property>
>   <name></name>
>   <value>YOUR ACCESS KEY</value>
> </property>
> At that point, you can issue commands like:
> hdfs dfs -ls wasbs://
> I currently use Spark to access a bunch of Azure storage accounts, so I
> already have core-site.xml set up and thought I could leverage
> pyarrow.fs.HadoopFileSystem to interact directly with these file
> systems instead of having to put things on local storage first. I'm
> working with Hive-partitioned datasets, so there's an annoying amount of
> "double work" in downloading only the necessary partitions.
> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
> exception like:
> IllegalArgumentException: Wrong FS: wasbs://..., expected:
> hdfs://localhost:port
> whenever given one of the configured paths that aren't fs.defaultFS.
> Is there any way of making this work? It looks like this validation is
> happening on the Java side of the connection, so maybe there's nothing
> that can be done in Arrow?
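
To make the error above concrete, here is a rough Python sketch of the kind
of check that produces a "Wrong FS" message. It is only an illustration of
the idea (compare the scheme and authority of the path with those of the
filesystem it was handed to), not Hadoop's actual code. The function name
check_path, the port 8020 and the container/account names are all made up
for the example.

```python
from urllib.parse import urlparse


def check_path(fs_uri: str, path: str) -> None:
    """Raise ValueError if `path` names a different filesystem than `fs_uri`.

    Simplified illustration only; the real validation happens in Java.
    """
    fs = urlparse(fs_uri)
    target = urlparse(path)

    # A path without a scheme is resolved against the filesystem itself.
    if not target.scheme:
        return

    same_scheme = target.scheme.lower() == fs.scheme.lower()
    # A missing authority is accepted; otherwise it has to match.
    same_authority = (
        not target.netloc or target.netloc.lower() == fs.netloc.lower()
    )
    if not (same_scheme and same_authority):
        raise ValueError(f"Wrong FS: {path}, expected: {fs_uri}")


if __name__ == "__main__":
    default_fs = "hdfs://localhost:8020"  # hypothetical fs.defaultFS
    check_path(default_fs, "/data/part-0.parquet")  # accepted
    try:
        # Hypothetical container and account names.
        check_path(default_fs, "wasbs://container@account.blob.core.windows.net/data")
    except ValueError as err:
        print(err)
```

Under this reading, a filesystem object bound to fs.defaultFS rejects any
path with a different scheme, which matches the behaviour described above.
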
> The other option I checked out was to extend pyarrow.fs.FileSystem to
> write a class built on the Azure Storage SDK, but after reading the
> pyarrow code, that seems non-trivial, since it's being passed back to
> C++ under the hood. I'm also seeing some typechecking that seems to
> indicate that you're not supposed to extend this API.
> That leaves the option of doing this in C++ using some SDK like
> which is unfortunately a
> lot more involved for me than I was hoping for when I started tumbling
> down this particular rabbit hole.
> --
> Kind regards,
> Robin Kåveland
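
For the "double work" with Hive-partitioned datasets mentioned in the quoted
message, one possible stopgap is to filter the list of blob names by their
key=value directory segments before downloading anything. The sketch below
uses only the standard library and assumes you already have the blob names
as strings from some listing call. The helper names parse_partitions and
select_blobs and the example paths are hypothetical, not part of pyarrow or
the Azure SDK.

```python
from typing import Dict, Iterable, List


def parse_partitions(blob_name: str) -> Dict[str, str]:
    """Extract hive-style key=value directory segments from a blob name."""
    parts: Dict[str, str] = {}
    for segment in blob_name.split("/")[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            parts[key] = value
    return parts


def select_blobs(blob_names: Iterable[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only blobs whose partition values match every entry in `wanted`."""
    selected = []
    for name in blob_names:
        parts = parse_partitions(name)
        if all(parts.get(key) == value for key, value in wanted.items()):
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Hypothetical blob names, as a listing call might return them.
    names = [
        "events/year=2020/month=04/part-0.parquet",
        "events/year=2020/month=05/part-0.parquet",
        "events/year=2019/month=05/part-0.parquet",
    ]
    for name in select_blobs(names, {"year": "2020", "month": "05"}):
        print(name)
```

Only the selected blobs would then need to be downloaded and read. This does
not replace proper filesystem support in Arrow; it only narrows what has to
be copied to local storage first.
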
