From Robin Kåveland Hansen <kaavel...@gmail.com>
Subject Re: [Python] Accessing Azure Blob storage using arrow
Date Wed, 06 May 2020 07:58:45 GMT
Hi,

You're right, I want dataset functionality. I'm able to read individual
files into memory and pass them to arrow just fine, like the example
from the documentation.
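
Concretely, this is roughly what I'd like to be able to write. It's only a
sketch: the wasbs:// URI and the partition column are placeholders, and
pointing pyarrow.dataset at a wasbs path doesn't work today, which is the
point of my question.

import pyarrow.dataset as ds

# Hive-partitioned layout, e.g. .../events/date=2020-05-01/part-0.parquet
dataset = ds.dataset(
    "wasbs://containername@youraccount.blob.core.windows.net/events",
    format="parquet",
    partitioning="hive",
)

# Read only the partitions I actually need, instead of downloading them first
table = dataset.to_table(filter=ds.field("date") == "2020-05-01")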

On 3 May 2020 at 00:12:48, Micah Kornfield (emkornfield@gmail.com) wrote:

Hi Robin,
I'm not an expert in this area and there has been a lot of change since I
looked into this, but there was an old PR that looked to add a Python
implementation [1]; as you noted, it was closed in favor of targeting a
C++ implementation.  It sounds like you may want more dataset-like
functionality, but does the example given for reading from Azure in
the documentation work for you [2]?  I think there are similar APIs for
parsing other file types.
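
For what it's worth, the general shape of that documented example is
something like the sketch below. This uses the azure-storage-blob package;
the account URL, key, container and blob names are placeholders, and the
exact client used in the docs may differ from this.

import io

import pyarrow.parquet as pq
from azure.storage.blob import BlobServiceClient

# Placeholders only: substitute your own account URL, access key, container and blob
service = BlobServiceClient(
    account_url="https://youraccount.blob.core.windows.net",
    credential="YOUR ACCESS KEY",
)
blob = service.get_blob_client(container="containername", blob="data/part-0.parquet")

# Download the blob into memory and hand the buffer to pyarrow
buf = io.BytesIO(blob.download_blob().readall())
table = pq.read_table(buf)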

Hope this helps.

-Micah

[1] https://github.com/apache/arrow/pull/4121
[2]
https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage


On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen <kaaveland@gmail.com>
wrote:

> Hi!
>
> Hadoop has built-in support for several so-called HDFS-compatible file
> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
> and Azure Data Lake Storage gen2. Using these with hdfs commands requires
> a little bit of setup in core-site.xml, one of the simplest possible
> examples being:
>
> <property>
>   <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
>   <value>YOUR ACCESS KEY</value>
> </property>
>
> At that point, you can issue commands like:
>
> hdfs dfs -ls wasbs://containername@youraccount.blob.core.windows.net
>
> I currently use Spark to access a bunch of Azure storage accounts, so I
> already have core-site.xml set up, and I thought I could leverage
> pyarrow.fs.HadoopFileSystem to interact directly with these file systems
> instead of having to put things on local storage first. I'm working with
> hive-partitioned datasets, so there's an annoying amount of "double work"
> in downloading only the partitions I actually need.
>
> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
> exception like:
>
> IllegalArgumentException: Wrong FS: wasbs://..., expected:
> hdfs://localhost:port
>
> whenever given one of the configured paths that aren't fs.defaultFS.
>
> Is there any way of making this work? It looks like this validation is
> happening on the Java side of the connection, so maybe there's nothing
> that can be done in arrow?
>
> The other option I checked out was to extend pyarrow.fs.FileSystem to
> write a class built on the Azure Storage SDK, but after reading the
> pyarrow code, that seems non-trivial, since it's being passed back to
> C++ under the hood. I'm also seeing some typechecking that seems to
> indicate that you're not supposed to extend this API.
>
> That leaves the option of doing this in C++ using some SDK like
> https://github.com/Azure/azure-storage-cpplite, which is unfortunately a
> lot more involved than I was hoping for when I started tumbling down
> this particular rabbit hole.
>
> --
> Kind regards,
> Robin Kåveland
>
--
Kind regards,
Robin Kåveland
