arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [Python] Accessing Azure Blob storage using arrow
Date Wed, 06 May 2020 14:21:11 GMT
I just commented about this in

https://issues.apache.org/jira/browse/ARROW-2034

Our preferred path forward would almost certainly be to build a C++
implementation of the arrow::fs::FileSystem interface that targets
Azure; that would then be straightforward to hook up with the
Datasets API.
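
For reference, here is a rough Python sketch of what "hook it up with
the Datasets API" would look like once such a filesystem exists. The
AzureBlobFileSystem name mentioned in the comment is purely
hypothetical; the local filesystem stands in for it here:

    import pyarrow.dataset as ds
    from pyarrow.fs import LocalFileSystem

    # A hypothetical AzureBlobFileSystem() would be used in place of this.
    fs = LocalFileSystem()

    dataset = ds.dataset(
        "path/to/hive_partitioned_dataset",   # container/prefix on Azure
        filesystem=fs,
        format="parquet",
        partitioning="hive",
    )
    # Only the files in the matching partitions need to be read.
    table = dataset.to_table(filter=ds.field("year") == 2020)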

On Wed, May 6, 2020 at 2:58 AM Robin Kåveland Hansen
<kaaveland@gmail.com> wrote:
>
> Hi,
>
> You're right: I want dataset functionality. I'm able to read individual
> files into memory and pass them to Arrow just fine, like the example
> from the documentation.
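
(For concreteness, that in-memory pattern looks roughly like the sketch
below, assuming the azure-storage-blob v12 SDK; the account, container
and blob names are placeholders.)

    import io

    import pyarrow.parquet as pq
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url="https://youraccount.blob.core.windows.net",
        credential="YOUR ACCESS KEY",
    )
    blob = service.get_blob_client(
        container="containername", blob="data/part-00000.parquet"
    )

    # Download the whole blob into memory, then hand the bytes to Arrow.
    buf = io.BytesIO(blob.download_blob().readall())
    table = pq.read_table(buf)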
>
> On 3 May 2020 at 00:12:48, Micah Kornfield (emkornfield@gmail.com) wrote:
>
> Hi Robin,
> I'm not an expert in this area and there has been a lot of change since
> I looked into this, but there was an old PR with a Python implementation
> [1]; as you noted, it was closed in favor of targeting a C++
> implementation instead. It sounds like you may want more dataset-like
> functionality, but does the example for reading from Azure in the
> documentation work for you [2]? I think there are similar APIs for
> parsing other file types.
>
> Hope this helps.
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/4121
> [2] https://arrow.apache.org/docs/python/parquet.html#reading-a-parquet-file-from-azure-blob-storage
>
> On Fri, May 1, 2020 at 4:49 AM Robin Kåveland Hansen <kaaveland@gmail.com> wrote:
>>
>> Hi!
>>
>> Hadoop has built-in support for several so-called HDFS-compatible file
>> systems, including AWS S3, Azure Blob Storage, Azure Data Lake Storage
>> and Azure Data Lake Storage Gen2. Using these with hdfs commands
>> requires a little bit of setup in core-site.xml; one of the simplest
>> possible examples is:
>>
>> <property>
>>   <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
>>   <value>YOUR ACCESS KEY</value>
>> </property>
>>
>> At that point, you can issue commands like:
>>
>> hdfs dfs -ls wasbs://containername@youraccount.blob.core.windows.net
>>
>> I currently use Spark to access a bunch of Azure storage accounts, so I
>> already have core-site.xml set up, and I thought I could leverage
>> pyarrow.fs.HadoopFileSystem to interact directly with these file systems
>> instead of having to put things on local storage first. I'm working with
>> Hive-partitioned datasets, so there's an annoying amount of "double
>> work" in downloading only the necessary partitions.
>>
>> Creating a pyarrow.fs.HadoopFileSystem works fine, but it fails with an
>> exception like:
>>
>> IllegalArgumentException: Wrong FS: wasbs://..., expected:
>> hdfs://localhost:port
>>
>> whenever it is given a path on one of the configured file systems other
>> than fs.defaultFS.
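
(The failing call looks roughly like the sketch below; constructor
arguments vary between pyarrow versions, and "default" here means "use
fs.defaultFS from core-site.xml".)

    from pyarrow.fs import FileSelector, HadoopFileSystem

    fs = HadoopFileSystem("default")  # connecting works fine
    fs.get_file_info(FileSelector(
        "wasbs://containername@youraccount.blob.core.windows.net/dataset"
    ))
    # Raises: IllegalArgumentException: Wrong FS: wasbs://...,
    #         expected: hdfs://localhost:port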
>>
>> Is there any way of making this work? It looks like this validation is
>> happening on the Java side of the connection, so maybe there's nothing
>> that can be done in Arrow?
>>
>> The other option I checked out was extending pyarrow.fs.FileSystem with
>> a class built on the Azure Storage SDK, but after reading the pyarrow
>> code, that seems non-trivial, since the filesystem object gets passed
>> back to C++ under the hood. I'm also seeing some type checking that
>> seems to indicate you're not supposed to extend this API.
>>
>> That leaves the option of doing this in C++ using an SDK like
>> https://github.com/Azure/azure-storage-cpplite, which is unfortunately a
>> lot more involved than I was hoping for when I started tumbling down
>> this particular rabbit hole.
>>
>> --
>> Kind regards,
>> Robin Kåveland
>>
> --
> Kind regards,
> Robin Kåveland
>
