arrow-user mailing list archives

From Игорь Кравченко <awpkraban...@gmail.com>
Subject PyArrow connect to Azure Data Lake Gen2
Date Wed, 17 Jul 2019 09:09:59 GMT
Hello,
Yesterday I opened an issue on GitHub and was advised to ask this question
here. Link to the issue:
https://github.com/apache/arrow/issues/4888
I have a Storage Account in Azure and a virtual machine from which I want
to connect to the Data Lake, and I am trying to do that with PyArrow. As I
wrote in the issue, I tried to access it using two different drivers,
"libhdfs" and "libhdfs3", and I constantly get the same error: a timeout.
One of the options for authorizing against the Storage Account is Shared
Key, and when I used the "hdfs" terminal command it worked just fine. The
command I used:

hdfs dfs -get abfss://lotosanalysis@fordatalakeaccount.dfs.core.windows.net/new-direc /home/adminello/files/

ABFSS is a special driver created to access ADLS Gen2, which is what I
currently have
(https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver),
and hdfs works with this driver only from version 3.2.0. However, the
folder hadoop-3.2.0/lib/native did not contain the "libhdfs" file, which
really surprised me, because it was present in hadoop-3.1.2, although that
version doesn't work with abfss. These are just details, in case you need
them.
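
For reference, the abfss URI in the command above has the general shape
abfss://<file system>@<account>.dfs.core.windows.net/<path>. The short
sketch below only illustrates how that URI breaks down into its parts,
using Python's standard library. The helper name split_abfss_uri is
hypothetical and made up for this illustration. It does not connect to
anything and is not a PyArrow API.

```python
from urllib.parse import urlsplit


def split_abfss_uri(uri):
    """Split an abfs:// or abfss:// URI into its components.

    Hypothetical helper for illustration only; it performs no network access.
    """
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    # In ABFS URIs the part before '@' is the file system (container) name,
    # not a user name, even though urlsplit exposes it as `username`.
    return {
        "file_system": parts.username,
        "account_host": parts.hostname,
        "path": parts.path,
        "tls": parts.scheme == "abfss",  # abfss is the TLS variant of abfs
    }


info = split_abfss_uri(
    "abfss://lotosanalysis@fordatalakeaccount.dfs.core.windows.net/new-direc"
)
print(info)
```

Seen this way, "lotosanalysis" is the file system name and
"fordatalakeaccount.dfs.core.windows.net" is the account host, which may
help when deciding what to pass as a hostname in other tools.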

But my company wants to access it through Python for further data
analysis, which is relatively easy to do with Pandas, so at that point I
started trying to connect via Python.
I didn't find any example on Google, so I am not sure whether my script is
correct. I tried lots of combinations: changing ports, replacing the file
system name, and adding "https" and "abfss" at the beginning of the
hostname. At this point I am stuck. Could somebody help me, please?
Thanks
