arrow-user mailing list archives

From "Uwe L. Korn"
Subject Re: PyArrow connect to Azure Data Lake Gen2
Date Wed, 17 Jul 2019 11:55:11 GMT

you don't need to go through HDFS or Java to access ADLS Gen 2. It is simply an improved API
for Azure Storage Blob, so you can use the blob APIs to access the relevant containers.
I've previously used them together with `pyarrow.parquet` to read Parquet files reliably from there.


On Wed, Jul 17, 2019, at 11:10 AM, Игорь Кравченко wrote:
> Hello,
> Yesterday I opened an issue on GitHub and received advice to ask this question here; link to the issue:
> Generally, I have a Storage Account in Azure and a virtual machine from which I want to connect to the Data Lake, and I am trying to do that with PyArrow. As I wrote, I tried to access it using different drivers, "libhdfs" and "libhdfs3", and I constantly get the same error: Timeout. One of the authorization options for Storage is a Shared Key, and when I used the "hdfs" terminal command it worked just fine. The command I used: *hdfs dfs -get abfss:// /home/adminello/files/*
> *ABFSS is a special driver, created to access ADLS Gen2, which I currently have,* and hdfs works with this driver only from version 3.2.0. BUT the folder hadoop-3.2.0/lib/native was missing the "libhdfs" file, which really surprised me, because in hadoop-3.1.2 it was present, although that version doesn't work with *abfss*. Just details, maybe you need them.
> But my company wants to access it through Python for further data analysis, which is relatively easy to do using Pandas, so at that point I started trying to connect via Python.
> I didn't find any example on Google, which is why I am not sure my script is correct. I tried lots of combinations, changing ports, replacing the file system name, adding "https" and "abfss" at the beginning of the hostname, and at this point I am stuck. Maybe somebody can help me.
> Thanks
