arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [Python] HDFS write fails when size of file is higher than 6gb
Date Tue, 26 Jan 2021 17:34:16 GMT
It appears that writes over 2 GB are implemented incorrectly.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.cc#L277

The tSize type in libhdfs is an int32_t, so that static_cast truncates the write length.

https://issues.apache.org/jira/browse/ARROW-11391

I would recommend breaking the write into smaller pieces as a workaround.
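
As a sketch of that workaround: instead of passing the whole file contents to a single write() call, copy between the two file objects in fixed-size chunks so that no individual write ever exceeds the 32-bit limit. The helper name `copy_in_chunks` and the 64 MiB chunk size below are illustrative choices, not part of the pyarrow API:

```python
import io

# 64 MiB per write: comfortably below the int32_t (2 GiB) limit in libhdfs.
CHUNK_SIZE = 64 * 1024 * 1024

def copy_in_chunks(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst in fixed-size chunks so no single write is too large."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes => end of file
            break
        dst.write(chunk)

# Applied to the script from the original report, this would look like
# (hypothetical adaptation, untested against HDFS):
#
# with connected.open(destination, "wb") as output_stream, \
#         open(source, "rb") as input_stream:
#     copy_in_chunks(input_stream, output_stream)
```

This also avoids loading the full 6 GB file into memory at once, which the original `open(source, "rb").read()` does.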

On Tue, Jan 26, 2021 at 1:45 AM Сергей Красовский <krasovcheg@gmail.com>
wrote:
>
> Hello Arrow team,
>
> I have an issue writing files larger than 6143 MB to HDFS. The exception is:
>
>> Traceback (most recent call last):
>>   File "exp.py", line 22, in <module>
>>     output_stream.write(open(source, "rb").read())
>>   File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
>>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>> OSError: HDFS Write failed, errno: 22 (Invalid argument)
>
>
> The code below works for files of size <= 6143 MB.
>
> Hadoop version: 3.1.1.3.1.4.0-315
> Python version: 3.6.10
> Pyarrow version: 2.0.0
> System: Ubuntu 16.04.7 LTS
>
> I am trying to understand what happens under the hood of pyarrow.lib.NativeFile.write. Is there
any limitation on the pyarrow side, an incompatibility with the Hadoop version, or a settings
issue on my side?
>
> If you have any input, I would highly appreciate it.
>
> The python script to upload a file:
>
>> import os
>> import pyarrow as pa
>>
>> os.environ["JAVA_HOME"]="<java_home>"
>> os.environ['ARROW_LIBHDFS_DIR'] = "<path>/libhdfs.so"
>>
>> connected = pa.hdfs.connect(host="<host>",port=8020)
>>
>> destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
>> source = "/tmp/6144m.txt"
>>
>> with connected.open(destination, "wb") as output_stream:
>>     output_stream.write(open(source, "rb").read())
>>
>> connected.close()
>
>
> How to create a 6 GB file:
>
>> truncate -s 6144M 6144m.txt
>
>
> Thanks a lot,
> Sergey
