arrow-user mailing list archives

From Сергей Красовский <krasovc...@gmail.com>
Subject [Python] HDFS write fails when file size exceeds 6 GB
Date Tue, 26 Jan 2021 07:45:14 GMT
Hello Arrow team,

I have an issue writing files larger than 6143 MB to HDFS. The exception is:

Traceback (most recent call last):
  File "exp.py", line 22, in <module>
    output_stream.write(open(source, "rb").read())
  File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Write failed, errno: 22 (Invalid argument)

The code below works for files of size <= 6143 MB.

Hadoop version: 3.1.1.3.1.4.0-315
Python version: 3.6.10
Pyarrow version: 2.0.0
System: Ubuntu 16.04.7 LTS

I am trying to understand what happens under the hood of
pyarrow.lib.NativeFile.write. Is there a limitation on the pyarrow side, an
incompatibility with this Hadoop version, or a settings issue on my side?

If you have any input, I would greatly appreciate it.

The Python script used to upload the file:

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

with connected.open(destination, "wb") as output_stream:
    output_stream.write(open(source, "rb").read())

connected.close()
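
For reference, here is a chunked variant of the same copy that I could try
instead of the single large write; the 64 MiB chunk size is an arbitrary
choice, and I have not yet verified whether writing in smaller pieces avoids
the error:

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

# Copy in fixed-size chunks instead of a single 6 GB write() call.
chunk_size = 64 * 1024 * 1024  # 64 MiB per write; an arbitrary choice
with connected.open(destination, "wb") as output_stream:
    with open(source, "rb") as input_stream:
        while True:
            chunk = input_stream.read(chunk_size)
            if not chunk:
                break
            output_stream.write(chunk)

connected.close()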

How to create the 6 GB test file:

truncate -s 6144M 6144m.txt
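
Equivalently, a file of the same size can be created from Python; this simply
mirrors the truncate command above:

# Create a 6144 MiB zero-filled file, same as `truncate -s 6144M`.
with open("/tmp/6144m.txt", "wb") as f:
    f.truncate(6144 * 1024 * 1024)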

Thanks a lot,
Sergey
