hadoop-hdfs-user mailing list archives

From Nathan Grice <ngr...@gmail.com>
Subject Re: io.file.buffer.size different when not running in proper bash shell?
Date Mon, 26 Aug 2013 15:53:14 GMT
Well, I finally solved this one on my own. Turns out the 4096B was a red
herring; it also happens to be the io write buffer size in Python when
writing to a file, and I was (stupidly) not flushing the buffer before
trying to write the file to Hadoop. This was hard to chase down because
when the Python script exited it flushed its buffer automatically on close
of the file handle, and thus the file size on the local fs was never 4096B
(always the full size).
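A minimal sketch of the pitfall described above (file name and data are made up for illustration): data smaller than Python's write buffer can sit in userspace until flush or close, so an external `hadoop fs -put` launched at that moment sees a short file, while the file looks complete once the script exits.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.csv")
data = "a,b,c\n" * 1000  # ~6 KB, smaller than the default write buffer

f = open(path, "w")
f.write(data)
# At this point the data may still be sitting in Python's userspace
# buffer, so the on-disk size is often 0 (or a partial multiple of the
# buffer size), not len(data) -- this is the window in which the hadoop
# shell was being handed a truncated file.
size_before_flush = os.path.getsize(path)

f.flush()  # push Python's buffer down to the OS
f.close()  # closing also flushes, which is why the local file always
           # looked complete after the script had exited

size_after_close = os.path.getsize(path)
```

Checking `os.path.getsize` before launching the subprocess (or simply closing the file first) makes the truncation visible immediately instead of after a day of head-scratching.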

On Fri, Aug 23, 2013 at 5:56 PM, Nathan Grice <ngrice@gmail.com> wrote:

> Thanks in advance for any help. I have been banging my head against the
> wall on this one all day.
> When I run the cmd:
> hadoop fs -put /path/to/input /path/in/hdfs from the command line, the
> hadoop shell dutifully copies my entire file correctly, no matter the size.
> I wrote a webservice client for an external service in Python and I am
> simply trying to replicate the same command after retrieving some
> CSV-delimited results from the webservice:
> cmd = ['hadoop', 'fs', '-put', '/path/to/input/', '/path/in/hdfs/']
> p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
>                      bufsize=256*1024*1024)
> output, errors = p.communicate()
> if p.returncode:
>     raise OSError(errors)
> else:
>     LOG.info(output)
> Without fail, the hadoop shell only writes the first 4096 bytes of the
> input file (which according to the documentation is the default value
> for io.file.buffer.size).
> I have tried almost everything including adding
> -Dio.file.buffer.size=XXXXXX where XXXXXX is a really big number and
> NOTHING seems to work.
> Please help!
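Given the fix described at the top of the thread, the corrected flow is to fully flush and close the local file before invoking the hadoop shell. A sketch, keeping the subprocess call from the original message (helper names are hypothetical, and the upload step requires `hadoop` on PATH):

```python
import os
import subprocess

def write_results(local_path, data):
    """Write webservice results and make sure they reach disk
    before anything else reads the file."""
    with open(local_path, "w") as f:
        f.write(data)
        f.flush()             # empty Python's userspace buffer
        os.fsync(f.fileno())  # ask the OS to commit the bytes to disk
    # exiting the `with` block closes the handle, so nothing is left
    # buffered and os.path.getsize(local_path) now equals len(data)

def put_to_hdfs(local_path, hdfs_path):
    """Same subprocess invocation as in the original message; only safe
    to call after write_results has closed the file."""
    p = subprocess.Popen(['hadoop', 'fs', '-put', local_path, hdfs_path],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, errors = p.communicate()
    if p.returncode:
        raise OSError(errors)
    return output
```

Note that the `bufsize` argument to `Popen` only sizes the pipe buffers between the parent and the child process; it has no effect on how much of the input file has actually been written to disk, which is why tuning it (and `io.file.buffer.size`) could never fix this.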
