Subject: Re: io.file.buffer.size different when not running in proper bash shell?
From: Nathan Grice <ngrice@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 26 Aug 2013 08:53:14 -0700

Well, I finally solved this one on my own. It turns out the 4096B was a red
herring; it also happens to be the I/O write buffer size in Python when
writing to a file, and I was (stupidly) not flushing the buffer before
trying to put the file into Hadoop. This was hard to chase down because
when the Python script exited, it flushed its buffer automatically on close
of the file handle, and thus the file size on the local fs was never 4096B
(always larger).
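For anyone who trips over the same thing, here is a minimal sketch of what
the fix looks like (put_to_hdfs and the paths are illustrative, not the
exact code from my client): the local file has to be fully flushed and
closed before hadoop fs -put is invoked.

import subprocess

def put_to_hdfs(csv_data, local_path, hdfs_path):
    """Write csv_data to local_path, then copy it into HDFS."""
    # The bug: -put was running while part of csv_data was still
    # sitting in Python's write buffer (4096 bytes by default on my
    # system), so HDFS only ever received the first 4096 bytes.
    # Closing the file (the "with" block does it) flushes everything
    # to disk before hadoop is invoked.
    with open(local_path, 'w') as f:
        f.write(csv_data)
    cmd = ['hadoop', 'fs', '-put', local_path, hdfs_path]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output, errors = p.communicate()
    if p.returncode:
        raise OSError(errors)
    return output

Neither Popen's bufsize nor -Dio.file.buffer.size ever mattered; the file
on the local disk was simply incomplete at the moment -put ran.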
On Fri, Aug 23, 2013 at 5:56 PM, Nathan Grice <ngrice@gmail.com> wrote:

> Thanks in advance for any help. I have been banging my head against the
> wall on this one all day.
> When I run the command
> hadoop fs -put /path/to/input /path/in/hdfs
> from the command line, the hadoop shell dutifully copies my entire file
> correctly, no matter the size.
>
> I wrote a webservice client for an external service in Python, and I am
> simply trying to replicate the same command after retrieving some
> CSV-delimited results from the webservice:
>
> import subprocess
>
> cmd = ['hadoop', 'fs', '-put', '/path/to/input/', '/path/in/hdfs/']
> p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
>                      bufsize=256*1024*1024)
> output, errors = p.communicate()
> if p.returncode:
>     raise OSError(errors)
> else:
>     LOG.info(output)  # LOG is our application logger
>
> Without fail, the hadoop shell only writes the first 4096 bytes of the
> input file (which according to the documentation is the default value
> for io.file.buffer.size).
>
> I have tried almost everything, including adding
> -Dio.file.buffer.size=XXXXXX where XXXXXX is a really big number, and
> NOTHING seems to work.
>
> Please help!