Subject: Re: io.file.buffer.size different when not running in proper bash shell?
From: Nathan Grice <ngrice@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 26 Aug 2013 08:53:14 -0700

Well, I finally solved this one on my own. It turns out the 4096B was a red
herring; it also happens to be the I/O write buffer size in Python when
writing to a file, and I was (stupidly) not flushing the buffer before
trying to put the file into Hadoop. This was hard to chase down because
when the Python script exited, it flushed its buffer automatically on close
of the file handle, and thus the file size on the local fs was never 4096B
(always larger).
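For anyone who trips over the same thing, here is a minimal sketch of what
the fix looks like (put_to_hdfs and the paths are illustrative, not the
exact code from my client): the local file has to be fully flushed and
closed before hadoop fs -put is invoked.

import subprocess

def put_to_hdfs(csv_data, local_path, hdfs_path):
    """Write csv_data to local_path, then copy it into HDFS."""
    # The bug: -put was running while part of csv_data was still
    # sitting in Python's write buffer (4096 bytes by default on my
    # system), so HDFS only ever received the first 4096 bytes.
    # Closing the file (the "with" block does it) flushes everything
    # to disk before hadoop is invoked.
    with open(local_path, 'w') as f:
        f.write(csv_data)
    cmd = ['hadoop', 'fs', '-put', local_path, hdfs_path]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output, errors = p.communicate()
    if p.returncode:
        raise OSError(errors)
    return output

Neither Popen's bufsize nor -Dio.file.buffer.size ever mattered; the file
on the local disk was simply incomplete at the moment -put ran.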
On Fri, Aug 23, 2013 at 5:56 PM, Nathan Grice <ngrice@gmail.com> wrote:

> Thanks in advance for any help. I have been banging my head against the
> wall on this one all day.
> When I run the command
> hadoop fs -put /path/to/input /path/in/hdfs
> from the command line, the hadoop shell dutifully copies my entire file
> correctly, no matter the size.
>
> I wrote a webservice client for an external service in Python, and I am
> simply trying to replicate the same command after retrieving some
> CSV-delimited results from the webservice:
>
> import subprocess
>
> cmd = ['hadoop', 'fs', '-put', '/path/to/input/', '/path/in/hdfs/']
> p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
>                      bufsize=256*1024*1024)
> output, errors = p.communicate()
> if p.returncode:
>     raise OSError(errors)
> else:
>     LOG.info(output)  # LOG is our application logger
>
> Without fail, the hadoop shell only writes the first 4096 bytes of the
> input file (which according to the documentation is the default value
> for io.file.buffer.size).
>
> I have tried almost everything, including adding
> -Dio.file.buffer.size=XXXXXX where XXXXXX is a really big number, and
> NOTHING seems to work.
>
> Please help!