From: Harsh J
Date: Thu, 21 Feb 2013 00:09:33 +0530
Subject: Re: copy chunk of hadoop output
To: user@hadoop.apache.org

Hi JM,

I am not sure how "dangerous" it is, since we're using a pipe here and, as you yourself note, it will only last until the requested bytes have been read, and then terminate. The -cat process terminates because the process we're piping to terminates first, once it reaches its -c byte-count goal; so the -cat program will certainly not fetch the whole file down, though it may fetch a few extra bytes over the wire due to read buffers (the extra data won't be put into the target file and gets discarded). We can try it out and observe the "clienttrace" logged at the DN at the end of the -cat's read.
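As an aside, the broken-pipe mechanism itself is easy to observe with plain coreutils, no HDFS needed; a minimal sketch:

# 'yes' would write forever, but it receives SIGPIPE and exits as soon as
# 'head' has read its 5 bytes and closed the read end of the pipe.
yes | head -c 5 | wc -c      # prints 5

# 'hadoop fs -cat' stops the same way: once 'head' exits, the next write
# into the broken pipe fails and the client stops reading from the DN.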
Here's an example. I wrote a ~1.6 MB file called "foo.jar"; see the "bytes" field below, it is 1,658,314 bytes (~1.58 MB):

2013-02-20 23:55:19,777 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:58785, dest: /127.0.0.1:50010, bytes: 1658314, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_915204057_1, offset: 0, srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid: BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870, duration: 192289000

I ran the command "hadoop fs -cat foo.jar | head -c 5 > foo.xml" to store the first 5 bytes into a local file.

Asserting that after the command we get 5 bytes:

➜  ~ wc -c foo.xml
       5 foo.xml

Asserting that the DN didn't IO-read the whole file, see the read op below and its "bytes" parameter: it is only about 193 KB, not the whole ~1.58 MB block we wrote earlier:

2013-02-21 00:01:32,437 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:58802, bytes: 198144, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1698829178_1, offset: 0, srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid: BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870, duration: 19207000

I don't see how this is any more dangerous than doing a -copyToLocal/-get, which retrieves the whole file anyway?

On Wed, Feb 20, 2013 at 9:25 PM, Jean-Marc Spaggiari wrote:
> But be careful.
>
> hadoop fs -cat will retrieve the entire file and will only finish when it
> has retrieved the last bytes you are looking for.
>
> If your file is many GB big, it will take a lot of time for this
> command to complete and will put some pressure on your network.
>
> JM
>
> 2013/2/19, jamal sasha :
>> Awesome, thanks :)
>>
>> On Tue, Feb 19, 2013 at 2:14 PM, Harsh J wrote:
>>
>>> You can instead use 'fs -cat' and the 'head' coreutil, as one example:
>>>
>>> hadoop fs -cat 100-byte-dfs-file | head -c 5 > 5-byte-local-file
>>>
>>> On Wed, Feb 20, 2013 at 3:38 AM, jamal sasha wrote:
>>> > Hi,
>>> > I was wondering, in the following command:
>>> >
>>> > bin/hadoop dfs -copyToLocal hdfspath localpath
>>> >
>>> > can we specify to copy not the full file but, say, x MB of the file to the local drive?
>>> >
>>> > Is something like this possible?
>>> > Thanks
>>> > Jamal
>>>
>>> --
>>> Harsh J

--
Harsh J
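For reference, the general form of the recipe discussed in this thread, answering the original "x MB of the file" question; the HDFS path, local file name, and 5 MB cut-off below are hypothetical, adjust as needed:

# copy only the first 5 MB of an HDFS file to local disk
# (/user/jamal/part-00000 and first-5mb.dat are made-up names)
hadoop fs -cat /user/jamal/part-00000 | head -c $((5 * 1024 * 1024)) > first-5mb.dat

# confirm the local size
wc -c first-5mb.dat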