Subject: Re: HDFS data transfer is faster than SCP based transfer?
From: Shekhar Sharma <shekhar2581@gmail.com>
To: user@hadoop.apache.org
Date: Sat, 25 Jan 2014 15:47:13 +0530

When you put (write) data into HDFS, the client writes it in small packets of roughly 64 KB, which are pushed through the replication pipeline; this continues until 64 MB, the block size defined by the client, has been written.

scp, on the other hand, tries to buffer the entire file. Passing small chunks of data through a pipeline is faster than passing one large buffer.

Please check how writes happen in HDFS; that will give you a clearer picture.

On 24 Jan 2014 10:56, "rab ra" <rabmdu@gmail.com> wrote:
> Hello
>
> I have a use case that requires transferring input files from remote
> storage over SCP (using the JSch jar). To optimize this, I pre-loaded
> all my input files into HDFS and modified the use case so that it
> copies the required files from HDFS instead. Now, when the tasktrackers
> run, they copy the input files they need to their local directory from
> HDFS. All my tasktrackers are also datanodes. The use case runs
> noticeably faster; the only change in the application is that files are
> copied from HDFS rather than transferred with SCP. The use case also
> involves parallel operations (run in the tasktrackers) that do a lot of
> file transfer, and all of those transfers are now HDFS copies.
>
> Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is
> it because it uses TCP/IP? Can anyone give me reasonable reasons to
> support the decrease in time?
>
> with thanks and regards
> rab
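To make the pipelined-write idea above concrete, here is a minimal sketch. This is plain Python, not the actual HDFS client; the packet and block sizes mirror the 64 KB / 64 MB figures from the answer, and `chunked_copy` / `buffered_copy` are hypothetical helper names for illustration only:

```python
import io

PACKET_SIZE = 64 * 1024          # HDFS streams writes in ~64 KB packets
BLOCK_SIZE = 64 * 1024 * 1024    # default block size in Hadoop 1.x

def chunked_copy(src, dst, packet_size=PACKET_SIZE):
    """Stream src to dst one packet at a time, in the spirit of the HDFS
    write pipeline: memory use stays at one packet regardless of file size,
    and each packet can be forwarded downstream while the next is read."""
    copied = 0
    while True:
        packet = src.read(packet_size)
        if not packet:
            break
        dst.write(packet)
        copied += len(packet)
    return copied

def buffered_copy(src, dst):
    """Read the whole source into memory first, then write it out --
    the 'buffer everything' behaviour the answer attributes to scp."""
    data = src.read()
    dst.write(data)
    return len(data)

if __name__ == "__main__":
    payload = b"x" * (256 * 1024)            # 256 KB sample payload
    out1, out2 = io.BytesIO(), io.BytesIO()
    n1 = chunked_copy(io.BytesIO(payload), out1)
    n2 = buffered_copy(io.BytesIO(payload), out2)
    assert out1.getvalue() == out2.getvalue() == payload
    print(n1, n2)  # both 262144
```

Both helpers move the same bytes; the difference is that the chunked version never holds more than one packet in memory, which is what lets the real HDFS pipeline overlap reading, forwarding to the next datanode, and writing to disk.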