Date: Sat, 25 Jan 2014 19:58:37 +0530
Subject: RE: HDFS data transfer is faster than SCP based transfer?
From: rab ra <rabmdu@gmail.com>
To: user@hadoop.apache.org

The input files are provided as arguments to a binary executed by the map
process. This binary cannot read from HDFS, and I can't rewrite it.

On 25 Jan 2014 19:47, "John Lilley" <john.lilley@redpoint.net> wrote:
> There are no short-circuit writes, only reads, AFAIK.
>
> Is it necessary to transfer from HDFS to local disk? Can you read from
> HDFS directly using the FileSystem interface?
>
> john
>
> *From:* Shekhar Sharma [mailto:shekhar2581@gmail.com]
> *Sent:* Saturday, January 25, 2014 3:44 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: HDFS data transfer is faster than SCP based transfer?
>
> We have the concept of short-circuit reads, which read directly from the
> datanode and improve read performance. Do we have a similar concept of
> short-circuit writes?
>
> On 25 Jan 2014 16:10, "Harsh J" <harsh@cloudera.com> wrote:
> There's a lot of difference here: although both use TCP underneath, do
> note that SCP securely encrypts data while a stock HDFS configuration
> does not.
>
> You can also ask SCP to compress the data transfer via the "-C" argument
> - unsure if you already applied that pre-test - it may help show up some
> difference. Also, the encryption algorithm can be changed to a weaker one
> if security is not a concern during the transfer, via "-c arcfour".
>
> On Fri, Jan 24, 2014 at 10:55 AM, rab ra <rabmdu@gmail.com> wrote:
> > Hello
> >
> > I have a use case that requires transfer of input files from remote
> > storage over the SCP protocol (using the jSch jar). To optimize this
> > use case, I pre-loaded all my input files into HDFS and modified the
> > application so that it copies the required files from HDFS instead.
> > When the tasktrackers run, they copy the input files they need from
> > HDFS to their local directories. All my tasktrackers are also
> > datanodes. The use case now runs faster, and the only modification is
> > that files are copied from HDFS instead of transferred over SCP. The
> > use case also involves parallel operations (run in the tasktrackers)
> > that do a lot of file transfer, and all of those transfers are now
> > replaced with HDFS copies.
> >
> > Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is
> > it because it uses TCP/IP? Can anyone give me reasonable reasons to
> > support the decrease in time?
> >
> > with thanks and regards
> > rab
>
> --
> Harsh J
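For reference, the short-circuit reads Shekhar mentions are enabled client-side through two hdfs-site.xml properties (property names as of Hadoop 2.x; the socket path below is only a common example, not a required value):

```xml
<!-- hdfs-site.xml: enable short-circuit local reads.
     The domain socket path must be writable by the DataNode
     and readable by HDFS clients on the same host. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

This only helps reads served from a local DataNode; as John notes above, there is no write-side equivalent.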

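A minimal sketch of the HDFS-to-local copy being discussed, using the FileSystem interface John suggests (Hadoop 2.x client API; the paths are hypothetical, and this needs a running HDFS cluster plus the hadoop-client dependency on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchInput {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy the input down so the external binary can read it
        // from local disk (it cannot read HDFS directly).
        fs.copyToLocalFile(new Path("/input/part-0001"),       // hypothetical HDFS path
                           new Path("file:///tmp/part-0001")); // hypothetical local path

        // ... exec the binary against /tmp/part-0001 here ...
    }
}
```

Unlike the SCP path, this read is served over plain TCP from (often local) DataNodes with no encryption overhead, which is consistent with the speedup rab observed.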