hadoop-mapreduce-user mailing list archives

From rab ra <rab...@gmail.com>
Subject RE: HDFS data transfer is faster than SCP based transfer?
Date Sat, 25 Jan 2014 14:28:37 GMT
The input files are passed as arguments to a binary that the map task
executes. This binary cannot read from HDFS, and I can't rewrite it.
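
A minimal sketch of the staging pattern described above: copy each input out of HDFS to local disk, then pass the local paths to the binary. It assumes the standard `hadoop fs -get` shell command is on the PATH. The function names, the binary path, and the directory layout are hypothetical and only illustrate the idea.

```python
import os
import posixpath
import subprocess


def build_fetch_command(hdfs_path, local_dir):
    """Return the argv that copies one HDFS file into a local directory."""
    return ["hadoop", "fs", "-get", hdfs_path, local_dir]


def build_binary_command(binary, local_dir, hdfs_paths):
    """Return the argv that runs the binary on the local copies of the inputs."""
    # HDFS paths always use '/' separators, so take the file name with posixpath.
    local_files = [os.path.join(local_dir, posixpath.basename(p)) for p in hdfs_paths]
    return [binary] + local_files


def stage_and_run(binary, hdfs_paths, local_dir):
    """Copy every input from HDFS to local disk, then run the binary on them."""
    os.makedirs(local_dir, exist_ok=True)
    for path in hdfs_paths:
        # check=True raises CalledProcessError if the copy fails.
        subprocess.run(build_fetch_command(path, local_dir), check=True)
    return subprocess.run(build_binary_command(binary, local_dir, hdfs_paths), check=True)
```

A map task could call something like `stage_and_run("./tool", ["/data/in/a.dat"], "/tmp/work")` (all three values hypothetical). When the task runs on a datanode that holds a replica of the file, the copy may be served locally rather than over the network.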
On 25 Jan 2014 19:47, "John Lilley" <john.lilley@redpoint.net> wrote:

>  There are no short-circuit writes, only reads, AFAIK.
>
> Is it necessary to transfer from HDFS to local disk?  Can you read from
> HDFS directly using the FileSystem interface?
>
> john
>
>
>
> *From:* Shekhar Sharma [mailto:shekhar2581@gmail.com]
> *Sent:* Saturday, January 25, 2014 3:44 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: HDFS data transfer is faster than SCP based transfer?
>
>
>
> We have the concept of short-circuit reads, which read directly from the
> datanode and improve read performance. Is there a similar concept for
> writes, i.e. short-circuit writes?
>
> On 25 Jan 2014 16:10, "Harsh J" <harsh@cloudera.com> wrote:
>
> The two differ considerably, although both use TCP underneath. Note in
> particular that SCP encrypts data in transit, whereas a stock HDFS
> configuration does not.
>
> You can also ask SCP to compress data transfer via the "-C" argument
> btw - unsure if you already applied that pre-test - it may help show
> up some difference. Also, the encryption algorithm can be changed to a
> weaker one if security is not a concern during the transfer, via "-c
> arcfour".
>
> On Fri, Jan 24, 2014 at 10:55 AM, rab ra <rabmdu@gmail.com> wrote:
> > Hello
> >
> > I have a use case that requires transferring input files from remote
> > storage using the SCP protocol (via the JSch jar). To optimize it, I
> > pre-loaded all my input files into HDFS and modified the application so
> > that it copies the required files from HDFS instead. When a tasktracker
> > runs a task, it copies the required input files from HDFS to its local
> > directory. All my tasktrackers are also datanodes. With this change, the
> > use case runs faster. The only modification to my application is that
> > files are copied from HDFS instead of being transferred over SCP. The
> > use case also involves parallel operations (run on the tasktrackers)
> > that perform many file transfers; all of these transfers are now
> > replaced with HDFS copies.
> >
> > Can anyone confirm that HDFS transfer is really faster, as I observed?
> > Is it because it uses TCP/IP? Can anyone give me sound reasons for the
> > decrease in time?
> >
> >
> > with thanks and regards
> > rab
>
>
>
> --
> Harsh J
>
