hadoop-common-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Sailfish
Date Fri, 11 May 2012 06:32:59 GMT
Hey Sriram,

We discussed this before, but for the benefit of the wider audience: :)

It seems like the requirements imposed on KFS by Sailfish are in most
ways much simpler than the requirements of a full distributed
filesystem. The one thing we need is atomic record append -- but we
don't need anything else, like filesystem metadata/naming,
replication, corrupt data scanning, etc. All of the data is
transient/short-lived and at replication count 1.
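
As a toy illustration of that one primitive: record append can be modeled
as an append where the file (not the writer) picks each record's offset
atomically and returns it. This is my own sketch of the semantics, not
KFS's actual API:

```python
import threading

class RecordAppendFile:
    """Toy model of atomic record append: many writers append
    concurrently; the file, not the writer, chooses each record's offset."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = bytearray()

    def record_append(self, payload: bytes) -> int:
        # Atomically reserve space and write; return the chosen offset.
        with self._lock:
            offset = len(self._data)
            self._data.extend(payload)
            return offset

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._data[offset:offset + length])

f = RecordAppendFile()
off_a = f.record_append(b"map-output-A")
off_b = f.record_append(b"map-output-B")
# Each writer can later locate exactly what it wrote.
assert f.read(off_a, 12) == b"map-output-A"
```

Everything else a full DFS provides (naming, replication, scanning) is
unnecessary for this transient, replication-1 data.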

So I think building something specific to this use case would be
pretty practical - and my guess is it might even have some benefits
over trying to use a full DFS.

In the MR2 architecture, I'd probably try to build this as a service
plugin in the NodeManager (similar to the way that the ShuffleHandler
in the current implementation works).
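
For concreteness, the existing ShuffleHandler is wired in through the
NodeManager's auxiliary-services config in yarn-site.xml; a Sailfish-style
service could register the same way. The "sailfish_ifile" service name and
class below are hypothetical (and note the aux-service key names have
varied across Hadoop versions):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,sailfish_ifile</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.sailfish_ifile.class</name>
  <value>org.sailfish.yarn.IFileService</value>
</property>
```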

-Todd

On Thu, May 10, 2012 at 11:01 PM, Sriram Rao <sriramsrao@gmail.com> wrote:
> Srivas,
>
> Sailfish builds upon record append (a feature not present in HDFS).
>
> The software that is currently released is based on Hadoop-0.20.2.  You use
> the Sailfish version of Hadoop-0.20.2, KFS for the intermediate data, and
> then HDFS (or KFS) for storing the job/input.  Since the changes are all in
> the handling of map output/reduce input, it is transparent to existing jobs.
>
> What is being proposed below is to bolt all the starting/stopping of the
> related daemons into YARN as a first step.  There are other approaches that
> are possible, which have a similar effect.
>
> Hope this helps.
>
> Sriram
>
>
> On Thu, May 10, 2012 at 10:50 PM, M. C. Srivas <mcsrivas@gmail.com> wrote:
>
>> Sriram,   Sailfish depends on append. I just noticed that HDFS disabled
>> append. How does one use this with Hadoop?
>>
>>
>> On Wed, May 9, 2012 at 9:00 AM, Otis Gospodnetic <
>> otis_gospodnetic@yahoo.com
>> > wrote:
>>
>> > Hi Sriram,
>> >
>> > >> The I-file concept could possibly be implemented here in a fairly self
>> > contained way. One
>> > >> could even colocate/embed a KFS filesystem with such an alternate
>> > >> shuffle, like how MR task temporary space is usually colocated with
>> > >> HDFS storage.
>> >
>> > >  Exactly.
>> >
>> > >> Does this seem reasonable in any way?
>> >
>> > > Great. Where do we go from here?  How do we get a collaborative
>> > > effort going?
>> >
>> >
>> > Sounds like a JIRA issue should be opened, the approach briefly
>> > described, and the first implementation attempt made.  Then iterate.
>> >
>> > I look forward to seeing this! :)
>> >
>> > Otis
>> > --
>> >
>> > Performance Monitoring for Solr / ElasticSearch / HBase -
>> > http://sematext.com/spm
>> >
>> >
>> >
>> > >________________________________
>> > > From: Sriram Rao <sriramsrao@gmail.com>
>> > >To: common-dev@hadoop.apache.org
>> > >Sent: Tuesday, May 8, 2012 6:48 PM
>> > >Subject: Re: Sailfish
>> > >
>> > >Dear Andy,
>> > >
>> > >> From: Andrew Purtell <apurt...@apache.org>
>> > >> ...
>> > >
>> > >> Do you intend this to be a joint project with the Hadoop community
>> > >> or a technology competitor?
>> > >
>> > >As I had said in my email, we are looking for folks to collaborate
>> > >with us to help get us integrated with Hadoop.  So, to be explicitly
>> > >clear, we are intending for this to be a joint project with the
>> > >community.
>> > >
>> > >> Regrettably, KFS is not a "drop in replacement" for HDFS.
>> > >> Hypothetically: I have several petabytes of data in an existing HDFS
>> > >> deployment, which is the norm, and a continuous MapReduce workflow.
>> > >> How do you propose I, practically, migrate to something like Sailfish
>> > >> without a major capital expenditure and/or downtime and/or data loss?
>> > >
>> > >Well, we are not asking for KFS to replace HDFS.  One path you could
>> > >take is to experiment with Sailfish---use KFS just for the
>> > >intermediate data and HDFS for everything else.  There is no major
>> > >capex :).  While you get comfy with pushing intermediate data into a
>> > >DFS, we get the ideas added to HDFS.  This simplifies deployment
>> > >considerations.
>> > >
>> > >> However, can the Sailfish I-files implementation be plugged in as an
>> > >> alternate Shuffle implementation in MRv2 (see MAPREDUCE-3060 and
>> > >> MAPREDUCE-4049),
>> > >
>> > >This'd be great!
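
For what it's worth, following the ShuffleConsumerPlugin approach that
MAPREDUCE-4049 proposes, a job might opt in to an alternate shuffle
consumer via mapred-site.xml roughly like this. The Sailfish class name
is hypothetical, and the property name may change as that patch evolves:

```xml
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>org.sailfish.mapred.IFileShuffleConsumer</value>
</property>
```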
>> > >
>> > >> with necessary additional plumbing for dynamic
>> > >> adjustment of reduce task population? And the workbuilder could be
>> > >> part of an alternate MapReduce Application Manager?
>> > >
>> > >It should be part of the AM.  (Currently, with our implementation in
>> > >Hadoop-0.20.2, the workbuilder serves the role of an AM).
>> > >
>> > >> The I-file concept could possibly be implemented here in a fairly self
>> > contained way. One
>> > >> could even colocate/embed a KFS filesystem with such an alternate
>> > >> shuffle, like how MR task temporary space is usually colocated with
>> > >> HDFS storage.
>> > >
>> > >Exactly.
>> > >
>> > >> Does this seem reasonable in any way?
>> > >
>> > >Great. Where do we go from here?  How do we get a collaborative
>> > >effort going?
>> > >
>> > >Best,
>> > >
>> > >Sriram
>> > >
>> > >>>  From: Sriram Rao <sriramsrao@gmail.com>
>> > >>> To: common-dev@hadoop.apache.org
>> > >>> Sent: Tuesday, May 8, 2012 10:32 AM
>> > >>> Subject: Project announcement: Sailfish (also, looking for
>> > collaborators)
>> > >>>
>> > >>> Hi,
>> > >>>
>> > >>> I'd like to announce the release of a new open source project,
>> > Sailfish.
>> > >>>
>> > >>> http://code.google.com/p/sailfish/
>> > >>>
>> > >>> Sailfish tries to improve Hadoop performance, particularly for
>> > >>> large jobs which process TBs of data and run for hours.  In
>> > >>> building Sailfish, we modify how map output is handled and
>> > >>> transported from map->reduce.
>> > >>>
>> > >>> The project pages provide more information about the project.
>> > >>>
>> > >>> We are looking for collaborators who can help get some of the
>> > >>> ideas into Apache Hadoop. A possible step forward could be to make
>> > >>> the "shuffle" phase of Hadoop pluggable.
>> > >>>
>> > >>> If you are interested in working with us, please get in touch
>> > >>> with me.
>> > >>>
>> > >>> Sriram
>> > >>
>> > >
>> > >
>> > >
>> > >--
>> > >Best regards,
>> > >
>> > >   - Andy
>> > >
>> > >Problems worthy of attack prove their worth by hitting back. - Piet
>> > >Hein (via Tom White)
>> > >
>> > >
>> > >
>> >
>>



-- 
Todd Lipcon
Software Engineer, Cloudera
