hadoop-common-user mailing list archives

From "Alex Loddengaard" <alexloddenga...@gmail.com>
Subject Re: architecture diagram
Date Tue, 07 Oct 2008 17:55:34 GMT
Thanks for the clarification, Samuel.  I wasn't aware that a split boundary could fall inside a line, or of how TextInputFormat handles that case.  Terrence, this means that you'll have to take the approach of collecting key
=> column_number, value => column_contents in your map step.
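That collect-by-column idea can be simulated without a cluster. The toy Java below (class and method names invented for illustration, no Hadoop dependencies) plays both the map step, which emits key => column_number with value => column_contents for every field, and the framework's shuffle, which groups values by key, so each reduce call would see one whole column:

```java
import java.util.*;

// Toy simulation of the suggested map step: for every field of every line,
// emit (column_number => column_contents). The framework's shuffle then
// groups values by key, so each reduce call sees one whole column.
// (In a real job the grouped values arrive in no guaranteed order, so you
// would also emit each value's row offset to rebuild the column order.)
public class ColumnPivotSketch {

    static Map<Integer, List<String>> mapAndGroup(List<String> lines, String delimRegex) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(delimRegex);        // the "map" emit
            for (int col = 0; col < fields.length; col++) {
                grouped.computeIfAbsent(col, k -> new ArrayList<>()).add(fields[col]);
            }
        }
        return grouped;                                      // the "shuffle" result
    }

    public static void main(String[] args) {
        // '|' is a regex metacharacter, so it must be escaped for String.split().
        System.out.println(mapAndGroup(List.of("A|B|C", "D|E|G"), "\\|"));
    }
}
```

Each key's list is exactly one column ({0=[A, D], 1=[B, E], 2=[C, G]} here), which is what makes a per-column shuffle possible in one reduce step.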

Alex

On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo <guosijie@gmail.com> wrote:

> I think the 'split' Alex talked about is the MapReduce system's action, while the 'split' you described is your mapper's action.
>
> I guess that your map/reduce application uses *TextInputFormat* to read your input file.
>
> Your input file will first be split into a few splits. Each split may look like <filename, offset, length>. What Alex said about 'The location of these splits is semi-arbitrary' means that the file split's offset in your input file is semi-arbitrary. Am I right, Alex?
> Then *TextInputFormat* will translate each file split into a sequence of lines, where the offset is treated as the key and the line is treated as the value.
>
> Because file splits are cut by byte offset, a line in your file may be cut across two file splits. The *LineRecordReader* used by *TextInputFormat* skips the partial line at the start of a split, to make sure that every mapper gets complete lines, one by one.
>
> For example:
>
> a file like this:
> ....
> AAA BBB CCC DDD
> EEE FFF GGG HHH
> AAA BBB CCC DDD
> ....
>
> it may be split into two file splits (assume that there are two mappers):
> split one:
> ....
> AAA BBB CCC
>
> split two:
> DDD
> EEE FFF GGG HHH
> AAA BBB CCC DDD
> ....
>
> Take split two as an example:
> TextInputFormat will use LineRecordReader to translate split two into a sequence of <offset, line> pairs, skipping the leading partial line "DDD". So the sequence will be:
> <offset1, "EEE FFF GGG HHH">
> <offset2, "AAA BBB CCC DDD">
> ....
>
> Then what to do with the lines depends on your job.
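Samuel's skip-the-partial-line rule can be sketched in plain Java. This is a toy model written for this thread, not Hadoop's actual LineRecordReader (which also handles buffering, CRLF endings, and compressed input); the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the rule described above: a reader whose split does not start
// at byte 0 discards everything up to and including the first newline (that
// partial line belongs to the previous split), and it finishes any line it
// has started even if that line runs past the split's end.
public class SplitReaderSketch {

    static List<String> readLinesForSplit(String file, int start, int length) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start > 0) {                       // skip the leading partial line
            int nl = file.indexOf('\n', start);
            if (nl < 0) return lines;
            pos = nl + 1;
        }
        int end = start + length;
        // A line is owned by the split in which it starts (boundary included).
        while (pos < file.length() && pos <= end) {
            int nl = file.indexOf('\n', pos);
            if (nl < 0) nl = file.length();
            lines.add(file.substring(pos, nl));
            pos = nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String file = "AAA BBB CCC DDD\nEEE FFF GGG HHH\nAAA BBB CCC DDD\n";
        // Cut the 48-byte file mid-line at byte 12, like the example above.
        System.out.println(readLinesForSplit(file, 0, 12));   // whole first line
        System.out.println(readLinesForSplit(file, 12, 36));  // skips "DDD"
    }
}
```

Split one reads past its end to finish the first line; split two skips the "DDD" remainder, so every line is delivered to exactly one mapper.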
>
>
> On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
>
> > So looking at the following mapper...
> >
> > http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
> >
> > On line 32, you can see the row split via a delimiter. On line 43, you can see that the field index (the column index) is the map key, and the map value is the field contents. How is this incorrect? I think this follows your earlier suggestion of:
> >
> > "You may want to play with the following idea: collect key => column_number and value => column_contents in your map step."
> >
> > Terrence A. Pietrondi
> >
> > --- On Mon, 10/6/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> >
> > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Monday, October 6, 2008, 12:55 PM
> > > As far as I know, splits will never be made within a line, only between rows.  To answer your question about ways to control the splits, see below:
> > >
> > > <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> > > <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>
> > >
> > > Alex
> > >
> > > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
> > >
> > > > Can you explain "The location of these splits is semi-arbitrary"? What if the example was...
> > > >
> > > > AAA|BBB|CCC|DDD
> > > > EEE|FFF|GGG|HHH
> > > >
> > > > Does this mean the split might be between CCC such that it results in AAA|BBB|C and C|DDD for the first line? Is there a way to control this behavior to split on my delimiter?
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > > --- On Sun, 10/5/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > >
> > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Sunday, October 5, 2008, 9:26 PM
> > > > > Let's say you have one very large input file of the form:
> > > > >
> > > > > A|B|C|D
> > > > > E|F|G|H
> > > > > ...
> > > > > |1|2|3|4
> > > > >
> > > > > This input file will be broken up into N pieces, where N is the number of mappers that run.  The location of these splits is semi-arbitrary.  This means that unless you have one mapper, you won't be able to see the entire contents of a column in your mapper.  Given that you would need one mapper to be able to see the entirety of a column, you've now essentially reduced your problem to a single machine.
> > > > >
> > > > > You may want to play with the following idea: collect key => column_number and value => column_contents in your map step.  This means that you would be able to see the entirety of a column in your reduce step, though you're still faced with the tasks of shuffling and re-pivoting.
> > > > >
> > > > > Does this clear up your confusion?  Let me know if you'd like me to clarify more.
> > > > >
> > > > > Alex
> > > > >
> > > > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
> > > > >
> > > > > > I am not sure why this doesn't fit, maybe you can help me understand. Your previous comment was...
> > > > > >
> > > > > > "The reason I'm making this claim is because in order to do the pivot operation you must know about every row. Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row."
> > > > > >
> > > > > > Are you saying that my row segments might not actually be the entire row, so I will get a bad key index? If so, how would the row segments be determined? I based my initial work off of the word count example, where the lines are tokenized. Does this mean in this example the row tokens may not be the complete row?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Terrence A. Pietrondi
> > > > > >
> > > > > > --- On Fri, 10/3/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > > >
> > > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > > Subject: Re: architecture diagram
> > > > > > > To: core-user@hadoop.apache.org
> > > > > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > > > > The approach that you've described does not fit well in to the MapReduce paradigm.  You may want to consider randomizing your data in a different way.
> > > > > > >
> > > > > > > Unfortunately some things can't be solved well with MapReduce, and I think this is one of them.
> > > > > > >
> > > > > > > Can someone else say more?
> > > > > > >
> > > > > > > Alex
> > > > > > >
> > > > > > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
> > > > > > >
> > > > > > > > Sorry for the confusion, I did make some typos. My example should have looked like...
> > > > > > > >
> > > > > > > > > A|B|C
> > > > > > > > > D|E|G
> > > > > > > > >
> > > > > > > > > pivots to...
> > > > > > > > >
> > > > > > > > > D|A
> > > > > > > > > E|B
> > > > > > > > > G|C
> > > > > > > > >
> > > > > > > > > Then for each row, shuffle the contents around randomly...
> > > > > > > > >
> > > > > > > > > D|A
> > > > > > > > > B|E
> > > > > > > > > C|G
> > > > > > > > >
> > > > > > > > > Then pivot the data back...
> > > > > > > > >
> > > > > > > > > A|E|G
> > > > > > > > > D|B|C
> > > > > > > >
> > > > > > > > The general goal is to shuffle the elements in each column in the input data. Meaning, the ordering of the elements in each column will not be the same as in the input.
> > > > > > > >
> > > > > > > > If you look at the initial input and compare to the final output, you'll see that during the shuffling, B and E are swapped, and G and C are swapped, while A and D were shuffled back into their originating positions in the column.
> > > > > > > >
> > > > > > > > Once again, sorry for the typos and confusion.
> > > > > > > >
> > > > > > > > Terrence A. Pietrondi
> > > > > > > >
> > > > > > > > --- On Fri, 10/3/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > > > > > >
> > > > > > > > > Can you confirm that the example you've presented is accurate?  I think you may have made some typos, because the letter "G" isn't in the final result; I also think your first pivot accidentally swapped C and G.  I'm having a hard time understanding what you want to do, because it seems like your operations differ from your example.
> > > > > > > > >
> > > > > > > > > With that said, at first glance, this problem may not fit well in to the MapReduce paradigm.  The reason I'm making this claim is because in order to do the pivot operation you must know about every row.  Your input files will be split at semi-arbitrary places, essentially making it impossible for each mapper to know every single row.  There may be a way to do this by collecting, in your map step, key => column number (0, 1, 2, etc) and value => (A, B, C, etc), though you may run in to problems when you try to pivot back.  I say this because when you pivot back, you need to have each column, which means you'll need one reduce step.  There may be a way to put the pivot-back operation in a second iteration, though I don't think that would help you.
> > > > > > > > >
> > > > > > > > > Terrence, please confirm that you've defined your example correctly.  In the meantime, can someone else confirm that this problem does not fit well in to the MapReduce paradigm?
> > > > > > > > >
> > > > > > > > > Alex
> > > > > > > > >
> > > > > > > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
> > > > > > > > >
> > > > > > > > > > I am trying to write a map reduce implementation to do the following:
> > > > > > > > > >
> > > > > > > > > > 1) read tabular data delimited in some fashion
> > > > > > > > > > 2) pivot that data, so the rows are columns and the columns are rows
> > > > > > > > > > 3) shuffle the rows (that were the columns) to randomize the data
> > > > > > > > > > 4) pivot the data back
> > > > > > > > > >
> > > > > > > > > > For example.....
> > > > > > > > > >
> > > > > > > > > > A|B|C
> > > > > > > > > > D|E|G
> > > > > > > > > >
> > > > > > > > > > pivots to...
> > > > > > > > > >
> > > > > > > > > > D|A
> > > > > > > > > > E|B
> > > > > > > > > > C|G
> > > > > > > > > >
> > > > > > > > > > Then for each row, shuffle the contents around randomly...
> > > > > > > > > >
> > > > > > > > > > D|A
> > > > > > > > > > B|E
> > > > > > > > > > G|C
> > > > > > > > > >
> > > > > > > > > > Then pivot the data back...
> > > > > > > > > >
> > > > > > > > > > A|E|C
> > > > > > > > > > D|B|C
> > > > > > > > > >
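Steps 1-4 above can be prototyped on a single machine in plain Java. In this sketch (class and method names invented, no Hadoop involved) the table is transposed, each pivoted row (i.e. each original column) is shuffled, and the table is transposed back. A straight transpose is used here, whereas the worked example above also reverses the element order within each pivoted row; that detail is cosmetic:

```java
import java.util.*;

// Single-machine sketch of the four steps above (names invented; no Hadoop).
public class ShuffleColumnsSketch {

    static String[][] transpose(String[][] t) {
        String[][] out = new String[t[0].length][t.length];
        for (int r = 0; r < t.length; r++)
            for (int c = 0; c < t[0].length; c++)
                out[c][r] = t[r][c];
        return out;
    }

    static String[][] shuffleColumns(String[][] table, Random rng) {
        String[][] pivoted = transpose(table);            // step 2: rows <-> columns
        for (String[] row : pivoted)                      // step 3: each pivoted row
            Collections.shuffle(Arrays.asList(row), rng); //         is one column
        return transpose(pivoted);                        // step 4: pivot back
    }

    public static void main(String[] args) {
        String[][] table = {{"A", "B", "C"}, {"D", "E", "G"}};
        for (String[] row : shuffleColumns(table, new Random()))
            System.out.println(String.join("|", row));
    }
}
```

After the round trip, each output column is a random permutation of the corresponding input column; the thread's open question is whether this can be done at scale when no single mapper sees all rows.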
> > > > > > > > > > You can reference my progress so far...
> > > > > > > > > >
> > > > > > > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > > > > > > >
> > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > >
> > > > > > > > > > --- On Thu, 10/2/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > > > > > >
> > > > > > > > > > > I think it really depends on the job as to where logic goes.  Sometimes your reduce step is as simple as an identity function, and sometimes it can be more complex than your map step.  It all depends on your data and the operation(s) you're trying to perform.
> > > > > > > > > > >
> > > > > > > > > > > Perhaps we should step out of the abstract.  Do you have a specific problem you're trying to solve?  Can you describe it?
> > > > > > > > > > >
> > > > > > > > > > > Alex
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I am sorry for the confusion. I meant distributed data.
> > > > > > > > > > > >
> > > > > > > > > > > > So help me out here. For example, if I am reducing to a single file, then my main transformation logic would be in my mapping step, since I am reducing away from the data?
> > > > > > > > > > > >
> > > > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > > > >
> > > > > > > > > > > > --- On Wed, 10/1/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm not sure what you mean by "disconnected parts of data," but Hadoop is implemented to try and perform map tasks on machines that have input data.  This is to lower the amount of network traffic, hence making the entire job run faster.  Hadoop does all this for you under the hood.  From a user's point of view, all you need to do is store data in HDFS (the distributed filesystem), and run MapReduce jobs on that data.  Take a look here:
> > > > > > > > > > > > >
> > > > > > > > > > > > > <http://wiki.apache.org/hadoop/WordCount>
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alex
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > So to be "distributed" in a sense, you would want to do your computation on the disconnected parts of data in the map phase, I would guess?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy <acm@yahoo-inc.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Arun C Murthy <acm@yahoo-inc.com>
> > > > > > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I am trying to plan out my map-reduce implementation and I have some questions of where computation should be split in order to take advantage of the distributed nodes.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Looking at the architecture diagram (http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the major computation areas or is the reduce the major computation area?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Usually the maps perform the 'embarrassingly parallel' computational steps, where-in each map works independently on a 'split' of your input, and the reduces perform the 'aggregate' computations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).  MapReduce divides applications into many small blocks of work.  HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.  MapReduce can then process the data where it is located.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The Hadoop Map-Reduce framework is quite good at scheduling your 'maps' on the actual data-nodes where the input-blocks are present, leading to i/o efficiencies...
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Arun
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Terrence A. Pietrondi
