hadoop-common-user mailing list archives

From "Alex Loddengaard" <alexloddenga...@gmail.com>
Subject Re: architecture diagram
Date Mon, 06 Oct 2008 16:55:01 GMT
As far as I know, splits will never be made within a line, only between
rows.  To answer your question about ways to control the splits, see below:

<http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
<http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>
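The line-boundary rule can be illustrated with a small sketch. This is a toy model, not Hadoop's real code (`read_split` is a hypothetical helper), but it follows the same rule line-oriented input formats use: a reader whose byte range starts mid-file skips the partial first line, because the reader for the previous range reads past its own end offset to finish that line.

```python
# Toy sketch (not Hadoop's real code; read_split is a hypothetical helper)
# of the rule that record boundaries stay intact even when byte-range
# splits fall mid-line.

def read_split(data: bytes, start: int, end: int):
    """Yield the whole lines 'owned' by the byte range [start, end)."""
    pos = start
    if start != 0:
        # The previous split finishes this (possibly partial) line; skip it.
        newline = data.find(b"\n", start)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end:
        newline = data.find(b"\n", pos)
        if newline == -1:
            yield data[pos:]  # file without a trailing newline
            return
        yield data[pos:newline]
        pos = newline + 1

data = b"AAA|BBB|CCC|DDD\nEEE|FFF|GGG|HHH\n"
# A split boundary at byte 20 falls inside the second line:
split1 = list(read_split(data, 0, 20))          # finishes the second line itself
split2 = list(read_split(data, 20, len(data)))  # skips it: yields nothing
```

So a record like "AAA|BBB|C" / "C|DDD" can never reach two different mappers with a line-oriented input format; each line is processed by exactly one mapper.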

Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi <tepietrondi@yahoo.com> wrote:

> Can you explain "The location of these splits is semi-arbitrary"? What if
> the example was...
>
> AAA|BBB|CCC|DDD
> EEE|FFF|GGG|HHH
>
>
> Does this mean the split might be between CCC such that it results in
> AAA|BBB|C and C|DDD for the first line? Is there a way to control this
> behavior to split on my delimiter?
>
>
> Terrence A. Pietrondi
>
>
> --- On Sun, 10/5/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
>
> > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Sunday, October 5, 2008, 9:26 PM
> > Let's say you have one very large input file of the form:
> >
> > A|B|C|D
> > E|F|G|H
> > ...
> > |1|2|3|4
> >
> > This input file will be broken up into N pieces, where N is the number of
> > mappers that run.  The location of these splits is semi-arbitrary.  This
> > means that unless you have one mapper, you won't be able to see the entire
> > contents of a column in your mapper.  Given that you would need one mapper
> > to be able to see the entirety of a column, you've now essentially reduced
> > your problem to a single machine.
> >
> > You may want to play with the following idea: collect key => column_number
> > and value => column_contents in your map step.  This means that you would
> > be able to see the entirety of a column in your reduce step, though you're
> > still faced with the tasks of shuffling and re-pivoting.
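The key => column_number idea can be sketched in a few lines of Python. This simulates the map step and the group-by-key that happens between map and reduce on a single machine; it illustrates the data flow only and is not Hadoop API code.

```python
# Sketch of the idea above (illustrative data flow, not Hadoop API code):
# the map step emits key => column_number, value => cell, so after the
# framework groups by key, a reducer sees one entire column no matter
# where the input file was split between lines.

from collections import defaultdict

def map_line(line):
    for column, cell in enumerate(line.split("|")):
        yield column, cell

def simulate_shuffle(lines):
    """Stand-in for Hadoop's group-by-key between map and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return grouped

columns = simulate_shuffle(["A|B|C|D", "E|F|G|H"])
# columns[0] == ["A", "E"]; columns[2] == ["C", "G"]
```

In a real job the order of values within a column is not guaranteed, which is part of what makes the re-pivot step hard.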
> >
> > Does this clear up your confusion?  Let me know if you'd like me to
> > clarify more.
> >
> > Alex
> >
> > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
> > <tepietrondi@yahoo.com> wrote:
> >
> > > I am not sure why this doesn't fit, maybe you can help me understand.
> > > Your previous comment was...
> > >
> > > "The reason I'm making this claim is because in order to do the pivot
> > > operation you must know about every row. Your input files will be split
> > > at semi-arbitrary places, essentially making it impossible for each
> > > mapper to know every single row."
> > >
> > > Are you saying that my row segments might not actually be the entire
> > > row, so I will get a bad key index? If so, how would the row segments be
> > > determined? I based my initial work off of the word count example, where
> > > the lines are tokenized. Does this mean in this example the row tokens
> > > may not be the complete row?
> > >
> > > Thanks.
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Fri, 10/3/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > >
> > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > The approach that you've described does not fit well in to the
> > > > MapReduce paradigm.  You may want to consider randomizing your data
> > > > in a different way.
> > > >
> > > > Unfortunately some things can't be solved well with MapReduce, and I
> > > > think this is one of them.
> > > >
> > > > Can someone else say more?
> > > >
> > > > Alex
> > > >
> > > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi
> > > > <tepietrondi@yahoo.com> wrote:
> > > >
> > > > > Sorry for the confusion, I did make some typos. My example should
> > > > > have looked like...
> > > > >
> > > > > > A|B|C
> > > > > > D|E|G
> > > > > >
> > > > > > pivots to...
> > > > > >
> > > > > > D|A
> > > > > > E|B
> > > > > > G|C
> > > > > >
> > > > > > Then for each row, shuffle the contents around randomly...
> > > > > >
> > > > > > D|A
> > > > > > B|E
> > > > > > C|G
> > > > > >
> > > > > > Then pivot the data back...
> > > > > >
> > > > > > A|E|G
> > > > > > D|B|C
> > > > >
> > > > > The general goal is to shuffle the elements in each column of the
> > > > > input data. Meaning, the ordering of the elements in each column
> > > > > will not be the same as in the input.
> > > > >
> > > > > If you look at the initial input and compare to the final output,
> > > > > you'll see that during the shuffling, B and E were swapped, and G
> > > > > and C were swapped, while A and D were shuffled back into their
> > > > > originating positions in the column.
> > > > >
> > > > > Once again, sorry for the typos and confusion.
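The pivot / shuffle / pivot-back operation described in the thread can be written as a small single-machine Python sketch (plain transposition via `zip`; the `random.Random(0)` seed is an arbitrary choice that only makes runs repeatable):

```python
# Single-machine sketch of the whole operation: transpose, shuffle each
# row of the transposed data (i.e., each original column), transpose back.

import random

def pivot(rows):
    """Transpose a list of equal-length rows."""
    return [list(column) for column in zip(*rows)]

def column_shuffle(rows, rng):
    columns = pivot(rows)
    for column in columns:
        rng.shuffle(column)
    return pivot(columns)

out = column_shuffle([["A", "B", "C"], ["D", "E", "G"]], random.Random(0))
# Every output column is a permutation of the matching input column,
# e.g. the first column still contains exactly {"A", "D"}.
```

This is exactly the computation that is hard to distribute: both `pivot` calls need every row in one place.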
> > > > >
> > > > > Terrence A. Pietrondi
> > > > >
> > > > > --- On Fri, 10/3/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > >
> > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > Subject: Re: architecture diagram
> > > > > > To: core-user@hadoop.apache.org
> > > > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > > > Can you confirm that the example you've presented is accurate?  I
> > > > > > think you may have made some typos, because the letter "G" isn't
> > > > > > in the final result; I also think your first pivot accidentally
> > > > > > swapped C and G.  I'm having a hard time understanding what you
> > > > > > want to do, because it seems like your operations differ from
> > > > > > your example.
> > > > > >
> > > > > > With that said, at first glance, this problem may not fit well in
> > > > > > to the MapReduce paradigm.  The reason I'm making this claim is
> > > > > > because in order to do the pivot operation you must know about
> > > > > > every row.  Your input files will be split at semi-arbitrary
> > > > > > places, essentially making it impossible for each mapper to know
> > > > > > every single row.  There may be a way to do this by collecting,
> > > > > > in your map step, key => column number (0, 1, 2, etc) and value
> > > > > > => (A, B, C, etc), though you may run in to problems when you try
> > > > > > to pivot back.  I say this because when you pivot back, you need
> > > > > > to have each column, which means you'll need one reduce step.
> > > > > > There may be a way to put the pivot-back operation in a second
> > > > > > iteration, though I don't think that would help you.
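The two-pass idea hinted at above (one reduce step to shuffle each column, then a second iteration to pivot back) might look roughly like this single-machine simulation. The function names and the `(row_index, (column, cell))` tagging are illustrative assumptions, not an existing Hadoop API:

```python
# Pass 1's reduce shuffles one whole column and tags each cell with its
# new row index; pass 2 (a second MapReduce iteration, simulated here)
# groups by row index to rebuild the rows.

import random
from collections import defaultdict

def reduce_column(column, cells, rng):
    """First pass: shuffle one whole column, emit re-pivot records."""
    shuffled = cells[:]
    rng.shuffle(shuffled)
    for row, cell in enumerate(shuffled):
        yield row, (column, cell)

def rebuild_rows(tagged):
    """Second pass: group the tagged cells by row index."""
    rows = defaultdict(dict)
    for row, (column, cell) in tagged:
        rows[row][column] = cell
    return [[cells[c] for c in sorted(cells)] for _, cells in sorted(rows.items())]

rng = random.Random(1)
columns = {0: ["A", "D"], 1: ["B", "E"], 2: ["C", "G"]}
tagged = [record for column, cells in columns.items()
          for record in reduce_column(column, cells, rng)]
result = rebuild_rows(tagged)  # two rows of three cells each
```

The catch is the one Alex names: the first pass already needs each column gathered in a single reduce call, which limits the parallelism.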
> > > > > >
> > > > > > Terrence, please confirm that you've defined your example
> > > > > > correctly.  In the meantime, can someone else confirm that this
> > > > > > problem does not fit well in to the MapReduce paradigm?
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
> > > > > > <tepietrondi@yahoo.com> wrote:
> > > > > >
> > > > > > > I am trying to write a map reduce implementation to do the
> > > > > > > following:
> > > > > > >
> > > > > > > 1) read tabular data delimited in some fashion
> > > > > > > 2) pivot that data, so the rows are columns and the columns are
> > > > > > > rows
> > > > > > > 3) shuffle the rows (that were the columns) to randomize the data
> > > > > > > 4) pivot the data back
> > > > > > >
> > > > > > > For example.....
> > > > > > >
> > > > > > > A|B|C
> > > > > > > D|E|G
> > > > > > >
> > > > > > > pivots to...
> > > > > > >
> > > > > > > D|A
> > > > > > > E|B
> > > > > > > C|G
> > > > > > >
> > > > > > > Then for each row, shuffle the contents around randomly...
> > > > > > >
> > > > > > > D|A
> > > > > > > B|E
> > > > > > > G|C
> > > > > > >
> > > > > > > Then pivot the data back...
> > > > > > >
> > > > > > > A|E|C
> > > > > > > D|B|C
> > > > > > >
> > > > > > > You can reference my progress so far...
> > > > > > >
> > > > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > > > >
> > > > > > > Terrence A. Pietrondi
> > > > > > >
> > > > > > >
> > > > > > > --- On Thu, 10/2/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > > > >
> > > > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > > > Subject: Re: architecture diagram
> > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > > > I think it really depends on the job as to where logic goes.
> > > > > > > >  Sometimes your reduce step is as simple as an identity
> > > > > > > > function, and sometimes it can be more complex than your map
> > > > > > > > step.  It all depends on your data and the operation(s)
> > > > > > > > you're trying to perform.
> > > > > > > >
> > > > > > > > Perhaps we should step out of the abstract.  Do you have a
> > > > > > > > specific problem you're trying to solve?  Can you describe it?
> > > > > > > >
> > > > > > > > Alex
> > > > > > > >
> > > > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > > > > > > > <tepietrondi@yahoo.com> wrote:
> > > > > > > >
> > > > > > > > > I am sorry for the confusion. I meant distributed data.
> > > > > > > > >
> > > > > > > > > So help me out here. For example, if I am reducing to a
> > > > > > > > > single file, then my main transformation logic would be in
> > > > > > > > > my mapping step since I am reducing away from the data?
> > > > > > > > >
> > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --- On Wed, 10/1/08, Alex Loddengaard <alexloddengaard@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > From: Alex Loddengaard <alexloddengaard@gmail.com>
> > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > > > I'm not sure what you mean by "disconnected parts of
> > > > > > > > > > data," but Hadoop is implemented to try and perform map
> > > > > > > > > > tasks on machines that have input data.  This is to lower
> > > > > > > > > > the amount of network traffic, hence making the entire
> > > > > > > > > > job run faster.  Hadoop does all this for you under the
> > > > > > > > > > hood.  From a user's point of view, all you need to do is
> > > > > > > > > > store data in HDFS (the distributed filesystem), and run
> > > > > > > > > > MapReduce jobs on that data.  Take a look here:
> > > > > > > > > >
> > > > > > > > > > <http://wiki.apache.org/hadoop/WordCount>
> > > > > > > > > >
> > > > > > > > > > Alex
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > > > > > > > <tepietrondi@yahoo.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > So to be "distributed" in a sense, you would want to do
> > > > > > > > > > > your computation on the disconnected parts of data in
> > > > > > > > > > > the map phase I would guess?
> > > > > > > > > > >
> > > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy <acm@yahoo-inc.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From: Arun C Murthy <acm@yahoo-inc.com>
> > > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I am trying to plan out my map-reduce
> > > > > > > > > > > > > implementation and I have some questions of where
> > > > > > > > > > > > > computation should be split in order to take
> > > > > > > > > > > > > advantage of the distributed nodes.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking at the architecture diagram
> > > > > > > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > > > > > > are the map boxes the major computation areas or is
> > > > > > > > > > > > > the reduce the major computation area?
> > > > > > > > > > > >
> > > > > > > > > > > > Usually the maps perform the 'embarrassingly
> > > > > > > > > > > > parallel' computational steps, where-in each map
> > > > > > > > > > > > works independently on a 'split' of your input, and
> > > > > > > > > > > > the reduces perform the 'aggregate' computations.
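That division of labor is exactly the shape of the classic word count the thread links to. A minimal single-machine Python illustration (not Hadoop code): each map call handles one input split independently, and the reduce aggregates what the maps emitted.

```python
# Maps: embarrassingly parallel, one split each. Reduce: aggregate.

from collections import defaultdict

def map_split(text):
    for word in text.split():
        yield word, 1

def reduce_counts(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

splits = ["the quick brown fox", "the lazy dog the end"]
pairs = [pair for split in splits for pair in map_split(split)]
counts = reduce_counts(pairs)
# counts["the"] == 3
```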
> > > > > > > > > > > >
> > > > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > > > >
> > > > > > > > > > > > Hadoop implements MapReduce, using the Hadoop
> > > > > > > > > > > > Distributed File System (HDFS). MapReduce divides
> > > > > > > > > > > > applications into many small blocks of work. HDFS
> > > > > > > > > > > > creates multiple replicas of data blocks for
> > > > > > > > > > > > reliability, placing them on compute nodes around the
> > > > > > > > > > > > cluster. MapReduce can then process the data where it
> > > > > > > > > > > > is located.
> > > > > > > > > > > >
> > > > > > > > > > > > The Hadoop Map-Reduce framework is quite good at
> > > > > > > > > > > > scheduling your 'maps' on the actual data-nodes where
> > > > > > > > > > > > the input-blocks are present, leading to i/o
> > > > > > > > > > > > efficiencies...
> > > > > > > > > > > >
> > > > > > > > > > > > Arun
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Terrence A. Pietrondi
