From: "Terrence A. Pietrondi" <tepietrondi@yahoo.com>
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Mon, 6 Oct 2008 06:38:55 -0700 (PDT)
Message-ID: <610101.59089.qm@web35704.mail.mud.yahoo.com>

Can you explain "The location of these splits is semi-arbitrary"? What if the
example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH

Does this mean the split might fall inside CCC, so that the first line ends up
as AAA|BBB|C and C|DDD? Is there a way to control this behavior so that the
split happens on my delimiter?

Terrence A. Pietrondi

--- On Sun, 10/5/08, Alex Loddengaard wrote:

> From: Alex Loddengaard
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Sunday, October 5, 2008, 9:26 PM
>
> Let's say you have one very large input file of the form:
>
> A|B|C|D
> E|F|G|H
> ...
> 1|2|3|4
>
> This input file will be broken up into N pieces, where N is the number of
> mappers that run. The location of these splits is semi-arbitrary. This
> means that unless you have one mapper, you won't be able to see the
> entire contents of a column in your mapper. Given that you would need one
> mapper to see the entirety of a column, you've now essentially reduced
> your problem to a single machine.
>
> You may want to play with the following idea: collect key => column_number
> and value => column_contents in your map step. This means that you would
> be able to see the entirety of a column in your reduce step, though you're
> still faced with the tasks of shuffling and re-pivoting.
>
> Does this clear up your confusion? Let me know if you'd like me to
> clarify more.
>
> Alex
>
> On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi wrote:
>
> > I am not sure why this doesn't fit; maybe you can help me understand.
> > Your previous comment was...
> >
> > "The reason I'm making this claim is because in order to do the pivot
> > operation you must know about every row. Your input files will be split
> > at semi-arbitrary places, essentially making it impossible for each
> > mapper to know every single row."
> >
> > Are you saying that my row segments might not actually be the entire
> > row, so I will get a bad key index? If so, how would the row segments be
> > determined? I based my initial work off of the word count example, where
> > the lines are tokenized. Does this mean in this example the row tokens
> > may not be the complete row?
> >
> > Thanks.
> >
> > Terrence A. Pietrondi
> >
> > --- On Fri, 10/3/08, Alex Loddengaard wrote:
> >
> > > From: Alex Loddengaard
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Friday, October 3, 2008, 7:14 PM
> > >
> > > The approach that you've described does not fit well into the
> > > MapReduce paradigm. You may want to consider randomizing your data in
> > > a different way.
> > >
> > > Unfortunately some things can't be solved well with MapReduce, and I
> > > think this is one of them.
> > >
> > > Can someone else say more?
> > >
> > > Alex
> > >
> > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi wrote:
> > >
> > > > Sorry for the confusion, I did make some typos. My example should
> > > > have looked like...
> > > >
> > > > > A|B|C
> > > > > D|E|G
> > > > >
> > > > > pivots to...
> > > > >
> > > > > D|A
> > > > > E|B
> > > > > G|C
> > > > >
> > > > > Then for each row, shuffle the contents around randomly...
> > > > >
> > > > > D|A
> > > > > B|E
> > > > > C|G
> > > > >
> > > > > Then pivot the data back...
> > > > >
> > > > > A|E|G
> > > > > D|B|C
> > > >
> > > > The general goal is to shuffle the elements in each column of the
> > > > input data.
> > > > Meaning, the ordering of the elements in each column will not be
> > > > the same as in the input.
> > > >
> > > > If you look at the initial input and compare it to the final
> > > > output, you'll see that during the shuffling, B and E are swapped
> > > > and G and C are swapped, while A and D were shuffled back into
> > > > their originating positions in the column.
> > > >
> > > > Once again, sorry for the typos and confusion.
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > > --- On Fri, 10/3/08, Alex Loddengaard wrote:
> > > >
> > > > > From: Alex Loddengaard
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > >
> > > > > Can you confirm that the example you've presented is accurate? I
> > > > > think you may have made some typos, because the letter "G" isn't
> > > > > in the final result; I also think your first pivot accidentally
> > > > > swapped C and G. I'm having a hard time understanding what you
> > > > > want to do, because it seems like your operations differ from
> > > > > your example.
> > > > >
> > > > > With that said, at first glance, this problem may not fit well
> > > > > into the MapReduce paradigm. The reason I'm making this claim is
> > > > > because in order to do the pivot operation you must know about
> > > > > every row. Your input files will be split at semi-arbitrary
> > > > > places, essentially making it impossible for each mapper to know
> > > > > every single row. There may be a way to do this by collecting,
> > > > > in your map step, key => column number (0, 1, 2, etc.) and value
> > > > > => (A, B, C, etc.), though you may run into problems when you
> > > > > try to pivot back.
> > > > > I say this because when you pivot back, you need to have each
> > > > > column, which means you'll need one reduce step. There may be a
> > > > > way to put the pivot-back operation in a second iteration,
> > > > > though I don't think that would help you.
> > > > >
> > > > > Terrence, please confirm that you've defined your example
> > > > > correctly. In the meantime, can someone else confirm that this
> > > > > problem does not fit well into the MapReduce paradigm?
> > > > >
> > > > > Alex
> > > > >
> > > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
> > > > > <tepietrondi@yahoo.com> wrote:
> > > > >
> > > > > > I am trying to write a map reduce implementation to do the
> > > > > > following:
> > > > > >
> > > > > > 1) read tabular data delimited in some fashion
> > > > > > 2) pivot that data, so the rows are columns and the columns
> > > > > >    are rows
> > > > > > 3) shuffle the rows (that were the columns) to randomize the
> > > > > >    data
> > > > > > 4) pivot the data back
> > > > > >
> > > > > > For example.....
> > > > > >
> > > > > > A|B|C
> > > > > > D|E|G
> > > > > >
> > > > > > pivots to...
> > > > > >
> > > > > > D|A
> > > > > > E|B
> > > > > > C|G
> > > > > >
> > > > > > Then for each row, shuffle the contents around randomly...
> > > > > >
> > > > > > D|A
> > > > > > B|E
> > > > > > G|C
> > > > > >
> > > > > > Then pivot the data back...
> > > > > >
> > > > > > A|E|C
> > > > > > D|B|C
> > > > > >
> > > > > > You can reference my progress so far...
> > > > > >
> > > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > > >
> > > > > > Terrence A. Pietrondi
> > > > > >
> > > > > > --- On Thu, 10/2/08, Alex Loddengaard wrote:
> > > > > >
> > > > > > > From: Alex Loddengaard
> > > > > > > Subject: Re: architecture diagram
> > > > > > > To: core-user@hadoop.apache.org
> > > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > >
> > > > > > > I think it really depends on the job as to where logic goes.
> > > > > > > Sometimes your reduce step is as simple as an identity
> > > > > > > function, and sometimes it can be more complex than your map
> > > > > > > step. It all depends on your data and the operation(s)
> > > > > > > you're trying to perform.
> > > > > > >
> > > > > > > Perhaps we should step out of the abstract. Do you have a
> > > > > > > specific problem you're trying to solve? Can you describe
> > > > > > > it?
> > > > > > >
> > > > > > > Alex
> > > > > > >
> > > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am sorry for the confusion. I meant distributed data.
> > > > > > > >
> > > > > > > > So help me out here. For example, if I am reducing to a
> > > > > > > > single file, then my main transformation logic would be
> > > > > > > > in my mapping step, since I am reducing away from the
> > > > > > > > data?
> > > > > > > >
> > > > > > > > Terrence A. Pietrondi
> > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > >
> > > > > > > > --- On Wed, 10/1/08, Alex Loddengaard wrote:
> > > > > > > >
> > > > > > > > > From: Alex Loddengaard
> > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > >
> > > > > > > > > I'm not sure what you mean by "disconnected parts of
> > > > > > > > > data," but Hadoop is implemented to try and perform map
> > > > > > > > > tasks on machines that have input data. This is to
> > > > > > > > > lower the amount of network traffic, hence making the
> > > > > > > > > entire job run faster. Hadoop does all this for you
> > > > > > > > > under the hood. From a user's point of view, all you
> > > > > > > > > need to do is store data in HDFS (the distributed
> > > > > > > > > filesystem), and run MapReduce jobs on that data. Take
> > > > > > > > > a look here:
> > > > > > > > >
> > > > > > > > > Alex
> > > > > > > > >
> > > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > So to be "distributed" in a sense, you would want to
> > > > > > > > > > do your computation on the disconnected parts of data
> > > > > > > > > > in the map phase, I would guess?
> > > > > > > > > >
> > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > >
> > > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Arun C Murthy
> > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > >
> > > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I am trying to plan out my map-reduce
> > > > > > > > > > > > implementation and I have some questions of
> > > > > > > > > > > > where computation should be split in order to
> > > > > > > > > > > > take advantage of the distributed nodes.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking at the architecture diagram
> > > > > > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > > > > > are the map boxes the major computation areas,
> > > > > > > > > > > > or is the reduce the major computation area?
> > > > > > > > > > >
> > > > > > > > > > > Usually the maps perform the 'embarrassingly
> > > > > > > > > > > parallel' computational steps, wherein each map
> > > > > > > > > > > works independently on a 'split' of your input, and
> > > > > > > > > > > the reduces perform the 'aggregate' computations.
> > > > > > > > > > >
> > > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > > >
> > > > > > > > > > > Hadoop implements MapReduce, using the Hadoop
> > > > > > > > > > > Distributed File System (HDFS). MapReduce divides
> > > > > > > > > > > applications into many small blocks of work. HDFS
> > > > > > > > > > > creates multiple replicas of data blocks for
> > > > > > > > > > > reliability, placing them on compute nodes around
> > > > > > > > > > > the cluster. MapReduce can then process the data
> > > > > > > > > > > where it is located.
> > > > > > > > > > >
> > > > > > > > > > > The Hadoop Map-Reduce framework is quite good at
> > > > > > > > > > > scheduling your 'maps' on the actual data-nodes
> > > > > > > > > > > where the input-blocks are present, leading to i/o
> > > > > > > > > > > efficiencies...
> > > > > > > > > > >
> > > > > > > > > > > Arun
> > > > > > > > > > >
> > > > > > > > > > > > Thanks.
> > > > > > > > > > > >
> > > > > > > > > > > > Terrence A. Pietrondi
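
[Archive note: the column-shuffle pipeline discussed in this thread (map emits
column number as key, so each reduce call sees one whole column, shuffles it,
and the columns are then pivoted back into rows) can be sketched as a
single-process simulation. This is a minimal sketch only: it assumes Python
rather than the Hadoop Java API, and the function names (map_step, reduce_step,
run_job) are invented for illustration, not taken from Hadoop or the csvdatamix
code.]

```python
import random
from collections import defaultdict

def map_step(line):
    """Emit (column_index, cell) pairs for one delimited row."""
    return list(enumerate(line.split("|")))

def reduce_step(col_index, cells):
    """Each 'reducer' sees an entire column, so it can shuffle it."""
    shuffled = cells[:]
    random.shuffle(shuffled)
    return col_index, shuffled

def run_job(lines):
    # Simulate the shuffle-and-sort phase: group map outputs by key
    # (the column index), as the MapReduce framework would.
    grouped = defaultdict(list)
    for line in lines:
        for col, cell in map_step(line):
            grouped[col].append(cell)
    # Run the reduce step per column.
    columns = dict(reduce_step(c, cells) for c, cells in grouped.items())
    # Pivot back: the shuffled column lists become rows again.
    rows = zip(*(columns[c] for c in sorted(columns)))
    return ["|".join(r) for r in rows]

# Each output column is a random permutation of the input column.
print(run_job(["A|B|C", "D|E|G"]))
```

In real Hadoop the grouping above is done by the framework between the map and
reduce phases; the part this sketch cannot hide is the one Alex points out:
pivoting the shuffled columns back into rows still requires seeing every
column in one place.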