From: "Terrence A. Pietrondi" <tepietrondi@yahoo.com>
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Mon, 6 Oct 2008 06:38:55 -0700 (PDT)
Message-ID: <610101.59089.qm@web35704.mail.mud.yahoo.com>

Can you explain "The location of these splits is semi-arbitrary"? What if the
example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH

Does this mean the split might fall inside CCC, so that the first line ends up
as AAA|BBB|C and C|DDD? Is there a way to control this behavior so that the
split happens on my delimiter?

Terrence A. Pietrondi

--- On Sun, 10/5/08, Alex Loddengaard wrote:

> From: Alex Loddengaard
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Sunday, October 5, 2008, 9:26 PM
>
> Let's say you have one very large input file of the form:
>
> A|B|C|D
> E|F|G|H
> ...
> 1|2|3|4
>
> This input file will be broken up into N pieces, where N is the number of
> mappers that run. The location of these splits is semi-arbitrary. This
> means that unless you have one mapper, you won't be able to see the
> entire contents of a column in your mapper. Given that you would need one
> mapper to see the entirety of a column, you've now essentially reduced
> your problem to a single machine.
>
> You may want to play with the following idea: collect key => column_number
> and value => column_contents in your map step. This means that you would
> be able to see the entirety of a column in your reduce step, though you're
> still faced with the tasks of shuffling and re-pivoting.
>
> Does this clear up your confusion? Let me know if you'd like me to
> clarify more.
>
> Alex
>
> On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi wrote:
>
> > I am not sure why this doesn't fit; maybe you can help me understand.
> > Your previous comment was...
> >
> > "The reason I'm making this claim is because in order to do the pivot
> > operation you must know about every row. Your input files will be split
> > at semi-arbitrary places, essentially making it impossible for each
> > mapper to know every single row."
> >
> > Are you saying that my row segments might not actually be the entire
> > row, so I will get a bad key index? If so, how would the row segments be
> > determined? I based my initial work off of the word count example, where
> > the lines are tokenized. Does this mean in this example the row tokens
> > may not be the complete row?
> >
> > Thanks.
> >
> > Terrence A. Pietrondi
> >
> > --- On Fri, 10/3/08, Alex Loddengaard wrote:
> >
> > > From: Alex Loddengaard
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Friday, October 3, 2008, 7:14 PM
> > >
> > > The approach that you've described does not fit well into the
> > > MapReduce paradigm. You may want to consider randomizing your data in
> > > a different way.
> > >
> > > Unfortunately some things can't be solved well with MapReduce, and I
> > > think this is one of them.
> > >
> > > Can someone else say more?
> > >
> > > Alex
> > >
> > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi wrote:
> > >
> > > > Sorry for the confusion, I did make some typos. My example should
> > > > have looked like...
> > > >
> > > > > A|B|C
> > > > > D|E|G
> > > > >
> > > > > pivots to...
> > > > >
> > > > > D|A
> > > > > E|B
> > > > > G|C
> > > > >
> > > > > Then for each row, shuffle the contents around randomly...
> > > > >
> > > > > D|A
> > > > > B|E
> > > > > C|G
> > > > >
> > > > > Then pivot the data back...
> > > > >
> > > > > A|E|G
> > > > > D|B|C
> > > >
> > > > The general goal is to shuffle the elements in each column of the
> > > > input data.
> > > > Meaning, the ordering of the elements in each column will not be
> > > > the same as in the input.
> > > >
> > > > If you look at the initial input and compare it to the final
> > > > output, you'll see that during the shuffling, B and E are swapped
> > > > and G and C are swapped, while A and D were shuffled back into
> > > > their originating positions in the column.
> > > >
> > > > Once again, sorry for the typos and confusion.
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > > --- On Fri, 10/3/08, Alex Loddengaard wrote:
> > > >
> > > > > From: Alex Loddengaard
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > >
> > > > > Can you confirm that the example you've presented is accurate? I
> > > > > think you may have made some typos, because the letter "G" isn't
> > > > > in the final result; I also think your first pivot accidentally
> > > > > swapped C and G. I'm having a hard time understanding what you
> > > > > want to do, because it seems like your operations differ from
> > > > > your example.
> > > > >
> > > > > With that said, at first glance, this problem may not fit well
> > > > > into the MapReduce paradigm. The reason I'm making this claim is
> > > > > because in order to do the pivot operation you must know about
> > > > > every row. Your input files will be split at semi-arbitrary
> > > > > places, essentially making it impossible for each mapper to know
> > > > > every single row. There may be a way to do this by collecting,
> > > > > in your map step, key => column number (0, 1, 2, etc.) and value
> > > > > => (A, B, C, etc.), though you may run into problems when you
> > > > > try to pivot back.
> > > > > I say this because when you pivot back, you need to have each
> > > > > column, which means you'll need one reduce step. There may be a
> > > > > way to put the pivot-back operation in a second iteration,
> > > > > though I don't think that would help you.
> > > > >
> > > > > Terrence, please confirm that you've defined your example
> > > > > correctly. In the meantime, can someone else confirm that this
> > > > > problem does not fit well into the MapReduce paradigm?
> > > > >
> > > > > Alex
> > > > >
> > > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
> > > > > <tepietrondi@yahoo.com> wrote:
> > > > >
> > > > > > I am trying to write a map reduce implementation to do the
> > > > > > following:
> > > > > >
> > > > > > 1) read tabular data delimited in some fashion
> > > > > > 2) pivot that data, so the rows are columns and the columns
> > > > > >    are rows
> > > > > > 3) shuffle the rows (that were the columns) to randomize the
> > > > > >    data
> > > > > > 4) pivot the data back
> > > > > >
> > > > > > For example.....
> > > > > >
> > > > > > A|B|C
> > > > > > D|E|G
> > > > > >
> > > > > > pivots to...
> > > > > >
> > > > > > D|A
> > > > > > E|B
> > > > > > C|G
> > > > > >
> > > > > > Then for each row, shuffle the contents around randomly...
> > > > > >
> > > > > > D|A
> > > > > > B|E
> > > > > > G|C
> > > > > >
> > > > > > Then pivot the data back...
> > > > > >
> > > > > > A|E|C
> > > > > > D|B|C
> > > > > >
> > > > > > You can reference my progress so far...
> > > > > >
> > > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > > >
> > > > > > Terrence A. Pietrondi
> > > > > >
> > > > > > --- On Thu, 10/2/08, Alex Loddengaard wrote:
> > > > > >
> > > > > > > From: Alex Loddengaard
> > > > > > > Subject: Re: architecture diagram
> > > > > > > To: core-user@hadoop.apache.org
> > > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > >
> > > > > > > I think it really depends on the job as to where logic goes.
> > > > > > > Sometimes your reduce step is as simple as an identity
> > > > > > > function, and sometimes it can be more complex than your map
> > > > > > > step. It all depends on your data and the operation(s)
> > > > > > > you're trying to perform.
> > > > > > >
> > > > > > > Perhaps we should step out of the abstract. Do you have a
> > > > > > > specific problem you're trying to solve? Can you describe
> > > > > > > it?
> > > > > > >
> > > > > > > Alex
> > > > > > >
> > > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am sorry for the confusion. I meant distributed data.
> > > > > > > >
> > > > > > > > So help me out here. For example, if I am reducing to a
> > > > > > > > single file, then my main transformation logic would be
> > > > > > > > in my mapping step, since I am reducing away from the
> > > > > > > > data?
> > > > > > > >
> > > > > > > > Terrence A. Pietrondi
> > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > >
> > > > > > > > --- On Wed, 10/1/08, Alex Loddengaard wrote:
> > > > > > > >
> > > > > > > > > From: Alex Loddengaard
> > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > >
> > > > > > > > > I'm not sure what you mean by "disconnected parts of
> > > > > > > > > data," but Hadoop is implemented to try and perform map
> > > > > > > > > tasks on machines that have input data. This is to
> > > > > > > > > lower the amount of network traffic, hence making the
> > > > > > > > > entire job run faster. Hadoop does all this for you
> > > > > > > > > under the hood. From a user's point of view, all you
> > > > > > > > > need to do is store data in HDFS (the distributed
> > > > > > > > > filesystem), and run MapReduce jobs on that data. Take
> > > > > > > > > a look here:
> > > > > > > > >
> > > > > > > > > Alex
> > > > > > > > >
> > > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > So to be "distributed" in a sense, you would want to
> > > > > > > > > > do your computation on the disconnected parts of data
> > > > > > > > > > in the map phase, I would guess?
> > > > > > > > > >
> > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > >
> > > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Arun C Murthy
> > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > >
> > > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I am trying to plan out my map-reduce
> > > > > > > > > > > > implementation and I have some questions of
> > > > > > > > > > > > where computation should be split in order to
> > > > > > > > > > > > take advantage of the distributed nodes.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking at the architecture diagram
> > > > > > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > > > > > are the map boxes the major computation areas,
> > > > > > > > > > > > or is the reduce the major computation area?
> > > > > > > > > > >
> > > > > > > > > > > Usually the maps perform the 'embarrassingly
> > > > > > > > > > > parallel' computational steps, wherein each map
> > > > > > > > > > > works independently on a 'split' of your input, and
> > > > > > > > > > > the reduces perform the 'aggregate' computations.
> > > > > > > > > > >
> > > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > > >
> > > > > > > > > > > Hadoop implements MapReduce, using the Hadoop
> > > > > > > > > > > Distributed File System (HDFS). MapReduce divides
> > > > > > > > > > > applications into many small blocks of work. HDFS
> > > > > > > > > > > creates multiple replicas of data blocks for
> > > > > > > > > > > reliability, placing them on compute nodes around
> > > > > > > > > > > the cluster. MapReduce can then process the data
> > > > > > > > > > > where it is located.
> > > > > > > > > > >
> > > > > > > > > > > The Hadoop Map-Reduce framework is quite good at
> > > > > > > > > > > scheduling your 'maps' on the actual data-nodes
> > > > > > > > > > > where the input-blocks are present, leading to i/o
> > > > > > > > > > > efficiencies...
> > > > > > > > > > >
> > > > > > > > > > > Arun
> > > > > > > > > > >
> > > > > > > > > > > > Thanks.
> > > > > > > > > > > >
> > > > > > > > > > > > Terrence A. Pietrondi
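
[Archive note: the column-shuffle pipeline discussed in this thread (map emits
column number as key, so each reduce call sees one whole column, shuffles it,
and the columns are then pivoted back into rows) can be sketched as a
single-process simulation. This is a minimal sketch only: it assumes Python
rather than the Hadoop Java API, and the function names (map_step, reduce_step,
run_job) are invented for illustration, not taken from Hadoop or the csvdatamix
code.]

```python
import random
from collections import defaultdict

def map_step(line):
    """Emit (column_index, cell) pairs for one delimited row."""
    return list(enumerate(line.split("|")))

def reduce_step(col_index, cells):
    """Each 'reducer' sees an entire column, so it can shuffle it."""
    shuffled = cells[:]
    random.shuffle(shuffled)
    return col_index, shuffled

def run_job(lines):
    # Simulate the shuffle-and-sort phase: group map outputs by key
    # (the column index), as the MapReduce framework would.
    grouped = defaultdict(list)
    for line in lines:
        for col, cell in map_step(line):
            grouped[col].append(cell)
    # Run the reduce step per column.
    columns = dict(reduce_step(c, cells) for c, cells in grouped.items())
    # Pivot back: the shuffled column lists become rows again.
    rows = zip(*(columns[c] for c in sorted(columns)))
    return ["|".join(r) for r in rows]

# Each output column is a random permutation of the input column.
print(run_job(["A|B|C", "D|E|G"]))
```

In real Hadoop the grouping above is done by the framework between the map and
reduce phases; the part this sketch cannot hide is the one Alex points out:
pivoting the shuffled columns back into rows still requires seeing every
column in one place.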