Subject: Re: joining two large files in hadoop
From: nitesh bhatia <niteshbhatia008@gmail.com>
To: core-user@hadoop.apache.org
Date: Mon, 6 Apr 2009 01:35:29 +0530

Hi,

Pig (a Hadoop subproject) can be a good option for this kind of problem. I
suggest you take a look.

--nitesh

On Sun, Apr 5, 2009 at 11:32 PM, jason hadoop wrote:
> Alpha chapters are available, and chapter 8 should be available in the
> alphas as soon as draft one gets back from technical review.
>
> On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup wrote:
>
>> jason hadoop wrote:
>>
>>> This is discussed in chapter 8 of my book.
>>>
>> What book? Is it out?
>>
>>> In short, if both data sets are:
>>>
>>>   - in the same key order,
>>>   - partitioned with the same partitioner, and
>>>   - read with the same input format (necessary for this simple
>>>     example only),
>>>
>>> then a map-side join will present all the key/value pairs of each
>>> partition to a single map task, in key order.
>>>
>>> Path dir1 == the directory containing the part-XXXXX files for data set 1
>>> Path dir2 == the directory containing the part-XXXXX files for data set 2
>>>
>>> Use CompositeInputFormat.compose to build the join statement, and set
>>> the InputFormat to CompositeInputFormat:
>>>
>>> conf.setInputFormat(CompositeInputFormat.class);
>>> String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
>>> conf.set("mapred.join.expr", joinStatement);
>>>
>>> The value class for your map method will be TupleWritable. In the map
>>> method:
>>>
>>>   - value.has(x) indicates whether the xth ordinal data set has a value
>>>     for this key
>>>   - value.get(x) returns the value from the xth ordinal data set for
>>>     this key
>>>   - value.size() returns the number of data sets in the join
>>>
>>> In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
>>>
>> The partitioner is normally used for the reduce step, but here it will be
>> used already at the mapper stage?
>>
>> Basically my files look like:
>>
>> id    matrix
>> id2   anothermatrix
>>
>> and
>>
>> id    vector1
>> id    vector2
>> id2   vector3
>>
>> id is just an integer, and there is only one matrix but many vectors tied
>> to the same id. I just want the values from both files that have the same
>> id. Do I need a partitioner in this case? What happens if the file is
>> split into blocks such that two blocks contain entries with the same key?
>>
>> Am I right that, using the example above, the mapper will be called three
>> times with:
>>
>> key=id   tuple=(matrix,vector1)
>> key=id   tuple=(matrix,vector2)
>> key=id2  tuple=(anothermatrix,vector3)
>>
>> cheers,
>> Christian
>>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar, Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
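
[For reference, a minimal, self-contained sketch of the map-side join
described above, using the old org.apache.hadoop.mapred join API this
thread refers to. The class name, paths, the SequenceFile input format,
and the IntWritable/Text key and value types are illustrative assumptions,
not details given in the thread. Note that CompositeInputFormat.compose
also takes the underlying InputFormat class as its second argument, which
the abbreviated snippet quoted above omits.]

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MatrixVectorJoin {

  // Each map call receives one TupleWritable per matching key combination;
  // ordinal 0 is dir1 and ordinal 1 is dir2, matching the compose() order.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<IntWritable, TupleWritable, IntWritable, Text> {
    public void map(IntWritable key, TupleWritable value,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {
      // An inner join should always populate both ordinals, but has()
      // makes the intent explicit (an "outer" join may leave gaps).
      if (value.has(0) && value.has(1)) {
        output.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MatrixVectorJoin.class);
    conf.setJobName("matrix-vector-join");

    Path dir1 = new Path(args[0]); // part-XXXXX files for data set 1
    Path dir2 = new Path(args[1]); // part-XXXXX files for data set 2

    // Build the join expression; SequenceFileInputFormat is an assumed
    // input format for the two (identically partitioned, sorted) data sets.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, dir1, dir2));

    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0); // the join happens map-side; no reduce needed
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}

[With mapred.join.expr set this way, Christian's example would invoke map
three times, once per (matrix, vector) pair sharing an id, exactly as in
the quoted explanation.]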