From: "GOEKE, MATTHEW (AG/1000)"
To: "common-user@hadoop.apache.org", "mapreduce-user@hadoop.apache.org"
CC: "GOEKE, MATTHEW (AG/1000)"
Subject: Issue with MR code not scaling correctly with data sizes
Date: Thu, 14 Jul 2011 22:14:29 +0000

All,

I have an MR program that takes a list of IDs and generates the unique comparison set as a result. Example: if I have the list {1,2,3,4,5}, the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, or (n^2-n)/2 comparisons.
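To make the output size concrete, here is a small Python sketch. It is not the MR job itself, and the function names `pair_count` and `unique_pairs` are made up for illustration. It enumerates the comparison set for the five-ID example and evaluates (n^2-n)/2 at the set sizes mentioned in this message.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unique comparisons for n distinct IDs: (n^2 - n) / 2."""
    return (n * n - n) // 2


def unique_pairs(ids):
    """Return every unordered pair exactly once, written as (larger, smaller)."""
    return [(b, a) for a, b in combinations(sorted(ids), 2)]


if __name__ == "__main__":
    # The five-ID example: same set as in the message, possibly in another order.
    print(["%dx%d" % pair for pair in unique_pairs([1, 2, 3, 4, 5])])

    # Output volume at the sizes discussed (small test, failing range, end goal).
    for n in (1_000, 20_000, 10_000_000):
        print(n, pair_count(n))
```

The count grows quadratically, so at 10 million IDs it is on the order of 5 x 10^13 pairs; the output, not the input, is what has to scale.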
My code works fine on smaller sets (I can easily verify fewer than 1,000 IDs) but fails when I try to push the set to 10-20k IDs, which is annoying when the end goal is 1-10 million.

The flow of the program is:
1) Partition the IDs evenly, based on the amount of output per value, into a set of keys equal to the number of reduce slots we currently have
2) Use the distributed cache to push the ID file out to the various reducers
3) In the setup of the reducer, populate an int array with the values from the ID file in the distributed cache
4) Output a comparison only if the current ID from the values iterator is greater than the current element in the iteration over the int array

I realize that this could be done many other ways, but this will be part of an Oozie workflow, so it made sense to just do it in MR for now. My issue is that when I try the larger ID files, the job outputs only part of the resulting data set, and there are no errors to be found. Part of me thinks that I need to tweak some site configuration properties, due to the size of the data that is spilling to disk, but after scanning through all 3 site configuration files I am having trouble pinpointing anything that could be causing this. I moved from reading the file from HDFS to using the distributed cache for the join read, thinking that might solve my problem, but there seems to be something else I am overlooking.

Any advice is greatly appreciated!

Matt
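For anyone who wants to reason about steps 1 and 4 of the flow outside the cluster, here is a rough local sketch in plain Python (no Hadoop involved). The names `partition_by_output` and `reduce_pairs` are hypothetical, and the greedy balancing is only an assumption about what "partition evenly, based on amount of output per value" could look like; it also assumes the IDs are distinct, so the ID at sorted rank r emits exactly r comparisons.

```python
import heapq


def partition_by_output(ids, num_reducers):
    """Assign IDs to reducer keys so each key emits a similar number of pairs.

    Assumption: IDs are distinct, so the ID at sorted rank r (0-based) emits
    r comparisons. The heaviest IDs are placed first, each on the key with
    the smallest load so far (a greedy heuristic, not an exact optimum).
    """
    ordered = sorted(ids)
    heap = [(0, key) for key in range(num_reducers)]  # (current load, key)
    heapq.heapify(heap)
    buckets = {key: [] for key in range(num_reducers)}
    for rank in range(len(ordered) - 1, -1, -1):
        load, key = heapq.heappop(heap)
        buckets[key].append(ordered[rank])
        heapq.heappush(heap, (load + rank, key))
    return buckets


def reduce_pairs(assigned_ids, all_ids):
    """Step 4: emit (current, other) only when current is greater than other."""
    for current in assigned_ids:
        for other in all_ids:
            if current > other:
                yield (current, other)


if __name__ == "__main__":
    ids = list(range(1, 101))
    buckets = partition_by_output(ids, 4)
    # Print how many comparisons each simulated reducer key would emit.
    for key, assigned in buckets.items():
        print(key, sum(1 for _ in reduce_pairs(assigned, ids)))
```

Summing the per-key counts and comparing the total against (n^2-n)/2 is a cheap way to check, on a small input, whether any output is missing before suspecting the cluster configuration.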