From: Ted Dunning
Date: Sat, 30 Apr 2011 13:30:52 -0700
Subject: Re: questions about hadoop map reduce and compute intensive related applications
To: common-user@hadoop.apache.org
Cc: general@hadoop.apache.org

On Sat, Apr 30, 2011 at 12:18 AM, elton sky wrote:

> I have two questions:
>
> 1. I am wondering how Hadoop MR performs when it runs compute-intensive
> applications, e.g. a Monte Carlo method to compute pi. There is an example
> in 0.21, QuasiMonteCarlo, but that example does not use random numbers; it
> generates pseudo input up front. If we used distributed random number
> generation instead, I would guess the performance of Hadoop should be
> similar to a message-passing framework like MPI. So my guess is that, with
> the right approach, Hadoop would be competitive with MPI for
> compute-intensive applications.

Not quite sure what algorithms you mean here, but for trivial parallelism,
map-reduce is a fine way to go (a rough sketch of that pattern is below).
MPI supports node-to-node communication in ways that map-reduce does not,
however, which means you have to iterate map-reduce steps for many
algorithms. With Hadoop's current implementation, that is horrendously slow
(a minimum of 20-30 seconds per iteration).

Sometimes you can avoid this with clever tricks. For instance, random
projection can compute the key step of an SVD decomposition with one
map-reduce, while the comparable Lanczos algorithm requires more than one
step per eigenvector (and we often want 100 of them!). Sometimes, however,
there are no known algorithms that avoid the need for repeated
communication. For these problems, Hadoop as it stands may be a poor fit.
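For the Monte Carlo case you mention, the trivially parallel version would
look roughly like the sketch below: each map task draws its own random
points and counts the ones that land inside the unit circle, and a single
reducer sums the counts. The input format (a sequence file of
(seed, sample-count) pairs prepared by the driver) and the class names are
my assumptions for illustration, not the shipped QuasiMonteCarlo code.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomPiEstimator {

  // Each input record is a (seed, numSamples) pair written by the driver.
  public static class SampleMapper
      extends Mapper<LongWritable, LongWritable, Text, LongWritable> {
    private static final Text INSIDE = new Text("inside");
    private static final Text TOTAL = new Text("total");

    @Override
    protected void map(LongWritable seed, LongWritable numSamples, Context ctx)
        throws IOException, InterruptedException {
      Random rng = new Random(seed.get());   // independent stream per task
      long n = numSamples.get();
      long inside = 0;
      for (long i = 0; i < n; i++) {
        double x = 2 * rng.nextDouble() - 1;
        double y = 2 * rng.nextDouble() - 1;
        if (x * x + y * y <= 1.0) {
          inside++;
        }
      }
      ctx.write(INSIDE, new LongWritable(inside));
      ctx.write(TOTAL, new LongWritable(n));
    }
  }

  // A single reducer sums the per-task counts; pi is then 4 * inside / total.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) {
        sum += c.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }
}

The point is that the map tasks never talk to each other at all, which is
exactly why map-reduce handles this kind of job well.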
Help is on the way for those iterative problems, however: the MapReduce 2.0
work will allow much more flexible models of computation.

> 2. I am looking for applications which have large data sets and require
> intensive computation. Such an application can be divided into a workflow
> including both map-reduce operations and message-passing-like operations.
> For example, in step 1 I use Hadoop MR to process 10 TB of data and
> generate a small output, say 10 GB. This 10 GB can fit into memory and is
> better processed with some interprocess communication, which would boost
> performance. So in step 2 I would use MPI, etc.

Some machine learning algorithms require features that are much smaller
than the original input. This leads to exactly the pattern you describe.
Integrating MPI with map-reduce is currently difficult and/or very ugly,
however. Not impossible, and there are hackish ways to do the job, but they
are hacks.
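For concreteness, here is a minimal sketch of that two-stage shape: stage 1
is an ordinary Hadoop job (omitted here) that boils the 10 TB down to a
small summary, and stage 2 pulls that small output off HDFS into memory for
whatever tightly coupled code runs the second phase. The driver class, the
paths, and the local "second phase" call are assumptions for illustration;
actually launching MPI from here is exactly the part that is currently
hackish.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoStageDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path reducedOutput = new Path(args[0]);   // output dir of the stage-1 job

    // Stage 1: the big map-reduce job runs elsewhere and writes its small,
    // reduced output (e.g. one feature vector per line) under reducedOutput.

    // Stage 2: the reduced output is small enough to read into memory.
    FileSystem fs = FileSystem.get(conf);
    List<String> features = new ArrayList<String>();
    for (FileStatus part : fs.listStatus(reducedOutput)) {
      if (!part.getPath().getName().startsWith("part-")) {
        continue;                             // skip _SUCCESS, _logs, etc.
      }
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(part.getPath())));
      String line;
      while ((line = in.readLine()) != null) {
        features.add(line);
      }
      in.close();
    }

    // Hand the in-memory data to the communication-heavy phase. Here that is
    // just a local method; in practice it might mean writing a file for
    // mpirun or starting the second framework by hand -- the ugly part.
    runTightlyCoupledPhase(features);
  }

  private static void runTightlyCoupledPhase(List<String> features) {
    System.out.println("stage 2 received " + features.size() + " records");
  }
}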