Subject: Re: Question about the task assignment strategy
From: Hiroyuki Yamada
To: user@hadoop.apache.org
Date: Wed, 12 Sep 2012 01:20:53 +0900

Hi, thank you for the comment.

> Task assignment takes data locality into account first and not block
> sequence.

Does it work like that when the replication factor is set to 1?

I just ran an experiment to check the behavior.
There are 14 nodes (node01 to node14), with 14 datanodes and 14
tasktrackers running. I first created a dataset to be processed on each
node (say data01 to data14), and I put each dataset into HDFS from its
own node (under the /data directory: /data/data01, ..., /data/data14).
The replication factor is set to 1, so according to the default block
placement policy, each dataset is stored on its local node (data01 is
stored on node01, data02 on node02, and so on).

In that setting, I launched a job that processes /data, and what
happened is that the tasktrackers read from data01 to data14
sequentially: the tasktrackers first take all the data from node01, then
node02, then node03, and so on. If the tasktrackers took data locality
into account as you say, each tasktracker should take its local tasks
(data); for example, the tasktracker on node02 should take data02's
blocks if there are any. But it didn't work like that.
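For reference, creating each file with a single replica through the
FileSystem API looks roughly like this (an illustrative sketch, not my
actual loading code; the path and buffer size are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadLocalData {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Create /data/data01 with replication factor 1; under the
            // default block placement policy the single replica is written
            // to the datanode on the node running this client.
            FSDataOutputStream out = fs.create(new Path("/data/data01"),
                    true,        // overwrite
                    4096,        // buffer size (placeholder)
                    (short) 1,   // replication factor
                    fs.getDefaultBlockSize());
            out.writeBytes("sample record\n");  // node-local data goes here
            out.close();
            fs.close();
        }
    }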
Why is this happening? Are there any documents about this? What part of
the source code is doing this?

Regards,
Hiroyuki

On Tue, Sep 11, 2012 at 11:27 PM, Hemanth Yamijala wrote:
> Hi,
>
> Task assignment takes data locality into account first and not block
> sequence. In Hadoop, tasktrackers ask the jobtracker to be assigned
> tasks. When such a request comes to the jobtracker, it will try to look
> for an unassigned task that needs data close to the tasktracker and
> will assign it.
>
> Thanks
> Hemanth
>
>
> On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada wrote:
>>
>> Hi,
>>
>> I want to make sure whether my understanding of task assignment in
>> Hadoop is correct or not.
>>
>> When scanning a file with multiple tasktrackers,
>> I am wondering how a task is assigned to each tasktracker.
>> Is it based on the block sequence or on data locality?
>>
>> Let me explain my question with an example.
>> There is a file composed of 10 blocks (block1 to block10), where
>> block1 is the beginning of the file and block10 is the tail.
>> When scanning the file with 3 tasktrackers (tt1 to tt3),
>> I am wondering whether
>> task assignment is based on the block sequence, like:
>> first tt1 takes block1, tt2 takes block2, tt3 takes block3,
>> then tt1 takes block4, and so on;
>> or
>> task assignment is based on task (data) locality, like:
>> first tt1 takes block2 (because it is stored locally), tt2 takes
>> block1 (because it is stored locally), tt3 takes block4 (because it
>> is stored locally), and so on.
>>
>> As far as I have experienced, and as the definitive guide says,
>> I think the first case is the task assignment strategy
>> (and if there are multiple replicas, the closest one is picked).
>>
>> Is this right?
>>
>> If it is, is there any way to get behavior like the second case
>> with the current implementation?
>>
>> Thanks,
>>
>> Hiroyuki
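The pull-based assignment Hemanth describes can be sketched roughly as
below. This is an illustrative simplification, not the actual
JobTracker/scheduler source; the class and method names are made up:

    import java.util.List;

    // Sketch of locality-aware task assignment: the tasktracker asks for
    // work, and the assigner prefers a task whose input data lives on
    // that tracker's host.
    class LocalityAwareAssigner {

        // Pick a task for the tasktracker that just sent a heartbeat.
        Task assignTask(String trackerHost, List<Task> unassignedTasks) {
            // 1. Prefer a node-local task: one whose input split has a
            //    replica on the requesting host.
            for (Task t : unassignedTasks) {
                if (t.splitHosts().contains(trackerHost)) {
                    return t;
                }
            }
            // 2. Otherwise fall back to any remaining task (the real
            //    scheduler tries rack-local before off-rack); its input
            //    is then read over the network.
            return unassignedTasks.isEmpty() ? null : unassignedTasks.get(0);
        }

        interface Task {
            List<String> splitHosts();  // hosts holding this task's input block(s)
        }
    }

The key point in this model is that assignment is driven by which
tasktracker asks for work, not by block order, which is why, with one
replica per block, each tasktracker would be expected to pick up the
blocks stored on its own node.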