Message-ID: <53A30750.7070603@gmail.com>
Date: Thu, 19 Jun 2014 08:52:48 -0700
From: Josh Elser
To: dev@accumulo.apache.org
Subject: Re: Is Data Locality Helpful? (or why run tserver and datanode on the same box?)

I believe this happens via the DFSClient, but you can only expect the first block of a file to actually be on the local datanode (assuming there is one). Any other block may be remote. Assuming you have a proper rack script set up, I'd expect you to still get at least one rack-local replica (so you'd have a block nearby).

Interestingly (at least to me), I believe HBase does a bit of work in region (tablet) assignment to try to maximize the locality of regions WRT the datanodes hosting the blocks that make up their files. I need to dig into their code some day though.

In general, Accumulo and HBase tend to be relatively comparable in performance when properly configured, which makes me apt to think that data locality can help, but it's not some holy grail (of course, you won't ever hear me claim anything is in that position). I will say that I haven't done any real quantitative analysis either, though.

tl;dr HDFS block locality should not affect the functionality of Accumulo.

On 6/19/14, 7:25 AM, Corey Nolet wrote:
> AFAIK, the locality may not be guaranteed right away unless the data for a
> tablet was first ingested on the tablet server that is responsible for that
> tablet; otherwise you'll need to wait for a major compaction to rewrite the
> RFiles locally on the tablet server. I would assume if the tablet server is
> not on the same node as the datanode, those files will probably be spread
> across the cluster as if you were ingesting data from outside the cloud.
>
> A recent discussion with Bill Slacum also brought to light a possible
> problem of the HDFS balancer [1] re-balancing blocks after the fact, which
> could eventually pull blocks onto datanodes that are not local to the
> tablets. I believe the remedy for this was to turn off the balancer or not
> have it run.
>
> [1]
> http://www.swiss-scalability.com/2013/08/hadoop-hdfs-balancer-explained.html
>
> On Thu, Jun 19, 2014 at 10:07 AM, David Medinets
> wrote:
>
>> At the Accumulo Summit and on a recent client site, there have been
>> conversations about Data Locality and Accumulo.
>>
>> I ran an experiment to see that Accumulo can scan tables when the
>> tserver process is run on a server without a datanode process. I
>> followed these steps:
>>
>> 1. Start three node cluster
>> 2. Load data
>> 3. Kill datanode on slave1
>> 4. Wait until Hadoop notices dead node.
>> 5. Kill tserver on slave2
>> 6. Wait until Accumulo notices dead node.
>> 7. Run the accumulo shell on master and slave1 to verify entries can be
>> scanned.
>>
>> Accumulo handled this situation just fine, as I expected.
>>
>> How important (or not) is it to run tserver and datanode on the same
>> server?
>> Does the Data Locality implied by running them together exist?
>> Can the benefit be quantified?
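
On the question of quantifying the benefit, one rough starting point is to measure how many of a tablet's file blocks have a replica on the tserver's own host. The following is only a minimal sketch, not anything from Accumulo or HDFS: the function name, the block IDs, and the replica layout are all hypothetical, and it assumes the block-to-host mapping has already been collected by some other means (for example from `hdfs fsck <path> -files -blocks -locations`).

```python
from typing import Dict, List


def local_block_fraction(block_replicas: Dict[str, List[str]], tserver_host: str) -> float:
    """Return the fraction of blocks with at least one replica on tserver_host."""
    if not block_replicas:
        return 0.0
    local = sum(1 for hosts in block_replicas.values() if tserver_host in hosts)
    return local / len(block_replicas)


# Hypothetical replica layout for one RFile; block IDs and host names are made up.
example_blocks = {
    "blk_1": ["slave1", "slave2", "slave3"],
    "blk_2": ["slave2", "slave3", "master"],
    "blk_3": ["slave1", "slave3", "master"],
    "blk_4": ["slave2", "slave3", "master"],
}

# 2 of the 4 blocks have a replica on slave1, so this prints 0.5
print(local_block_fraction(example_blocks, "slave1"))
```

Comparing this fraction before and after a major compaction, or before and after a balancer run, would show how much locality changes; whether that translates into a measurable scan-time difference still has to be benchmarked separately.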