From: Keith Turner
Date: Tue, 2 Aug 2016 11:56:03 -0400
Subject: Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted
To:
 user@accumulo.apache.org

If you are not aware of it, something else to consider is the
setOfflineTableScan[1] option. This can support much faster reads of data.
In my experience this is usually only useful for map-only jobs like the one
you are running. When doing map/reduce, the sort can make a speedup in the
map read rate irrelevant.

You still may not get locality if tablets have multiple files, because a
merged read of the files is done in the mapper. Offline map reduce in
Accumulo attempts to run mappers at the last location where a tablet
compacted some of its files. Even without locality you still avoid the cost
of de-serializing, re-serializing, transmitting, and de-serializing data in
the tserver and client.

[1]: http://accumulo.apache.org/1.7/apidocs/org/apache/accumulo/core/client/mapred/InputFormatBase.html#setOfflineTableScan%28org.apache.hadoop.mapred.JobConf,%20boolean%29

On Mon, Aug 1, 2016 at 7:55 PM, Mario Pastorelli wrote:
> I would like to use an Accumulo table as input for a Spark job. Let me
> clarify that my job doesn't need keys sorted and Accumulo is purely used to
> filter the input data thanks to its index on the keys. The data that I need
> to process in Spark is still a small portion of the full dataset.
> I know that Accumulo provides the AccumuloInputFormat, but in my tests almost
> no task has data locality when I use this input format, which leads to poor
> performance. I'm not sure why this happens, but my guess is that the
> AccumuloInputFormat creates one task per range.
> I wonder if there is a way to tell the AccumuloInputFormat to split each
> range into the sub-ranges local to each tablet server, so that each task in
> Spark will read only data from the same machine where it is running.
>
> Thanks for the help,
> Mario
>
> --
> Mario Pastorelli | TERALYTICS
> software engineer
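
As an illustration of the remark in the reply that a tablet with multiple files needs "a merged read of the files" in the mapper, here is a minimal sketch of a k-way merge over several individually sorted inputs. It is not Accumulo's implementation: the class name MergedRead, the merge method, and the use of plain String keys in place of Accumulo Key/Value pairs are all assumptions made for the example. The point it tries to show is that every emitted key depends on the current head of every file, so a mapper placed next to one file may still have to read the others remotely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of the Accumulo API.
public class MergedRead {

    // One cursor per file: the current head key plus the rest of that file's keys.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Merges several individually sorted key lists into one sorted list.
    public static List<String> merge(List<List<String>> sortedFiles) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> file : sortedFiles) {
            if (!file.isEmpty()) {
                heap.add(new Cursor(file.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // The smallest head across all files is the next key in sorted order.
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance this file's cursor and put it back into the heap.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three "files" of one tablet, each sorted on its own (assumed sample data).
        List<List<String>> files = Arrays.asList(
            Arrays.asList("row1", "row4", "row7"),
            Arrays.asList("row2", "row5"),
            Arrays.asList("row3", "row6"));
        System.out.println(merge(files));
    }
}
```

With a single file per tablet the heap holds one cursor and the read is a plain sequential scan, which is the case where running the mapper at the location of the last compaction is most likely to pay off.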