Subject: Re: AccumuloFileOutputFormat tuning
From: Josh Elser <josh.elser@gmail.com>
Date: Sat, 03 Jan 2015 13:26:38 -0500
To: user@accumulo.apache.org

You could also use JVisualVM, which can give better reports on benchmarks than manually inspecting jstacks.

Keith Turner wrote:
> You can try sampling with jstack as a simple and quick way to profile.
> Jstack a process writing RFiles ~10 times, with some pause between
> samples. Then look at a particular thread writing data across the
> jstack captures: do you see the same code being executed in multiple
> jstacks? If so, what code is that?
>
> Sent from phone. Please excuse typos and brevity.
>
> On Jan 3, 2015 12:46 AM, "Ara Ebrahimi" wrote:
>
> Hi,
>
> I'm trying to optimize our map/reduce job, which generates RFiles
> using AccumuloFileOutputFormat. We have a specific time window, and
> within that window we need to generate a predefined amount of
> simulation data; we also have an upper bound on the number of cores
> we can use. Disks are fixed at 4 per node, and they are all SSDs. So
> I can't employ more machines, disks, or cores to achieve higher
> writes/s.
> So far we've managed to utilize 100% of all available cores, and the
> SSDs are also highly utilized. I'm trying to reduce processing time,
> and we are willing to waste more disk space to achieve higher data
> generation speed. The data itself is tens of columns of floating-point
> numbers, all serialized to fixed 9-byte values, which doesn't lend
> itself well to compression. With no compression and replication set
> to 1, we can generate the same amount of data in almost half the
> time. With snappy, data generation takes almost 10% longer than with
> no compression, while no compression takes almost twice the space on
> disk for all the generated RFiles.
>
> dataBlockSize doesn't seem to change anything for non-compressed
> data. indexBlockSize also didn't change anything (tried 64K vs. the
> default 128K).
>
> Any other tricks I could employ to achieve higher writes/s?
>
> Ara.
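For anyone following along, Keith's jstack suggestion is essentially poor-man's stack sampling: snapshot the stacks N times with a pause between snapshots, and whatever frame keeps showing up on top is where the time goes. The same idea can be sketched in-process with `Thread.getAllStackTraces()` (this is an illustration of the technique, not the jstack workflow itself; the class name `StackSampler` is made up):

```java
import java.util.HashMap;
import java.util.Map;

// Poor-man's sampling profiler: take N stack snapshots with a pause
// between each, then count how often each top frame appears. Frames
// that dominate the counts are where the CPU time is going -- the same
// reasoning you'd apply to ~10 saved `jstack <pid>` outputs.
public class StackSampler {
    public static Map<String, Integer> sample(int samples, long pauseMillis)
            throws InterruptedException {
        Map<String, Integer> topFrameCounts = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> entry
                    : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = entry.getValue();
                if (stack.length == 0) {
                    continue; // thread had no captured frames at this instant
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                topFrameCounts.merge(top, 1, Integer::sum);
            }
            Thread.sleep(pauseMillis);
        }
        return topFrameCounts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Print the five most frequently sampled top frames.
        sample(10, 100).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(5)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

It's crude (no safepoint bias correction, counts only the topmost frame), but for "is all my time in compression or in serialization?" questions it is usually enough.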
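For reference, the knobs Ara mentions (compression type, dataBlockSize, indexBlockSize, replication) are set through static methods on AccumuloFileOutputFormat. A sketch of wiring them into a job, assuming the 1.6-era mapreduce API; the job name and output path are made up, and this is a configuration sketch rather than a complete, runnable job:

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RFileJobSetup {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "rfile-generator"); // hypothetical job name
        job.setOutputFormatClass(AccumuloFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/rfiles")); // hypothetical path

        // Compression codec for the generated RFiles: "none", "gz", "lzo", or "snappy".
        AccumuloFileOutputFormat.setCompressionType(job, "none");

        // The block sizes from the thread (data blocks and index blocks within the RFile).
        AccumuloFileOutputFormat.setDataBlockSize(job, 128 * 1024);
        AccumuloFileOutputFormat.setIndexBlockSize(job, 128 * 1024);

        // HDFS replication for the output files, as in the replication=1 experiment.
        AccumuloFileOutputFormat.setReplication(job, 1);

        return job;
    }
}
```

Worth double-checking the exact method set against the javadoc for the Accumulo version you're on, since these setters have moved between packages across releases.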