Subject: Re: AccumuloFileOutputFormat tuning
From: Keith Turner
To: user@accumulo.apache.org
Date: Sat, 3 Jan 2015 11:28:16 -0500

You can try sampling with jstack as a simple and quick way to profile. Jstack a
process that is writing RFiles ~10 times, with some pause between runs. Then look
at a particular thread writing data across the jstack dumps: do you see the same
code being executed in multiple jstacks? If so, what code is that?

Sent from phone. Please excuse typos and brevity.
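For reference, a minimal sketch of that kind of sampling loop. It assumes jstack is
on the PATH and that the pid of a child JVM writing RFiles is passed as the first
argument; the sample count, pause, and output file names are placeholders.

    import java.io.File;

    public class JstackSampler {
        public static void main(String[] args) throws Exception {
            String pid = args[0];   // pid of a mapper/reducer JVM writing RFiles
            int samples = 10;       // ~10 dumps, as suggested above
            long pauseMs = 5000;    // pause between dumps

            for (int i = 0; i < samples; i++) {
                File out = new File("jstack-" + pid + "-" + i + ".txt");
                int exit = new ProcessBuilder("jstack", pid)
                        .redirectOutput(out)   // save each dump to its own file
                        .start()
                        .waitFor();
                if (exit != 0) {
                    System.err.println("jstack exited with " + exit + " on sample " + i);
                }
                Thread.sleep(pauseMs);
            }
            // Afterwards, compare the writer thread's top frames across the dumps;
            // frames that keep showing up are where the time is going.
        }
    }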
On Jan 3, 2015 12:46 AM, "Ara Ebrahimi" <ara.ebrahimi@argyledata.com> wrote:

> Hi,
>
> I'm trying to optimize our map/reduce job which generates RFiles using
> AccumuloFileOutputFormat. We have a specific time window, and within that
> window we need to generate a predefined amount of simulation data; we also
> have an upper bound on the number of cores we can use. Disks are fixed at
> 4 per node and they are all SSDs, so I can't employ more machines, disks,
> or cores to achieve higher writes/s.
>
> So far we've managed to utilize 100% of all available cores, and the SSDs
> are also highly utilized. I'm trying to reduce processing time, and we are
> willing to trade more disk space for higher data generation speed. The data
> itself is tens of columns of floating-point numbers, all serialized to
> fixed 9-byte values, which doesn't lend itself well to compression. With no
> compression and replication set to 1 we can generate the same amount of
> data in almost half the time. With snappy it's almost 10% more data
> generation time compared to no compression, and almost twice the size on
> disk for all the generated RFiles.
>
> dataBlockSize doesn't seem to change anything for non-compressed data.
> indexBlockSize also didn't change anything (tried 64K vs. the default 128K).
>
> Any other tricks I could employ to achieve higher writes/s?
>
> Ara.
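For context, the knobs mentioned above (compression type, replication,
dataBlockSize, indexBlockSize) are set on the Hadoop Job in the driver. A minimal
sketch, assuming the Accumulo 1.6-era
org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat API; the output
path and the block-size values are illustrative placeholders, not recommendations.

    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class RFileOutputConfig {

        /** Wire the tuning knobs discussed above into the job driver. */
        public static void configure(Job job) {
            job.setOutputFormatClass(AccumuloFileOutputFormat.class);

            // Placeholder output directory for the generated RFiles.
            AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/rfiles"));

            // Compression/speed trade-off discussed above: "none" is fastest to
            // write but costs disk space; "snappy", "gz", and "lzo" compress more
            // at the cost of CPU.
            AccumuloFileOutputFormat.setCompressionType(job, "snappy");

            // HDFS replication of the written files (the question used 1 for speed).
            AccumuloFileOutputFormat.setReplication(job, 1);

            // Data and index block sizes, in bytes (illustrative values).
            AccumuloFileOutputFormat.setDataBlockSize(job, 64 * 1024);
            AccumuloFileOutputFormat.setIndexBlockSize(job, 128 * 1024);
        }
    }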