Subject: Re: Optimize Accumulo scan speed
From: Andrew Hulbert
To: user@accumulo.apache.org
Date: Sun, 10 Apr 2016 11:25:03 -0400

Mario,

Are you using a Scanner or a BatchScanner?

One thing we did in the past with a geohash-based schema was to prefix a shard ID to the geohash, which allows you to involve all the tservers in the scan. You'd multiply your ranges by the number of tservers you have, but if the client is not the bottleneck, it may increase your throughput.
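
Roughly what I mean, as an untested sketch; the shard count, key format, table, and column names here are just illustrative, not something from your schema:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ShardedGeohexSketch {

  static final int NUM_SHARDS = 15; // e.g. roughly one shard per tserver

  // Write side: prepend a zero-padded shard ID so one day's data is spread
  // across many tablets (and therefore many tservers).
  static void write(Connector conn, String table, String day, String geohex, byte[] value)
      throws Exception {
    int shard = Math.abs(geohex.hashCode()) % NUM_SHARDS;
    String row = String.format("%02d_%s_%s", shard, day, geohex);
    BatchWriter writer = conn.createBatchWriter(table, new BatchWriterConfig());
    Mutation m = new Mutation(row);
    m.put("cf", "cq", new Value(value));
    writer.addMutation(m);
    writer.close();
  }

  // Read side: one range per (shard, geohex) pair, all handed to a BatchScanner
  // so every tserver hosting a shard participates in the scan.
  static void query(Connector conn, String table, String day, List<String> geohexes)
      throws Exception {
    List<Range> ranges = new ArrayList<>();
    for (int s = 0; s < NUM_SHARDS; s++) {
      for (String hex : geohexes) {
        ranges.add(Range.prefix(String.format("%02d_%s_%s", s, day, hex)));
      }
    }
    BatchScanner bs = conn.createBatchScanner(table, Authorizations.EMPTY, 10);
    bs.setRanges(ranges);
    for (Map.Entry<Key, Value> entry : bs) {
      // process entry.getKey() / entry.getValue()
    }
    bs.close();
  }
}

With 15 shards every geohex range becomes 15 ranges, but the BatchScanner fans them out in parallel, so as long as the client can keep up you get more tservers working on the same query.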

Andrew

On 04/10/2016 11:05 AM, Mario Pastorelli wrote:
Hi,

I'm currently having some scan speed issues with Accumulo and I would like to understand why and how I can solve them. I have geographical data and I use as primary key the day and then the geohex, which is a linearisation of lat and lon. The reason for this key is that I always query the data for one day but for a set of geohexes which represent a zone, so with this schema I can use a single scan to read all the data for one day with few seeks (a rough sketch of this layout follows the trace below). My problem is that the scan is painfully slow: for instance, to read 5617019 rows it takes around 17 seconds, the scan speed is 13MB/s, less than 750k scan entries/s, and around 300 seeks. I enabled the tracer and this is what I've got:

17325+0 Dice@srv1 Dice.query
11+1 Dice@srv1 scan
11+1 Dice@srv1 scan:location
5+13 Dice@srv1 scan
5+13 Dice@srv1 scan:location
4+19 Dice@srv1 scan
4+19 Dice@srv1 scan:location
5+23 Dice@srv1 scan
4+24 Dice@srv1 scan:location
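
For illustration only, the layout described above (day first, then geohex, no shard prefix) might be built and queried roughly like this; the names and key format are guesses rather than Mario's actual code, and the imports and Connector setup are the same as in the sketch earlier in the thread:

static void queryZone(Connector conn, String table, String day, List<String> zoneGeohexes)
    throws Exception {
  // Row key layout: day first, then geohex, e.g. "20160410_<geohex>", so one
  // day's zone is a handful of contiguous ranges.
  List<Range> ranges = new ArrayList<>();
  for (String hex : zoneGeohexes) {
    ranges.add(Range.prefix(day + "_" + hex));
  }
  BatchScanner bs = conn.createBatchScanner(table, Authorizations.EMPTY, 10);
  bs.setRanges(ranges);
  for (Map.Entry<Key, Value> entry : bs) {
    // process entry.getKey() / entry.getValue()
  }
  bs.close();
}

All of these ranges fall on whichever tservers host that day's tablets, which is why only a couple of servers end up doing the work.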
I'm not sure how to speed up the scanning. I have the following questions:
  - is this speed normal?
  - can I involve more servers in the scan? Right now only two servers have the ranges, but with a cluster of 15 machines it would be nice to involve more of them. Is it possible?

Thanks,
Mario

--
Mario Pastorelli | TERALYTICS

software engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41794381682
email: mario.pastorelli@teralytics.ch

www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de Vries

This e-mail message contains confidential information which is for the sole attention and use of the intended recipient. Please notify us at once if you think that it may not be intended for you and delete it immediately.

