Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAPx=JkY2KAkmk_BYO3MvpYOHZSYKpZNz29NSFFNvbMUyPFyY1A@mail.gmail.com>
References: 
 <CADXX17N5cCE6eMEK4TqAD8YEPj1=vmr4aFiX5uohC5GHRC+6Lg@mail.gmail.com>
	<CAGUtCHr6Rp4Z2MYaL_Lzg1dR9AOuT2+roO=SXazsXYykRZxJVg@mail.gmail.com>
	<CADXX17PVWr0GBLhuczOuSEQ0AJqwv102nYh5DbC729ZUP3y=CQ@mail.gmail.com>
	<CAGUtCHq+bs5hauJ2T+qv4q5rvo9RcUhfjmW=iqFjoTRuga+EhA@mail.gmail.com>
	<CADXX17NffE0EP_Y0GMC4ys6iRnWU-d-KdC47RYyCvmjXGw9FMQ@mail.gmail.com>
	<555359A9.2030600@ccri.com>
	<CADxc9Bm7amiH=kLpSiXATMLB5hmVkj2+Doa8-4rERT-PrQp8VA@mail.gmail.com>
	<CADXX17P2BCTyE1P72wPd0EzKYmZ2xz-ivSJPBr4wLet-3rk+vQ@mail.gmail.com>
	<CADxc9Bmkr7POTDySCVmnY3hmeZoNenoQLOcvwJCKZJtxKKGhWg@mail.gmail.com>
	<CADXX17PRdPap3ti2P5emSO7k3C9Aw5fOR4fON_sptSTH_dF2=A@mail.gmail.com>
	<CADxc9BnotYvhMMZ25in4P+Q8NU9EWjwqSsb3KY9zHci5H++JBw@mail.gmail.com>
	<CAPx=JkY2KAkmk_BYO3MvpYOHZSYKpZNz29NSFFNvbMUyPFyY1A@mail.gmail.com>
Date: Thu, 14 May 2015 23:27:30 +0530
Message-ID: 
 <CADXX17PztmUXe6i0RDHt89K8-=jA3fHkLuqP6gNQ3v2meu=vWQ@mail.gmail.com>
Subject: Re: BatchScanner taking too much time to scan rows
From: vaibhav thapliyal <vaibhav.thapliyal.91@gmail.com>
To: user@accumulo.apache.org
Content-Type: multipart/alternative; boundary=90e6ba3fd2811bd4f105160e7672

--90e6ba3fd2811bd4f105160e7672
Content-Type: text/plain; charset=UTF-8

Dylan, could you elaborate on the average query time you had?
Thanks
Vaibhav
On 14-May-2015 11:03 pm, "Dylan Hutchison" <dhutchis@mit.edu> wrote:

> I think this is the same issue I found for ACCUMULO-3710
> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
> the tserver ran out of memory.  Accumulo doesn't handle large numbers of
> small, disjoint ranges well.  I bet there's room for improvement on both
> the client and tablet server.
> ~Dylan
>
> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <eric.newton@gmail.com>
> wrote:
>
>> Yes, hot-spotting does affect accumulo because you have fewer servers and
>> caches handling your request.
>>
>> Let's say your data is spread out evenly, in a uniform distribution from
>> "0".."9".
>>
>> What if you have only 1 split?  You would want it at "5", to divide the
>> data in half, and you could host the halves on different servers.  But if
>> you split at "1", now 10% of your queries go to one tablet, and 90% go to
>> the other.
>>
>> -Eric
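
A minimal sketch of the split-point example above, in plain Java with no
Accumulo dependency. It assumes row ids whose first character is spread
evenly over "0".."9" and that a tablet holds the rows up to and including its
split point. The class and method names (SplitSkew, loadPerTablet,
evenlySpreadRows) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitSkew {

    /**
     * Fraction of lookups landing in each tablet. Tablet i is assumed to hold
     * rows in (splits[i-1], splits[i]]; the last tablet holds everything after
     * the final split. sortedSplits must be sorted ascending.
     */
    static double[] loadPerTablet(List<String> rows, List<String> sortedSplits) {
        int[] counts = new int[sortedSplits.size() + 1];
        for (String row : rows) {
            int idx = Collections.binarySearch(sortedSplits, row);
            if (idx < 0) {
                idx = -idx - 1; // insertion point: first split greater than row
            }
            counts[idx]++;
        }
        double[] load = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            load[i] = (double) counts[i] / rows.size();
        }
        return load;
    }

    /** Hypothetical row ids whose first character cycles evenly through '0'..'9'. */
    static List<String> evenlySpreadRows(int n) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add((i % 10) + "_" + i);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = evenlySpreadRows(10000);
        double[] mid = loadPerTablet(rows, Collections.singletonList("5"));
        double[] low = loadPerTablet(rows, Collections.singletonList("1"));
        System.out.println("split at 5: " + mid[0] + " / " + mid[1]);
        System.out.println("split at 1: " + low[0] + " / " + low[1]);
    }
}
```

Under these assumptions the split at "5" gives each tablet half of the
lookups, while the split at "1" sends one tenth to the first tablet and nine
tenths to the second.
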
>>
>>
>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
>> vaibhav.thapliyal.91@gmail.com> wrote:
>>
>>> Thank you Eric. I will surely do the same. Does uneven distribution
>>> across the tablets affect querying in Accumulo?  In this case, it does. Is
>>> this behaviour normal?
>>> On 13-May-2015 10:58 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
>>>
>>>> Yes, that's a great way to split the data evenly.
>>>>
>>>> Also, since the data set is so small, turn on data caching for your
>>>> table:
>>>>
>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>
>>>> You may want to increase the size of your tserver JVM, and increase the
>>>> size of the cache:
>>>>
>>>> shell> config -s tserver.cache.data.size=1G
>>>>
>>>> This will help with repeated random look-ups.
>>>>
>>>> -Eric
>>>>
>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>
>>>>> Thank you Eric.
>>>>>
>>>>> One thing I would like to know: does pre-splitting the data play a
>>>>> part in querying Accumulo?
>>>>>
>>>>> I ask because I managed to decrease the query time somewhat.
>>>>> I did the following:
>>>>> My table was around 1.47 GB, so I explicitly set the split threshold to
>>>>> 256 MB instead of the default 1 GB.
>>>>>
>>>>> So I had just 8 tablets. Now when I ran the same query, it
>>>>> finished in 15 s.
>>>>>
>>>>> Is it because the split points are more evenly distributed?
>>>>>
>>>>> The previous table, on which the query took 50 s, had entries unevenly
>>>>> distributed across the tablets.
>>>>> Thanks
>>>>> Vaibhav
>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
>>>>>
>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>
>>>>>> I've created:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>
>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>
>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>>
>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>>> elahrvivaz@ccri.com> wrote:
>>>>>>
>>>>>>> It sounds like each of your ranges is an ID, i.e. a single row.
>>>>>>> I've found that scanning lots of non-sequential single-row ranges is pretty
>>>>>>> slow in Accumulo. Your best approach is probably to create an index table
>>>>>>> on whatever you are originally trying to query (assuming those 10000 ids
>>>>>>> came from some other query).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
>>>>>>>
>>>>>>>
>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>
>>>>>>> The number of rf files per tablet varies between 2 and 5. The
>>>>>>> BatchScanner returns 460000 entries. The approximate average
>>>>>>> data rate is 0.5 MB/s, as seen on the Accumulo monitor page.
>>>>>>>
>>>>>>> A simple scan on the table has an average data rate of about 7-8
>>>>>>> MB/s.
>>>>>>>
>>>>>>> All the ids exist in the Accumulo table.
>>>>>>>
>>>>>>> On 12 May 2015 at 23:39, Keith Turner <keith@deenlo.com> wrote:
>>>>>>>
>>>>>>>> Do you know how much data is being brought back (e.g. 100
>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s.  Do you know how
>>>>>>>> many files per tablet you have?  Do most of the 10,000 ids you are querying
>>>>>>>> for exist?
>>>>>>>>
>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have 194 tablets. Currently I create the BatchScanner with 20
>>>>>>>>> threads via the createBatchScanner method.
>>>>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <keith@deenlo.com> wrote:
>>>>>>>>>
>>>>>>>>>>   How many tablets do you have?  The batch scanner does not
>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>
>>>>>>>>>> If you give the batch scanner more threads than there are
>>>>>>>>>> tservers, it will make multiple parallel RPC calls to each tserver if the
>>>>>>>>>> tserver has multiple tablets.  Each RPC may include multiple tablets and
>>>>>>>>>> ranges for each tablet.
>>>>>>>>>>
>>>>>>>>>> If the batch scanner has fewer threads than tservers, it will
>>>>>>>>>> make one RPC per tserver per thread.  Each RPC call will include all
>>>>>>>>>> tablets and associated ranges for that tserver.
>>>>>>>>>>
>>>>>>>>>>  Keith
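
A rough model of the behaviour described above, in plain Java. This is a
simplified sketch and not Accumulo's client code: the chunk size (total
tablets divided by thread count) is an assumption, and the names
BatchPlanSketch, planRpcs and cluster are hypothetical. A tablet is never
split across RPCs, which mirrors the point that work within a tablet is not
parallelized.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchPlanSketch {

    /**
     * Groups tablets into RPC batches; each inner list is the set of tablets
     * sent in one RPC. tabletsByServer maps a tserver name to the tablets it
     * hosts that have at least one range to look up.
     */
    static List<List<String>> planRpcs(Map<String, List<String>> tabletsByServer, int numThreads) {
        List<List<String>> rpcs = new ArrayList<>();
        if (tabletsByServer.isEmpty()) {
            return rpcs;
        }
        // Default: one RPC per tserver carrying all of its tablets.
        int maxTabletsPerRpc = Integer.MAX_VALUE;
        if (numThreads > tabletsByServer.size()) {
            // Assumed heuristic: spread tablets so every thread has work.
            int total = 0;
            for (List<String> tablets : tabletsByServer.values()) {
                total += tablets.size();
            }
            maxTabletsPerRpc = Math.max(1, total / numThreads);
        }
        for (List<String> tablets : tabletsByServer.values()) {
            for (int start = 0; start < tablets.size(); start += maxTabletsPerRpc) {
                int end = (int) Math.min((long) start + maxTabletsPerRpc, tablets.size());
                rpcs.add(new ArrayList<>(tablets.subList(start, end)));
            }
        }
        return rpcs;
    }

    /** Builds a toy cluster with tablets assigned round-robin to tservers. */
    static Map<String, List<String>> cluster(int servers, int tablets) {
        Map<String, List<String>> byServer = new LinkedHashMap<>();
        for (int t = 0; t < tablets; t++) {
            byServer.computeIfAbsent("tserver" + (t % servers), k -> new ArrayList<>())
                    .add("tablet" + t);
        }
        return byServer;
    }

    public static void main(String[] args) {
        Map<String, List<String>> c = cluster(3, 194);
        System.out.println("2 threads:  " + planRpcs(c, 2).size() + " RPCs");
        System.out.println("20 threads: " + planRpcs(c, 20).size() + " RPCs");
    }
}
```

With 194 tablets spread over 3 tservers, this model plans 3 RPCs for 2
threads and 24 smaller RPCs for 20 threads. The real client may batch
differently.
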
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>>> vaibhav.thapliyal.91@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am using BatchScanner to scan rows from an Accumulo table.
>>>>>>>>>>> The table has around 187 million entries and I am using a 3-node cluster
>>>>>>>>>>> running Accumulo 1.6.1.
>>>>>>>>>>>
>>>>>>>>>>> I pass 10000 ids, which are stored as row ids in my table,
>>>>>>>>>>> as a list to the setRanges() method.
>>>>>>>>>>>
>>>>>>>>>>> This whole process takes around 50 seconds (from adding the ids to
>>>>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>>
>>>>>>>>>>> I tried switching on bloom filters, but that didn't help.
>>>>>>>>>>>
>>>>>>>>>>> Also, if anyone could briefly explain how a BatchScanner works and
>>>>>>>>>>> how it does parallel scanning, it would help me understand what I am
>>>>>>>>>>> doing better.
>>>>>>>>>>>
>>>>>>>>>>>  Thanks
>>>>>>>>>>>  Vaibhav
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
>

--90e6ba3fd2811bd4f105160e7672--