Subject: Re: BatchScanner taking too much time to scan rows
From: vaibhav thapliyal
To: user@accumulo.apache.org
Date: Thu, 14 May 2015 23:27:30 +0530

Dylan, could you elaborate on the average query time you had?
Thanks
Vaibhav

On 14-May-2015 11:03 pm, "Dylan Hutchison" <dhutchis@mit.edu> wrote:
I think this is the same issue I found for ACCUMULO-3710, only in my case the tserver ran out of memory. Accumulo doesn't handle large numbers of small, disjoint ranges well. I bet there's room for improvement on both the client and tablet server.
~Dylan

On Wed, May 13, 2015 at 3:13 PM, Eric Newton <eric.newton@gmail.com> wrote:
Yes, hot-spotting does affect Accumulo because you have fewer servers and caches handling your request.

Let's say your data is spread out in a uniform distribution from "0".."9".

What if you have only 1 split? You would want it at "5", to divide the data in half, and you could host the halves on different servers. But if you split at "1", now 10% of your queries go to one tablet, and 90% go to the other.
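The effect of the split point can be checked with a small sketch. It assumes row ids are uniformly distributed four-digit strings (a made-up data set, not the poster's table) and that a tablet holds the rows after the previous split point up to and including its own split point; `tablet_share` is a hypothetical helper, not an Accumulo API.

```python
from bisect import bisect_left
from collections import Counter

def tablet_share(rows, splits):
    """Return the fraction of rows landing in each tablet.

    A tablet holds rows in (previous split, split]; the last tablet
    holds everything after the final split point.
    """
    splits = sorted(splits)
    counts = Counter(bisect_left(splits, row) for row in rows)
    return [counts[i] / len(rows) for i in range(len(splits) + 1)]

# Hypothetical row ids: 10,000 keys, uniform over "0000".."9999".
rows = [f"{n:04d}" for n in range(10_000)]

print(tablet_share(rows, ["5"]))  # [0.5, 0.5]
print(tablet_share(rows, ["1"]))  # [0.1, 0.9]
```

With the split at "1", nine out of ten look-ups land on the same tablet, and therefore on the same server and cache.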

-Eric


On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <vaibhav.thapliyal.91@gmail.com> wrote:

Thank you, Eric. I will surely do the same. Should uneven distribution across the tablets affect querying in Accumulo? In this case, it does. Is this behaviour normal?

On 13-May-2015 10:58 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
Yes, that's a great way to split the data evenly.

Also, since the data set is so small, turn on data caching for your table:

shell> config -t mytable -s table.cache.block.enable=true

You may want to increase the size of your tserver JVM, and increase the size of the cache:

shell> config -s tserver.cache.data.size=1G

This will help with repeated random look-ups.
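To see why cache size matters for repeated look-ups, here is a toy model. It assumes least-recently-used eviction and a made-up workload that reads the same 100 blocks ten times over; neither assumption is confirmed in this thread, and `lru_hits` is a hypothetical helper.

```python
from collections import OrderedDict

def lru_hits(accesses, capacity):
    """Count cache hits for a sequence of block ids under LRU eviction."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits

# Hypothetical workload: the same 100 blocks read 10 times over.
accesses = list(range(100)) * 10

print(lru_hits(accesses, capacity=100))  # 900
print(lru_hits(accesses, capacity=50))   # 0
```

In this model a cache that holds the whole working set serves every read after the first pass, while one half that size never gets a hit on this cyclic pattern. Real workloads are less extreme, but the direction is the same.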

-Eric

On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <vaibhav.thapliyal.91@gmail.com> wrote:

Thank you, Eric.

One thing I would like to know. Does pre-splitting the data play a part in querying Accumulo?

I ask because I managed to decrease the querying time somewhat with the following steps:
My table was around 1.47 GB, so I explicitly set the split parameter to 256 MB instead of the default 1 GB.

So I had just 8 tablets. Now when I carried out the same query, it finished in 15s.

Is it because the split points are more evenly distributed?

The previous table on which the query took 50s had entries unevenly distributed across the tablets.
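As a rough sanity check on the tablet count, the arithmetic can be sketched as below. It assumes tablets split once they exceed the threshold and that every resulting tablet ends up between half the threshold and the full threshold in size; that is a simplification for illustration, and `tablet_count_bounds` is a hypothetical helper.

```python
import math

def tablet_count_bounds(table_bytes, split_threshold_bytes):
    """Rough (low, high) tablet count for a table that split on its own.

    Assumes every tablet is between half the threshold and the full
    threshold in size, which is a simplification.
    """
    ratio = table_bytes / split_threshold_bytes
    return math.ceil(ratio), math.floor(2 * ratio)

GB = 1024 ** 3
MB = 1024 ** 2

print(tablet_count_bounds(1.47 * GB, 256 * MB))  # (6, 11)
```

The 8 tablets reported above fall inside that range.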
Thanks
Vaibhav

On 13-May-2015 7:43 pm, "Eric Newton" <eric.newton@gmail.com> wrote:
This use case is one of the things Accumulo was designed to handle well. It's the reason there is a BatchScanner.

I've created:

https://issues.apache.org/jira/browse/ACCUMULO-3813

so we can investigate and track down any problems or improvements.

Feel free to add any other details to the JIRA ticket.

-Eric


On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <elahrvivaz@ccri.com> wrote:
It sounds like each of your ranges is an ID, i.e. a single row. I've found that scanning lots of non-sequential single-row ranges is pretty slow in Accumulo. Your best approach is probably to create an index table on whatever you are originally trying to query (assuming those 10000 ids came from some other query).
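The index-table idea can be sketched with a sorted list standing in for a table. The key layout (attribute value, a NUL separator, then the row id) and the sample data are assumptions for illustration, not a layout anyone in this thread described.

```python
from bisect import bisect_left

NUL = "\x00"

def build_index(records):
    """records: iterable of (row_id, attribute_value) pairs.

    Returns sorted index keys of the form value + NUL + row_id, so every
    row id for one value sits in one contiguous stretch of the index.
    """
    return sorted(value + NUL + row_id for row_id, value in records)

def lookup(index, value):
    """Return all row ids for `value` with one contiguous range scan."""
    prefix = value + NUL
    start = bisect_left(index, prefix)
    ids = []
    for key in index[start:]:
        if not key.startswith(prefix):
            break
        ids.append(key[len(prefix):])
    return ids

# Hypothetical data: row ids tagged with a city attribute.
records = [("id7", "paris"), ("id2", "delhi"), ("id9", "paris"), ("id4", "delhi")]
index = build_index(records)

print(lookup(index, "paris"))  # ['id7', 'id9']
```

Because entries for one value are adjacent, the query becomes a single sequential range instead of many scattered single-row ranges. Storing the needed data in the index entry itself would avoid a second round of look-ups by id.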

Thanks,

Emilio


On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
The number of rf files per tablet varies between 2 and 5. The BatchScanner returned 460000 entries. The approximate average data rate is 0.5 MB/s, as seen on the Accumulo monitor page.

A simple scan on the table has an average data rate of about 7-8 MB/s.

All the ids exist in the accumulo table.
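Putting the reported figures side by side gives a feel for the workload. This sketch assumes the 50 s runtime and the 0.5 MB/s rate describe the same run, and takes 1 MB as 10^6 bytes; both are assumptions, and `summarize` is a hypothetical helper.

```python
def summarize(entries, ids, seconds, mb_per_sec):
    """Derive per-second and per-entry figures from the reported numbers."""
    total_mb = mb_per_sec * seconds
    return {
        "entries_per_sec": entries / seconds,
        "entries_per_id": entries / ids,
        "total_mb": total_mb,
        "bytes_per_entry": total_mb * 1_000_000 / entries,
    }

stats = summarize(entries=460_000, ids=10_000, seconds=50, mb_per_sec=0.5)
print(stats["entries_per_sec"])         # 9200.0
print(stats["entries_per_id"])          # 46.0
print(stats["total_mb"])                # 25.0
print(round(stats["bytes_per_entry"]))  # 54
```

Under those assumptions the query moves only about 25 MB in total, so the time is probably going to per-range overhead, not raw data volume.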

On 12 May 2015 at 23:39, Keith Turner <keith@deenlo.com> wrote:
Do you know how much data is being brought back (e.g. 100 megabytes)? I am wondering what the data rate is in MB/s. Do you know how many files per tablet you have? Do most of the 10,000 ids you are querying for exist?

On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <vaibhav.thapliyal.91@gmail.com> wrote:

I have 194 tablets. Currently I am passing 20 threads to the createBatchScanner method when creating the BatchScanner.

On 12-May-2015 11:19 pm, "Keith Turner" <keith@deenlo.com> wrote:
How many tablets do you have? The batch scanner does not parallelize operations within a tablet.

If you give the batch scanner more threads than there are tservers, it will make multiple parallel rpc calls to each tserver if the tserver has multiple tablets. Each rpc may include multiple tablets and ranges for each tablet.

If the batch scanner has fewer threads than tservers, it will make one rpc per tserver per thread. Each rpc call will include all tablets and associated ranges for that tserver.
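The behaviour described above can be modelled roughly as follows. This is a simplified sketch of the description in this message, not Accumulo's actual client code; `plan_rpcs`, the round-robin grouping, and the sample layout are all assumptions.

```python
def plan_rpcs(tablets_by_tserver, num_threads):
    """Group tablets into RPCs, one list of tablets per RPC.

    tablets_by_tserver: dict of tserver name -> list of tablets that hold
    at least one requested range.
    """
    num_servers = len(tablets_by_tserver)
    # With more threads than tservers, each tserver can get several RPCs.
    rpcs_per_server = max(1, num_threads // num_servers)
    plan = {}
    for server, tablets in tablets_by_tserver.items():
        # A tablet is never divided, so never plan more RPCs than tablets.
        n = min(rpcs_per_server, len(tablets))
        # Deal tablets round-robin into n RPCs for this tserver.
        plan[server] = [tablets[i::n] for i in range(n)]
    return plan

# Hypothetical layout: 3 tservers with 4 tablets each.
layout = {
    "tserver1": ["t1", "t2", "t3", "t4"],
    "tserver2": ["t5", "t6", "t7", "t8"],
    "tserver3": ["t9", "t10", "t11", "t12"],
}

print(plan_rpcs(layout, num_threads=6)["tserver1"])  # [['t1', 't3'], ['t2', 't4']]
print(plan_rpcs(layout, num_threads=2)["tserver1"])  # [['t1', 't2', 't3', 't4']]
```

In this model, 20 threads against 3 tservers would allow up to 6 parallel RPCs per tserver, each covering a share of that server's tablets.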

Keith



On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <vaibhav.thapliyal.91@gmail.com> wrote:
Hi,

I am using BatchScanner to scan rows from an Accumulo table. The table has around 187m entries and I am using a 3 node cluster which has Accumulo 1.6.1.

I have passed 10000 ids, which are stored as row ids in my table, as a list to the setRanges() method.

This whole process takes around 50 secs (from adding the ids to the list to scanning the whole table using the BatchScanner).

I tried switching on bloom filters but that didn't work.

Also if anyone could briefly explain how a BatchScanner works, how it does parallel scanning it would help me understand what I am doing better.

Thanks
Vaibhav









