Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
MIME-Version: 1.0
Reply-To: vines@apache.org
In-Reply-To: 
 <CAO39HYWgpwBOULfuMozHVfiB842-ZfDktht75KXcrgJGGpsx2A@mail.gmail.com>
References: 
 <CAO39HYUa_H=YyDpW=x1G93+huOaLquZ+eLKawRLTqpQPvbyj7A@mail.gmail.com>
	<CAMz+Dutf+D10-jhmR=DwRpwkyiN2CTx94YrX6v93Zy5u84sP4w@mail.gmail.com>
	<CAMz+DutZ_5t6CC+s4vMq5xwWYUNVTXh8CeE1820CxhPm+O5wsA@mail.gmail.com>
	<CAO39HYU_rYunmtS5ZZZtjbHS1ZrnpbnjP-r+a2wnCKkpJG0QRw@mail.gmail.com>
	<CAMz+DuthWeYS4VVws=ubOrF6VMC_6DN=rvRJiDHx4pLHZS1xLg@mail.gmail.com>
	<CAO39HYX6uuDQZJ8tY_Jo2yQqX9gOO0+pP5TDk1kyJJr70eU-Zg@mail.gmail.com>
	<CAO39HYXiF-S8xGXvQ2NeC+pqTEWiXj3zt_Ni62CPSAMhWkopgQ@mail.gmail.com>
	<CAMz+DuuFLti4vgo_w+jvTR0e6vMCf1UTxfkEKg2-6eskv1T4yA@mail.gmail.com>
	<CAO39HYXFXeiRiEV5M46FEGm_949twZD8JrqGTKWEuNQQuCivyw@mail.gmail.com>
	<CAMz+DustgMnySZxzEQb6R+587okF2zJS4cESHiP8-Js-Uf3kTA@mail.gmail.com>
	<CAO39HYWEqcGGY5x8ELueucX5yBTHY5UeAuDUa3nMa+WA=+QTnw@mail.gmail.com>
	<CAMz+Dut=oKUNYU79qwnpLrvBAdiUOOuJPkxDkR9RgiArpCNTuA@mail.gmail.com>
	<CADczPYRDNDUcJybASoDEXU4WMoXMFC-71jrO6XLzQxuD=tWJsw@mail.gmail.com>
	<CAO39HYWgpwBOULfuMozHVfiB842-ZfDktht75KXcrgJGGpsx2A@mail.gmail.com>
Date: Fri, 9 Nov 2012 13:09:49 -0500
Message-ID: 
 <CADczPYR0POEM83Kuv_sFOhXzKMsHtagc_Gs556Xq5re9HDGmLA@mail.gmail.com>
Subject: Re: Performance of table with large number of column families
From: John Vines <vines@apache.org>
To: user@accumulo.apache.org
Content-Type: multipart/alternative; boundary=14dae93407dd8ca1c004ce13dbe1

--14dae93407dd8ca1c004ce13dbe1
Content-Type: text/plain; charset=ISO-8859-1

Glad to hear. I typically advise a minimum of 2 shards per tserver. I would
say the maximum is actually based on the tablet size. Others in the community
may disagree/provide better reasoning.
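
To make that rule of thumb concrete, here is a rough Python sketch using the
numbers from this thread (15 tservers, about 20M records). It assumes "shard"
means a partition row ID, and the helper names are made up for illustration;
nothing here is an Accumulo API.

```python
def min_shards(tablet_servers, shards_per_server=2):
    """Lower bound on the shard count: a couple of shards per tablet server."""
    return tablet_servers * shards_per_server


def records_per_shard(total_records, shards):
    """Average records per shard, rounded up (integer ceiling division)."""
    return -(-total_records // shards)


if __name__ == "__main__":
    servers = 15            # tablet servers mentioned in the thread
    records = 20_000_000    # approximate record count mentioned in the thread
    # Compare the suggested floor, the 1000-row re-ingest, and the original 10M rows.
    for shards in (min_shards(servers), 1000, 10_000_000):
        print(f"{shards:>10} shards -> ~{records_per_shard(records, shards)} records each")
```

The 2-per-tserver floor gives 30 shards here; the upper bound still depends on
tablet size, which this sketch does not model.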

Sent from my phone, pardon the typos and brevity.
On Nov 9, 2012 1:03 PM, "Anthony Fox" <adfaccuser@gmail.com> wrote:

> Ok, I reingested with 1000 rows and performance for both single record
> scans and index scans is much better.  I'm going to experiment a bit with
> the optimal number of rows.  Thanks for the help, everyone.
>
>
> On Fri, Nov 9, 2012 at 12:41 PM, John Vines <vines@apache.org> wrote:
>
>> The bloom filter checks only occur on a seek, and the way the column
>> family filter works is that it seeks and then does a few scans to see if the
>> appropriate families pop up in the short term. Bloom filter on the column
>> family would be better if you had larger rows to encourage more
>> seeks/minimize the number of rows to do bloom checks.
>>
>> The issue is that you are ultimately checking every single row for a
>> column, which is sparse. It's not that different than doing a full table
>> regex. If you had locality groups set up it would be more performant, until
>> you create locality groups for everything.
>>
>> The intersecting iterators get their performance by being able to operate
>> on large rows to avoid the penalty of checking each row. Minimize the
>> number of partitions you have and it should clear up your issues.
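
A toy model of the point above about checking every row. This is only a
sketch in Python, not how the RFile or bloom filter code works: it
assumes keys are sorted (row, column family) pairs, that looking for a family
without a row range means one seek per row, and that record i lands in row
i % num_rows. All names are hypothetical.

```python
from bisect import bisect_left


def build_table(num_rows, num_records):
    """Return sorted (row, family) keys; record i is placed in row i % num_rows."""
    width = len(str(num_rows - 1))
    keys = [(f"{i % num_rows:0{width}d}", f"rec{i:08d}") for i in range(num_records)]
    keys.sort()
    return keys


def scan_for_family(keys, family):
    """Visit every row, seek to (row, family) inside it, and count the seeks."""
    rows = sorted({row for row, _ in keys})
    hits, seeks = [], 0
    for row in rows:
        seeks += 1
        pos = bisect_left(keys, (row, family))
        if pos < len(keys) and keys[pos] == (row, family):
            hits.append(row)
    return hits, seeks


if __name__ == "__main__":
    target = "rec00000042"
    # Same 20,000 records, spread over many small rows versus a few large rows.
    for num_rows in (10_000, 10):
        hits, seeks = scan_for_family(build_table(num_rows, 20_000), target)
        print(f"{num_rows} rows: found in {hits} after {seeks} seeks")
```

The number of seeks tracks the number of rows, not the number of records,
which is why fewer, larger partitions make the same lookup cheaper in this
model.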
>>
>> John
>>
>> Sent from my phone, pardon the typos and brevity.
>> On Nov 9, 2012 12:24 PM, "William Slacum" <wilhelm.von.cloud@accumulo.net>
>> wrote:
>>
>>> I'll ask for someone to verify this comment for me (look @ u John W
>>> Vines), but the bloom filter helps when you have a discrete number of
>>> column families that will appear across many rows.
>>>
>>> On Fri, Nov 9, 2012 at 12:18 PM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>
>>>> Ah, ok, I was under the impression that this would be really fast since
>>>> I have a column family bloom filter turned on.  Is this not correct?
>>>>
>>>>
>>>> On Fri, Nov 9, 2012 at 12:15 PM, William Slacum <
>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>> When I said a smaller number of tablets, I really meant a smaller number
>>>>> of rows :) My apologies.
>>>>>
>>>>> So if you're searching for a random column family in a table, like
>>>>> with a `scan -c <cf>` in the shell, it will start at row 0 and work
>>>>> sequentially up to row 10000000 until it finds the cf.
>>>>>
>>>>>
>>>>> On Fri, Nov 9, 2012 at 12:11 PM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>
>>>>>> This scan is without the intersecting iterator.  I'm just trying to
>>>>>> pull back a single data record at the moment which corresponds to scanning
>>>>>> for one column family.  I'll try with a smaller number of tablets, but is
>>>>>> the computation effort the same for the scan I am doing?
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 9, 2012 at 12:02 PM, William Slacum <
>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>
>>>>>>> So that means you have roughly 312.5k rows per tablet, which means
>>>>>>> about 625k column families in any given tablet. The intersecting iterator
>>>>>>> will work one row at a time, so I think at any given moment, it will be
>>>>>>> working through 32 at a time and doing a linear scan through the RFile
>>>>>>> blocks. With RFile indices, that check is usually pretty fast, but you're
>>>>>>> having to go through 4 orders of magnitude more data sequentially than you can
>>>>>>> work on. If you can experiment and re-ingest with a smaller number of
>>>>>>> tablets, anywhere between 15 and 45, I think you will see better
>>>>>>> performance.
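
A quick check of the per-tablet arithmetic, as a Python sketch using the
figures given in this thread (10M rows, 20M records, 32 tablets). It assumes
an even spread across tablets and counts only the data section's one column
family per record, ignoring the index section.

```python
def per_tablet(total, tablets):
    """Average share of `total` held by each tablet, assuming an even spread."""
    return total / tablets


if __name__ == "__main__":
    tablets = 32
    rows = 10_000_000        # partition rows 0000000 to 9999999
    records = 20_000_000     # one data column family per record
    print("rows per tablet:", per_tablet(rows, tablets))
    print("data column families per tablet:", per_tablet(records, tablets))
```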
>>>>>>>
>>>>>>> On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>>>
>>>>>>>> Failed to answer the original question - 15 tablet servers, 32
>>>>>>>> tablets/splits.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I've tried a number of different settings of
>>>>>>>>> table.split.threshold.  I started at 1G and bumped it down to 128M and the
>>>>>>>>> cf scan is still ~30 seconds for both.  I've also used fewer rows - 00000 to
>>>>>>>>> 99999 and still see similar performance numbers.  I thought the column
>>>>>>>>> family bloom filter would help deal with large row space but sparsely
>>>>>>>>> populated column space.  Is that correct?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Nov 9, 2012 at 11:49 AM, William Slacum <
>>>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>>>
>>>>>>>>>> I'm more inclined to believe it's because you have to search
>>>>>>>>>> across 10M different rows to find any given column family, since they're
>>>>>>>>>> randomly, and possibly uniformly, distributed. How many tablets are you
>>>>>>>>>> searching across?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox <
>>>>>>>>>> adfaccuser@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, there are 10M possible partitions.  I do not have a hash
>>>>>>>>>>> from value to partition, the data is essentially randomly balanced across
>>>>>>>>>>> all the tablets.  Unlike the bloom filter and intersecting iterator
>>>>>>>>>>> examples, I do not have locality groups turned on and I have data in the cq
>>>>>>>>>>> and the value for both index entries and record entries.  Could this be the
>>>>>>>>>>> issue?  Each record entry has approximately 30 column qualifiers with data
>>>>>>>>>>> in the value for each.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 9, 2012 at 11:41 AM, William Slacum <
>>>>>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I guess assuming you have 10M possible partitions, if you're
>>>>>>>>>>>> using a relatively uniform hash to generate your IDs, you'll average about
>>>>>>>>>>>> 2 per partition. Do you have any index for term/value to partition? This
>>>>>>>>>>>> will help you narrow down your search space to a subset of your partitions.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 9, 2012 at 11:39 AM, William Slacum <
>>>>>>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> That shouldn't be a huge issue. How many rows/partitions do
>>>>>>>>>>>>> you have? How many do you have to scan to find the specific column
>>>>>>>>>>>>> family/doc id you want?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox <
>>>>>>>>>>>>> adfaccuser@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a table set up to use the intersecting iterator pattern.  The
>>>>>>>>>>>>>> table has about 20M records which leads to 20M column families for the
>>>>>>>>>>>>>> data section - 1 unique column family per record.  The index section of
>>>>>>>>>>>>>> the table is not quite as large as the data section.  The rowkey is a
>>>>>>>>>>>>>> random padded integer partition between 0000000 and 9999999.  I turned
>>>>>>>>>>>>>> bloom filters on and used the ColumnFamilyFunctor to get performant
>>>>>>>>>>>>>> column family scans without specifying a range like in the bloom filter
>>>>>>>>>>>>>> examples in the README.  However, my column family scans (without any
>>>>>>>>>>>>>> custom iterator) are still fairly slow - ~30 seconds for a column family
>>>>>>>>>>>>>> batch scan of one record. I've also tried RowFunctor but I see similar
>>>>>>>>>>>>>> performance.  Can anyone shed any light on the performance metrics I'm
>>>>>>>>>>>>>> seeing?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Anthony
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>

--14dae93407dd8ca1c004ce13dbe1--