accumulo-user mailing list archives

From John Vines <vi...@apache.org>
Subject Re: Performance of table with large number of column families
Date Fri, 09 Nov 2012 17:41:05 GMT
The bloom filter checks only occur on a seek, and the way the column family
filter works is that it seeks and then does a few scans to see if the
appropriate families pop up in the short term. A bloom filter on the column
family would work better if you had larger rows, which would encourage more
seeks and minimize the number of rows that need bloom checks.

The issue is that you are ultimately checking every single row for a
column, which is sparse. It's not that different from doing a full table
regex. If you had locality groups set up it would be more performant, at
least until you end up creating locality groups for everything.
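
To illustrate the cost of checking every row for a sparse column, here is a toy in-memory model in Python. It is only a sketch and not how Accumulo stores or scans data: `build_table`, `scan_for_family`, and `build_family_index` are hypothetical names, the table is a plain dict, and the sizes are scaled down from the ones in this thread. It contrasts a row-by-row walk with a lookup through an index from column family to row, similar in spirit to the term/value-to-partition index asked about further down the thread.

```python
import random


def build_table(num_rows, num_records, seed=0):
    """Toy table: row id -> list of column families, one unique family per record."""
    rng = random.Random(seed)
    table = {row: [] for row in range(num_rows)}
    for rec in range(num_records):
        # Each record lands in a randomly chosen row (partition).
        table[rng.randrange(num_rows)].append(f"doc{rec}")
    return table


def scan_for_family(table, family):
    """Walk rows in sorted order and check each one until the family turns up."""
    rows_checked = 0
    for row in sorted(table):
        rows_checked += 1
        if family in table[row]:
            return row, rows_checked
    return None, rows_checked


def build_family_index(table):
    """Hypothetical secondary index: column family -> row that holds it."""
    return {fam: row for row, fams in table.items() for fam in fams}


if __name__ == "__main__":
    table = build_table(num_rows=10_000, num_records=20_000)
    row, rows_checked = scan_for_family(table, "doc12345")
    print(f"sequential scan: found in row {row} after checking {rows_checked} rows")

    index = build_family_index(table)
    print(f"index lookup: row {index['doc12345']} with a single lookup")
```

In this model the sequential scan touches every row up to the one holding the family, while the index goes straight to it. The real system has RFile indices, bloom filters, and locality groups in play, so treat the numbers as an illustration of the access pattern only.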

The intersecting iterators get their performance by being able to operate
on large rows to avoid the penalty of checking each row. Minimize the
number of partitions you have and it should clear up your issues.
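
As a quick check of the figures in this thread (10M possible partitions, about 20M records with one unique column family each, 32 tablets), the arithmetic can be sketched in Python. `layout_stats` is a hypothetical helper, and it counts only the data section, ignoring index entries.

```python
def layout_stats(num_rows, num_records, num_tablets):
    """Back-of-the-envelope averages for a table with one unique
    column family per record, spread evenly over rows and tablets."""
    return {
        "rows_per_tablet": num_rows / num_tablets,
        "families_per_tablet": num_records / num_tablets,
        "families_per_row": num_records / num_rows,
    }


if __name__ == "__main__":
    # Layout described in the thread: 10M partitions, 20M records, 32 tablets.
    for name, value in layout_stats(10_000_000, 20_000_000, 32).items():
        print(f"{name}: {value:,.1f}")

    # Same data with far fewer partitions (1,000 is an arbitrary example).
    for name, value in layout_stats(1_000, 20_000_000, 32).items():
        print(f"{name}: {value:,.1f}")
```

With 10M partitions each row holds about 2 column families on average, so nearly all of the work is per-row overhead. With 1,000 partitions each row holds about 20,000, which is the "large rows" situation described above.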

John

Sent from my phone, pardon the typos and brevity.
On Nov 9, 2012 12:24 PM, "William Slacum" <wilhelm.von.cloud@accumulo.net>
wrote:

> I'll ask for someone to verify this comment for me (look @ u John W
> Vines), but the bloom filter helps when you have a discrete number of
> column families that will appear across many rows.
>
> On Fri, Nov 9, 2012 at 12:18 PM, Anthony Fox <adfaccuser@gmail.com> wrote:
>
>> Ah, ok, I was under the impression that this would be really fast since I
>> have a column family bloom filter turned on.  Is this not correct?
>>
>>
>> On Fri, Nov 9, 2012 at 12:15 PM, William Slacum <
>> wilhelm.von.cloud@accumulo.net> wrote:
>>
>>> When I said a smaller number of tablets, I really meant a smaller number
>>> of rows :) My apologies.
>>>
>>> So if you're searching for a random column family in a table, like with
>>> a `scan -c <cf>` in the shell, it will start at row 0 and work sequentially
>>> up to row 10000000 until it finds the cf.
>>>
>>>
>>> On Fri, Nov 9, 2012 at 12:11 PM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>
>>>> This scan is without the intersecting iterator.  I'm just trying to
>>>> pull back a single data record at the moment which corresponds to scanning
>>>> for one column family.  I'll try with a smaller number of tablets, but is
>>>> the computation effort the same for the scan I am doing?
>>>>
>>>>
>>>> On Fri, Nov 9, 2012 at 12:02 PM, William Slacum <
>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>
>>>>> So that means you have roughly 312.5k rows per tablet, which means
>>>>> about 625k column families in any given tablet. The intersecting iterator
>>>>> works one row at a time, so I think at any given moment, it will be
>>>>> working through 32 at a time and doing a linear scan through the RFile
>>>>> blocks. With RFile indices, that check is usually pretty fast, but you're
>>>>> having to go through 4 orders of magnitude more data sequentially than you
>>>>> can work on. If you can experiment and re-ingest with a smaller number of
>>>>> tablets, anywhere between 15 and 45, I think you will see better
>>>>> performance.
>>>>>
>>>>> On Fri, Nov 9, 2012 at 11:53 AM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>
>>>>>> Failed to answer the original question - 15 tablet servers, 32
>>>>>> tablets/splits.
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 9, 2012 at 11:52 AM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>>
>>>>>>> I've tried a number of different settings of table.split.threshold.
>>>>>>> I started at 1G and bumped it down to 128M and the cf scan is still ~30
>>>>>>> seconds for both.  I've also used fewer rows - 00000 to 99999 - and still
>>>>>>> see similar performance numbers.  I thought the column family bloom filter
>>>>>>> would help deal with large row space but sparsely populated column space.
>>>>>>> Is that correct?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 9, 2012 at 11:49 AM, William Slacum <
>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>
>>>>>>>> I'm more inclined to believe it's because you have to search across
>>>>>>>> 10M different rows to find any given column family, since they're
>>>>>>>> randomly, and possibly uniformly, distributed. How many tablets are
>>>>>>>> you searching across?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 9, 2012 at 11:45 AM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, there are 10M possible partitions.  I do not have a hash from
>>>>>>>>> value to partition, the data is essentially randomly balanced across
>>>>>>>>> all the tablets.  Unlike the bloom filter and intersecting iterator
>>>>>>>>> examples, I do not have locality groups turned on and I have data in
>>>>>>>>> the cq and the value for both index entries and record entries.  Could
>>>>>>>>> this be the issue?  Each record entry has approximately 30 column
>>>>>>>>> qualifiers with data in the value for each.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Nov 9, 2012 at 11:41 AM, William Slacum <
>>>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>>>
>>>>>>>>>> I guess assuming you have 10M possible partitions, if you're
>>>>>>>>>> using a relatively uniform hash to generate your IDs, you'll average
>>>>>>>>>> about 2 per partition. Do you have any index for term/value to
>>>>>>>>>> partition? This will help you narrow down your search space to a
>>>>>>>>>> subset of your partitions.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 9, 2012 at 11:39 AM, William Slacum <
>>>>>>>>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> That shouldn't be a huge issue. How many rows/partitions do you
>>>>>>>>>>> have? How many do you have to scan to find the specific column
>>>>>>>>>>> family/doc id you want?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 9, 2012 at 11:26 AM, Anthony Fox <adfaccuser@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have a table set up to use the intersecting iterator pattern.  The
>>>>>>>>>>>> table has about 20M records which leads to 20M column families for the
>>>>>>>>>>>> data section - 1 unique column family per record.  The index section of
>>>>>>>>>>>> the table is not quite as large as the data section.  The rowkey is a
>>>>>>>>>>>> random padded integer partition between 0000000 and 9999999.  I turned
>>>>>>>>>>>> bloom filters on and used the ColumnFamilyFunctor to get performant
>>>>>>>>>>>> column family scans without specifying a range like in the bloom filter
>>>>>>>>>>>> examples in the README.  However, my column family scans (without any
>>>>>>>>>>>> custom iterator) are still fairly slow - ~30 seconds for a column family
>>>>>>>>>>>> batch scan of one record. I've also tried RowFunctor but I see similar
>>>>>>>>>>>> performance.  Can anyone shed any light on the performance metrics I'm
>>>>>>>>>>>> seeing?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anthony
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
