accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Scan vs Filter performance.
Date Tue, 29 Sep 2015 15:16:01 GMT
On Tue, Sep 29, 2015 at 12:59 AM, mohit.kaushik <mohit.kaushik@orkash.com>
wrote:

> Hi Keith,
>
> When we fetch a column or column family, it seems it does not seek and
> only scans by filtering the key/value pairs. But as you said, if I design a
> custom iterator to fetch a column family, it may work faster.
>

When column families are fetched, Accumulo will seek[1].  It first tries to
read up to 10 cells, and seeks only if that does not reach a wanted family.
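
The "read a few cells, then seek" behaviour can be illustrated with a toy
model. This is only a sketch under stated assumptions, not Accumulo's
ColumnFamilySkippingIterator: a cell is reduced to a {row, family} pair in a
sorted in-memory list, the names FamilySkipSketch, buildTable and countWork are
hypothetical, and a seek is modelled as a jump that costs one unit no matter
how many cells it passes over.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; this is not Accumulo's ColumnFamilySkippingIterator.
// A cell is a {row, family} pair and the list is sorted by row, then family.
class FamilySkipSketch {
    // Cheap next() calls to try before seeking (the "10 cells" in the text above).
    static final int MAX_NEXT_BEFORE_SEEK = 10;

    // Hypothetical helper: every row holds families f00, f01, ... in sorted order.
    static List<String[]> buildTable(int rows, int familiesPerRow) {
        List<String[]> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int f = 0; f < familiesPerRow; f++) {
                cells.add(new String[] {String.format("r%03d", r), String.format("f%02d", f)});
            }
        }
        return cells;
    }

    // Returns {cells read with next(), seeks issued, cells in the wanted family}.
    static int[] countWork(List<String[]> cells, String wanted) {
        int reads = 0, seeks = 0, matches = 0;
        int i = 0;
        while (i < cells.size()) {
            if (cells.get(i)[1].equals(wanted)) {
                matches++;
                reads++;
                i++;
                continue;
            }
            // Unwanted family: try a bounded number of cheap next() calls first.
            int tries = 0;
            while (i < cells.size() && !cells.get(i)[1].equals(wanted)
                    && tries < MAX_NEXT_BEFORE_SEEK) {
                i++;
                reads++;
                tries++;
            }
            // Still on an unwanted family after the budget is spent: seek.
            if (i < cells.size() && !cells.get(i)[1].equals(wanted)) {
                String row = cells.get(i)[0];
                boolean wantedStillAhead = cells.get(i)[1].compareTo(wanted) < 0;
                seeks++;
                // Model the seek as a jump: to the wanted family in this row if it
                // is still ahead, otherwise to the start of the next row.
                while (i < cells.size() && cells.get(i)[0].equals(row)
                        && (!wantedStillAhead || cells.get(i)[1].compareTo(wanted) < 0)) {
                    i++;
                }
            }
        }
        return new int[] {reads, seeks, matches};
    }
}
```

In this model, rows with only a handful of families are covered by next()
calls alone and no seek is ever issued, while rows with dozens of families
trigger a seek per row, e.g. countWork(buildTable(3, 30), "f25").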

When fetching a family and qualifier, two iterators are used: the
ColumnFamilySkippingIterator and the ColumnQualifierFilter.  The
ColumnQualifierFilter scans all qualifiers within a family[2].
The system configures the qualifier filter to use the family skipping iter
as its source[3], so the scan can still seek between families.
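
The stacking described above can be sketched with plain java.util iterators.
This is an illustration under assumptions, not the Accumulo classes:
CountingFilter is a hypothetical name, a cell is just a {family, qualifier}
pair, and the family stage here is an ordinary filter standing in for the
real skipping iterator (which would seek rather than read every cell).

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical stand-in for a filtering iterator stacked on a source iterator.
// It counts every cell it pulls from its source, accepted or not.
// Assumes the source never yields null.
class CountingFilter<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> accept;
    private T nextMatch;
    int examined = 0;

    CountingFilter(Iterator<T> source, Predicate<T> accept) {
        this.source = source;
        this.accept = accept;
        advance();
    }

    // Pull cells from the source until one passes the predicate or the source ends.
    private void advance() {
        nextMatch = null;
        while (source.hasNext()) {
            T cell = source.next();
            examined++;
            if (accept.test(cell)) {
                nextMatch = cell;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return nextMatch != null;
    }

    @Override
    public T next() {
        if (nextMatch == null) {
            throw new NoSuchElementException();
        }
        T result = nextMatch;
        advance();
        return result;
    }
}
```

Stacking a qualifier stage on a family stage, for example
new CountingFilter<>(familyStage, c -> c[1].equals("q2")), shows the point made
above: the qualifier stage examines every cell of the fetched family, but never
sees cells from families the lower stage already dropped.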


>
> But I want to know what would happen if I define a locality group
> for the column family and run the same custom iterator on it, which both
> scans and seeks. What would be the impact on performance (gain or loss)?
>

Like Josh said, it really depends on your situation. It's hard to offer an
opinion w/o knowing more about the schema and the queries.

Below I expanded on what Josh mentioned.

If you have a locality group, it can really help in the case where you have
many rows that have a few families.  For example if you have 10^7 rows in a
tablet and only 10^3 have a certain column family that's in a locality
group, it can make it very fast to find those 1000 rows.  W/o a locality
group even w/ seeking, you would still be seeking to each row.

Conversely, suppose you have 10^2 rows in a tablet, each having many families.
If the column family you are interested in only exists in 10 rows, you will
still need to seek for each row to find it, but ~100 seeks is not so bad.
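
The two cases above come down to back-of-the-envelope arithmetic, sketched
here. The model is an assumption made for illustration (about one seek per row
w/o a locality group, and only the rows containing the family visited with
one); SeekEstimate and its methods are hypothetical names, and real costs
depend on the schema and the data.

```java
// Rough cost model only; real seek counts depend on the schema and the data.
class SeekEstimate {
    // Without a locality group: assume about one seek per row in the tablet.
    static long seeksWithoutGroup(long rowsInTablet) {
        return rowsInTablet;
    }

    // With the family in its own locality group: assume only rows that
    // actually contain the family are visited.
    static long rowsVisitedWithGroup(long rowsWithFamily) {
        return rowsWithFamily;
    }

    // How many times less work the locality group does under this model.
    static long savingsFactor(long rowsInTablet, long rowsWithFamily) {
        return seeksWithoutGroup(rowsInTablet) / rowsVisitedWithGroup(rowsWithFamily);
    }
}
```

For the first case, savingsFactor(10_000_000L, 1_000L) is 10^4 under this
model. For the second, only about 100 seeks are needed even w/o a locality
group, so there is little to gain.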



[1]:
https://github.com/apache/accumulo/blob/1.6.3/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java#L65
[2]:
https://github.com/apache/accumulo/blob/1.6.3/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java#L54
[3]:
https://github.com/apache/accumulo/blob/1.6.3/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L2005


>
> Thanks
> Mohit Kaushik
>
>
> On 09/28/2015 10:49 PM, Moises Baly wrote:
>
> Hi Keith,
>
> No I wasn't aware of that. So I'll move forward with the custom iterator.
>
> Thank you for your time,
>
> Moises
>
> On Mon, Sep 28, 2015 at 12:35 PM, Keith Turner <keith@deenlo.com> wrote:
>
>> On Mon, Sep 28, 2015 at 12:19 PM, Moises Baly <moises@spatially.com>
>> wrote:
>>
>>> Hi all:
>>>
>>> I would like to perform a range scan on a table, tweaking the definition
>>> of what goes into a particular key range. One way I can think of is writing
>>> a filter on the key, and that would work fine. But I think it would be slow
>>> compared to a scan / seek custom iterator. How does the underlying logic
>>> work? Does Filter go through all records, or, since the data is sorted,
>>> does it follow the same underlying logic as a scan? Would a custom
>>> iterator perform better?
>>>
>>
>> Yes, filter will read all data.  Custom iterator that seeks may be faster.
>>
>> Are you aware of the following?
>>
>> https://issues.apache.org/jira/browse/ACCUMULO-3961
>> https://github.com/apache/accumulo/pull/42
>>
>>
>>>
>>> Thank you for your time,
>>>
>>> Moises
>>>
>>
>>
>
>
