hbase-user mailing list archives

From David Koch <ogd...@googlemail.com>
Subject Eliminating rows with many KVs using a custom filter.
Date Fri, 17 Aug 2012 12:58:03 GMT

I implemented and deployed a custom HBase filter. All it does is omit rows
which contain more than <max> KeyValue pairs. The central part is the
implementation of Filter.filterKeyValue():

// "excludeRow" and "numKVs" are reset in the reset() method.
public ReturnCode filterKeyValue(KeyValue kv) {
    if (++numKVs > maxKVs) {
        // Too many KeyValues: flag the row and skip the rest of it.
        excludeRow = true;
        return ReturnCode.NEXT_ROW;
    }
    return ReturnCode.INCLUDE;
}

I was wondering if from a performance point of view it would be faster to
instead override filterRow(List<KeyValue> kvs) and have something like:

public void filterRow(List<KeyValue> kvs) {
    if (kvs.size() > maxKVs) {
        excludeRow = true;
    }
}

The disadvantage I see with this method is that it would have to load the
entire list of kvs for each row first to establish whether or not to drop
the row. This is potentially enough to bring down our cluster - see below.
My implementation, on the other hand, has the overhead of a method call and
a counter check for every KeyValue.
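
To make the trade-off concrete, here is a minimal stand-alone sketch of the
counting variant. It does not use the HBase API: the class name, the
ReturnCode enum and the cellsExamined() helper are hypothetical stand-ins,
and the loop assumes that the scanner stops handing cells to the filter once
NEXT_ROW is returned. Under that assumption the per-KeyValue approach looks
at no more than maxKVs + 1 cells of an outlier row, whereas a list-based
check needs the whole row in memory before it can decide.

```java
// Toy model of the per-KeyValue counting filter; not the HBase API.
class RowSizeFilterSketch {
    // Hypothetical stand-in for HBase's Filter.ReturnCode.
    enum ReturnCode { INCLUDE, NEXT_ROW }

    private final int maxKVs;
    private int numKVs = 0;
    private boolean excludeRow = false;

    RowSizeFilterSketch(int maxKVs) {
        this.maxKVs = maxKVs;
    }

    // Clears per-row state; meant to be called before each new row.
    void reset() {
        numKVs = 0;
        excludeRow = false;
    }

    // Counts cells and asks to skip the row once the limit is exceeded.
    ReturnCode filterKeyValue(String kv) {
        if (++numKVs > maxKVs) {
            excludeRow = true;
            return ReturnCode.NEXT_ROW;
        }
        return ReturnCode.INCLUDE;
    }

    // True if the current row should be dropped from the results.
    boolean filterRow() {
        return excludeRow;
    }

    // Simulates scanning one row of rowSize cells and returns how many
    // cells the filter had to look at (assumes the scan stops on NEXT_ROW).
    public static int cellsExamined(int rowSize, int maxKVs) {
        RowSizeFilterSketch filter = new RowSizeFilterSketch(maxKVs);
        filter.reset();
        int examined = 0;
        for (int i = 0; i < rowSize; i++) {
            examined++;
            if (filter.filterKeyValue("kv" + i) == ReturnCode.NEXT_ROW) {
                break;
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        System.out.println(cellsExamined(10, 100));        // prints 10
        System.out.println(cellsExamined(1_000_000, 100)); // prints 101
    }
}
```

Whether a real region server actually avoids reading the remaining cells of
a skipped row depends on the scanner implementation, so this only
illustrates the filter-side logic, not measured performance.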

I use this filter to eliminate abnormally large rows from the scan - rows
contain about 10 KeyValues on average with low variance but a few outlier
rows contain 1 million+ KeyValue pairs. Doing a simple scan/get of those
large rows brings down our region servers (using batch is not an option).
Hence, the need to eliminate these rows as efficiently as possible from the
processing pipeline.

Thank you,


PS: My options to compare both filter variants on big data are limited
since we have only one HBase cluster - the production one ;-)
