hbase-user mailing list archives

From Asaf Mesika <asaf.mes...@gmail.com>
Subject Re: HBase Writes With Large Number of Columns
Date Wed, 27 Mar 2013 21:52:03 GMT
Correct me if I'm wrong, but I think the drop is expected, according to the
following math:

If you have a Put for a specific row key, and that row key weighs 100 bytes,
then with 20 columns you should add the following to the combined size of
the columns:
20 x 100 bytes = 2000 bytes
So the size of the Put sent to HBase should be:
1500 bytes (the combined size of all column qualifiers/values) + 20 x 100
bytes (the row key, duplicated once per column) = 3500 bytes.

I add this 20 x 100 because, for each column qualifier, the Put object adds
another KeyValue member object, which duplicates the row key.
See here (taken from Put.java, v0.94.3 I think):

  public Put add(byte[] family, byte[] qualifier, long ts, byte[] value) {
    List<KeyValue> list = getKeyValueList(family);
    KeyValue kv = createPutKeyValue(family, qualifier, ts, value);
    list.add(kv);
    familyMap.put(kv.getFamily(), list);
    return this;
  }

Each KeyValue also adds more information which should be taken into account
per column qualifier:
* KeyValue overhead - I think 2 longs
* Column family length
* Timestamp - 1 long
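To make the arithmetic above concrete, here is a minimal, self-contained sketch of that per-cell accounting. The constants are my assumptions from the list above (the "2 longs" overhead is a guess, not KeyValue's actual on-disk layout), and PutSizeEstimate is a made-up helper, not an HBase class:

```java
public class PutSizeEstimate {
    // Assumed constants; the "2 longs" of KeyValue overhead is a guess.
    static final int KV_OVERHEAD = 2 * 8; // rough KeyValue bookkeeping, bytes
    static final int TIMESTAMP   = 8;     // one long per cell

    /**
     * Rough wire size of a Put: every cell (column) repeats the row key,
     * the family name, the timestamp, and the KeyValue overhead.
     */
    static long estimate(int rowKeyLen, int familyLen,
                         int numColumns, int totalValueBytes) {
        long perCell = rowKeyLen + familyLen + TIMESTAMP + KV_OVERHEAD;
        return (long) numColumns * perCell + totalValueBytes;
    }

    public static void main(String[] args) {
        // 100-byte row key, 1-byte family, 20 columns, 1500 bytes of values
        System.out.println(estimate(100, 1, 20, 1500));
    }
}
```

With a 100-byte row key and 20 columns, the duplicated per-cell bytes dominate the 1500 bytes of actual values, which is the point of the math above.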

I wrote a class to calculate the rough size of the List<Put> sent to HBase,
so I can calculate the throughput:

import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUtils {

    public static long getSize(List<? extends Row> actions) {
        long size = 0;
        for (Row row : actions) {
            size += getSize(row);
        }
        return size;
    }

    public static long getSize(Row row) {
        if (row instanceof Increment) {
            return calcSizeIncrement((Increment) row);
        } else if (row instanceof Put) {
            return calcSizePut((Put) row);
        } else {
            throw new IllegalArgumentException(
                    "Can't calculate size for Row type " + row.getClass());
        }
    }

    private static long calcSizePut(Put put) {
        long size = put.getRow().length;
        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        for (Map.Entry<byte[], List<KeyValue>> entry : familyMap.entrySet()) {
            size += entry.getKey().length;
            for (KeyValue kv : entry.getValue()) {
                size += kv.getLength();
            }
        }
        return size;
    }

    private static long calcSizeIncrement(Increment row) {
        long size = row.getRow().length;
        Map<byte[], NavigableMap<byte[], Long>> familyMap = row.getFamilyMap();
        for (Map.Entry<byte[], NavigableMap<byte[], Long>> entry
                : familyMap.entrySet()) {
            size += entry.getKey().length;
            for (byte[] qualifier : entry.getValue().keySet()) {
                size += qualifier.length;
                size += Bytes.SIZEOF_LONG;
            }
        }
        return size;
    }
}

Feel free to use it.
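A hypothetical usage sketch: once getSize returns the byte count for a batch, throughput is just bytes over elapsed wall-clock time. ThroughputDemo and its numbers are made up for illustration; in real code the byte count would come from HBaseUtils.getSize(puts) and the time from System.currentTimeMillis() around the flush:

```java
public class ThroughputDemo {
    /** Megabits per second from a byte count and an elapsed time in ms. */
    static double mbps(long bytes, long elapsedMillis) {
        double bits = bytes * 8.0;
        double seconds = elapsedMillis / 1000.0;
        return bits / seconds / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Pretend getSize(puts) summed to 150 MB, written in 6 seconds.
        System.out.println(mbps(150_000_000L, 6_000L));
    }
}
```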




On Tue, Mar 26, 2013 at 1:49 AM, Jean-Marc Spaggiari
<jean-marc@spaggiari.org> wrote:

> For a total of 1.5kb with 4 columns = 384 bytes/column
> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:384:100
> -num_keys 1000000
> 13/03/25 14:54:45 INFO util.MultiThreadedAction: [W:100] Keys=991664,
> cols=3,8m, time=00:03:55 Overall: [keys/s= 4218, latency=23 ms]
> Current: [keys/s=4097, latency=24 ms], insertedUpTo=-1
>
> For a total of 1.5kb with 100 columns = 15 bytes/column
> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:15:100
> -num_keys 1000000
> 13/03/25 16:27:44 INFO util.MultiThreadedAction: [W:100] Keys=999721,
> cols=95,3m, time=01:27:46 Overall: [keys/s= 189, latency=525 ms]
> Current: [keys/s=162, latency=616 ms], insertedUpTo=-1
>
> So overall, the per-column speed is about the same, even a bit faster
> with 100 columns than with 4 (189 keys/s x 100 columns vs. 4218 keys/s
> x 4 columns). I don't think there is any negative impact on the HBase
> side because of all those columns. Might be interesting to test the
> same thing over Thrift...
>
> JM
>
> 2013/3/25 Pankaj Misra <pankaj.misra@impetus.co.in>:
> > Yes Ted, we have observed the Thrift API clearly outperforming the
> > native Java HBase API at higher loads, due to its binary communication
> > protocol.
> >
> > Tariq, the specs of the machine on which we are performing these tests
> > are given below.
> >
> > Processor: Intel Core i7-3770K, 8 logical cores (4 physical cores with
> > 2 logical cores each), 3.5 GHz clock speed
> > RAM: 32 GB DDR3
> > HDD: One 2 TB SATA disk and two 250 GB SATA disks - 3 disks in total
> > HDFS and HBase deployed in pseudo-distributed mode.
> > We have 4 parallel streams writing to HBase.
> >
> > We used the same setup for the previous tests as well and, to be very
> > frank, we did expect a bit of a drop in performance when testing with
> > 40 columns, but did not expect to get half the performance. When we
> > tested with 20 columns, we were consistently getting 200 Mbps of
> > writes, but with 40 columns we are getting only 90 Mbps of throughput
> > on the same setup.
> >
> > Thanks and Regards
> > Pankaj Misra
> >
> >
> > ________________________________________
> > From: Ted Yu [yuzhihong@gmail.com]
> > Sent: Tuesday, March 26, 2013 1:09 AM
> > To: user@hbase.apache.org
> > Subject: Re: HBase Writes With Large Number of Columns
> >
> > bq. These records are being written using batch mutation with thrift API
> > That is an important piece of information, I think.
> >
> > Batch mutation through Java API would incur lower overhead.
> >
> > On Mon, Mar 25, 2013 at 11:40 AM, Pankaj Misra
> > <pankaj.misra@impetus.co.in>wrote:
> >
> >> Firstly, thanks a lot Jean and Ted for your extended help, very much
> >> appreciated.
> >>
> >> Yes Ted, I am writing to all 40 columns, and 1.5 KB of record data is
> >> distributed across these columns.
> >>
> >> Jean, some columns store as little as a single byte, while a few of
> >> the columns store as much as 80-125 bytes of data. The overall record
> >> size is 1.5 KB. These records are being written using batch mutation
> >> with the Thrift API, wherein we write 100 records per batch mutation.
> >>
> >> Thanks and Regards
> >> Pankaj Misra
> >>
> >>
> >> ________________________________________
> >> From: Jean-Marc Spaggiari [jean-marc@spaggiari.org]
> >> Sent: Monday, March 25, 2013 11:57 PM
> >> To: user@hbase.apache.org
> >> Subject: Re: HBase Writes With Large Number of Columns
> >>
> >> I just ran some LoadTest to see if I can reproduce that.
> >>
> >> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:512:100
> >> -num_keys 1000000
> >> 13/03/25 14:18:25 INFO util.MultiThreadedAction: [W:100] Keys=997172,
> >> cols=3,8m, time=00:03:55 Overall: [keys/s= 4242, latency=23 ms]
> >> Current: [keys/s=4413, latency=22 ms], insertedUpTo=-1
> >>
> >> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:512:100
> >> -num_keys 1000000
> >>
> >> This one crashed because I don't have enough disk space, so I'm
> >> re-running it, but just before it crashed it was showing about 24.5x
> >> slower, which is coherent since it's writing 25x more columns.
> >>
> >> What size of data do you have? Big cells? Small cells? I will retry
> >> the test above with more lines and keep you posted.
> >>
> >> 2013/3/25 Pankaj Misra <pankaj.misra@impetus.co.in>:
> >> > Yes Ted, you are right, we have the table regions pre-split, and we
> >> > see that both regions are almost evenly filled in both tests.
> >> >
> >> > This does not seem to be a regression though, since we were getting
> >> > good write rates when we had fewer columns.
> >> >
> >> > Thanks and Regards
> >> > Pankaj Misra
> >> >
> >> >
> >> > ________________________________________
> >> > From: Ted Yu [yuzhihong@gmail.com]
> >> > Sent: Monday, March 25, 2013 11:15 PM
> >> > To: user@hbase.apache.org
> >> > Cc: ankitjaincs06@gmail.com
> >> > Subject: Re: HBase Writes With Large Number of Columns
> >> >
> >> > Copying Ankit who raised the same question soon after Pankaj's initial
> >> > question.
> >> >
> >> > On one hand I wonder if this was a regression in 0.94.5 (though
> >> unlikely).
> >> >
> >> > Did the region servers receive (relatively) the same write load for
> >> > the second test case? I assume you have pre-split your tables in
> >> > both cases.
> >> >
> >> > Cheers
> >> >
> >> > On Mon, Mar 25, 2013 at 10:18 AM, Pankaj Misra
> >> > <pankaj.misra@impetus.co.in>wrote:
> >> >
> >> >> Hi Ted,
> >> >>
> >> >> Sorry for missing that detail, we are using HBase version 0.94.5
> >> >>
> >> >> Regards
> >> >> Pankaj Misra
> >> >>
> >> >>
> >> >> ________________________________________
> >> >> From: Ted Yu [yuzhihong@gmail.com]
> >> >> Sent: Monday, March 25, 2013 10:29 PM
> >> >> To: user@hbase.apache.org
> >> >> Subject: Re: HBase Writes With Large Number of Columns
> >> >>
> >> >> If you give us the version of HBase you're using, that would give us
> >> some
> >> >> more information to help you.
> >> >>
> >> >> Cheers
> >> >>
> >> >> On Mon, Mar 25, 2013 at 9:55 AM, Pankaj Misra <
> >> pankaj.misra@impetus.co.in
> >> >> >wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > The issue that I am facing is the performance drop of HBase when
> >> >> > I had 20 columns in a column family vs. now, when I have 40
> >> >> > columns in a column family. The number of columns has doubled and
> >> >> > the ingestion/write speed has dropped by half. I am writing 1.5 KB
> >> >> > of data per row across 40 columns.
> >> >> >
> >> >> > Are there any settings that I should look into for tweaking HBase
> >> >> > to write a higher number of columns faster?
> >> >> >
> >> >> > I would request the community's help to let me know how I can
> >> >> > write to a column family with a large number of columns
> >> >> > efficiently.
> >> >> >
> >> >> > Would greatly appreciate any help / clues around this issue.
> >> >> >
> >> >> > Thanks and Regards
> >> >> > Pankaj Misra
> >> >> >
> >> >> > ________________________________
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > NOTE: This message may contain information that is confidential,
> >> >> > proprietary, privileged or otherwise protected by law. The message
> is
> >> >> > intended solely for the named addressee. If received in error,
> please
> >> >> > destroy and notify the sender. Any use of this email is prohibited
> >> when
> >> >> > received in error. Impetus does not represent, warrant and/or
> >> guarantee,
> >> >> > that the integrity of this communication has been maintained nor
> that
> >> the
> >> >> > communication is free of errors, virus, interception or
> interference.
> >> >> >
> >> >>
> >> >
> >>
> >
>
