hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase Writes With Large Number of Columns
Date Wed, 27 Mar 2013 22:33:19 GMT
From http://hbase.apache.org/book.html#hbase.rpc :

Optionally, Cells (KeyValues) can be passed outside of protobufs in
follow-behind Cell blocks (because we can’t protobuf megabytes of
KeyValues or Cells). These CellBlocks are encoded and optionally
compressed.

From IPCUtil, you should find this:

  ByteBuffer buildCellBlock(final Codec codec, final CompressionCodec compressor,
      final CellScanner cells)
Cheers

On Wed, Mar 27, 2013 at 3:28 PM, Asaf Mesika <asaf.mesika@gmail.com> wrote:

> CellBlock == KeyValue?
>
>
> On Thu, Mar 28, 2013 at 12:06 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > For 0.95 and beyond, HBaseClient is able to specify codec classes that
> > encode / compress CellBlock.
> > See the following in HBaseClient#Connection :
> >
> >       builder.setCellBlockCodecClass(this.codec.getClass().getCanonicalName());
> >
> >       if (this.compressor != null) {
> >         builder.setCellBlockCompressorClass(
> >             this.compressor.getClass().getCanonicalName());
> >       }
> > Cheers
> >
> > On Wed, Mar 27, 2013 at 2:52 PM, Asaf Mesika <asaf.mesika@gmail.com>
> > wrote:
> >
> > > Correct me if I'm wrong, but the drop is expected, according to the
> > > following math:
> > >
> > > If you have a Put for a specific row key, and that row key weighs 100
> > > bytes, then with 20 columns you should add the following size on top of
> > > the combined size of the columns:
> > > 20 x (100 bytes) = 2000 bytes
> > > So the size of the Put sent to HBase should be:
> > > 1500 bytes (sum of all column qualifier sizes) + 20 x 100 bytes (copies
> > > of the row key).
> > >
> > > I add this 20 x 100 since, for each column qualifier, the Put object
> > > adds another KeyValue member object, which duplicates the row key.
> > > See here (taken from Put.java, v0.94.3 I think):
> > >
> > >   public Put add(byte[] family, byte[] qualifier, long ts, byte[] value) {
> > >     List<KeyValue> list = getKeyValueList(family);
> > >     KeyValue kv = createPutKeyValue(family, qualifier, ts, value);
> > >     list.add(kv);
> > >     familyMap.put(kv.getFamily(), list);
> > >     return this;
> > >   }
> > >
> > > Each KeyValue also adds more information which should be taken into
> > > account per column qualifier:
> > > * KeyValue overhead - I think 2 longs
> > > * Column family length
> > > * Timestamp - 1 long
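The per-Put accounting above can be sketched as plain arithmetic. This is a rough illustration only: `estimate`, its parameters, and the 2-long KeyValue overhead are assumptions taken from this thread, not HBase's exact on-wire accounting.

```java
public class PutSizeEstimate {

    // Guesses from the thread: "KeyValue overhead - I think 2 longs",
    // plus one long for the timestamp. Not HBase's exact accounting.
    static final int KV_OVERHEAD = 2 * 8;
    static final int TIMESTAMP = 8;

    /**
     * Rough client-side size of one Put: every column becomes its own
     * KeyValue, each carrying a full copy of the row key and family.
     */
    static long estimate(int numColumns, int rowKeyLen, int familyLen,
                         int totalQualifierAndValueBytes) {
        return totalQualifierAndValueBytes
                + (long) numColumns * (rowKeyLen + familyLen + TIMESTAMP + KV_OVERHEAD);
    }

    public static void main(String[] args) {
        // 20 columns, 100-byte row key, 2-byte family, 1500 bytes of payload
        System.out.println(estimate(20, 100, 2, 1500)); // prints 4020
    }
}
```

With 40 columns the fixed per-column terms double while the payload stays at 1.5 KB, which is why doubling the column count can roughly double the bytes shipped per row.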
> > >
> > > I wrote a class to calculate a rough size of the List<Put> sent to
> > > HBase, so I can calculate the throughput:
> > >
> > > public class HBaseUtils {
> > >
> > >     public static long getSize(List<? extends Row> actions) {
> > >         long size = 0;
> > >         for (Row row : actions) {
> > >             size += getSize(row);
> > >         }
> > >         return size;
> > >     }
> > >
> > >     public static long getSize(Row row) {
> > >         if (row instanceof Increment) {
> > >             return calcSizeIncrement((Increment) row);
> > >         } else if (row instanceof Put) {
> > >             return calcSizePut((Put) row);
> > >         } else {
> > >             throw new IllegalArgumentException(
> > >                 "Can't calculate size for Row type " + row.getClass());
> > >         }
> > >     }
> > >
> > >     private static long calcSizePut(Put put) {
> > >         long size = 0;
> > >         size += put.getRow().length;
> > >         Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
> > >         for (byte[] family : familyMap.keySet()) {
> > >             size += family.length;
> > >             for (KeyValue kv : familyMap.get(family)) {
> > >                 size += kv.getLength();
> > >             }
> > >         }
> > >         return size;
> > >     }
> > >
> > >     private static long calcSizeIncrement(Increment row) {
> > >         long size = 0;
> > >         size += row.getRow().length;
> > >         Map<byte[], NavigableMap<byte[], Long>> familyMap = row.getFamilyMap();
> > >         for (byte[] family : familyMap.keySet()) {
> > >             size += family.length;
> > >             for (byte[] qualifier : familyMap.get(family).keySet()) {
> > >                 size += qualifier.length;
> > >                 size += Bytes.SIZEOF_LONG;
> > >             }
> > >         }
> > >         return size;
> > >     }
> > > }
> > >
> > > Feel free to use it.
> > >
> > >
> > >
> > >
> > > On Tue, Mar 26, 2013 at 1:49 AM, Jean-Marc Spaggiari
> > > <jean-marc@spaggiari.org> wrote:
> > >
> > > > For a total of 1.5kb with 4 columns = 384 bytes/column
> > > > bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:384:100
> > > > -num_keys 1000000
> > > > 13/03/25 14:54:45 INFO util.MultiThreadedAction: [W:100] Keys=991664,
> > > > cols=3,8m, time=00:03:55 Overall: [keys/s= 4218, latency=23 ms]
> > > > Current: [keys/s=4097, latency=24 ms], insertedUpTo=-1
> > > >
> > > > For a total of 1.5kb with 100 columns = 15 bytes/column
> > > > bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:15:100
> > > > -num_keys 1000000
> > > > 13/03/25 16:27:44 INFO util.MultiThreadedAction: [W:100] Keys=999721,
> > > > cols=95,3m, time=01:27:46 Overall: [keys/s= 189, latency=525 ms]
> > > > Current: [keys/s=162, latency=616 ms], insertedUpTo=-1
> > > >
> > > > So overall, the per-column speed is about the same - a bit faster
> > > > with 100 columns than with 4. I don't think there is any negative
> > > > impact on the HBase side because of all those columns. Might be
> > > > interesting to test the same thing over Thrift...
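The per-column comparison can be checked by backing column throughput out of the two runs above. This sketch assumes the log fields read as ~3.8M and ~95.3M columns ("cols=3,8m" / "cols=95,3m") over 00:03:55 (235 s) and 01:27:46 (5266 s); the class and method names are illustrative, not from LoadTestTool.

```java
public class ColsPerSecond {

    // columns written per second = total columns / elapsed seconds
    static double colsPerSec(double totalCols, int seconds) {
        return totalCols / seconds;
    }

    public static void main(String[] args) {
        // 4-column run:   ~3.8M columns in 00:03:55 (235 s)
        // 100-column run: ~95.3M columns in 01:27:46 (5266 s)
        System.out.printf("4 cols:   %.0f cols/s%n", colsPerSec(3.8e6, 235));
        System.out.printf("100 cols: %.0f cols/s%n", colsPerSec(95.3e6, 5266));
        // both land in the 16-18k cols/s range, so per-column cost is comparable
    }
}
```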
> > > >
> > > > JM
> > > >
> > > > 2013/3/25 Pankaj Misra <pankaj.misra@impetus.co.in>:
> > > > > Yes Ted, we have observed the Thrift API clearly outperforming the
> > > > > native Java HBase API at higher loads, due to its binary
> > > > > communication protocol.
> > > > >
> > > > > Tariq, the specs of the machine on which we are performing these
> > > > > tests are given below.
> > > > >
> > > > > Processor: i7-3770K, 8 logical cores (4 physical, 2 logical per
> > > > > physical core), 3.5 GHz clock speed
> > > > > RAM: 32 GB DDR3
> > > > > HDD: one 2 TB SATA disk plus two 250 GB SATA disks - 3 disks in total
> > > > > HDFS and HBase deployed in pseudo-distributed mode.
> > > > > We have 4 parallel streams writing to HBase.
> > > > >
> > > > > We used the same setup for the previous tests as well and, to be
> > > > > very frank, we did expect a bit of a drop in performance when we
> > > > > had to test with 40 columns, but did not expect to get half the
> > > > > performance. When we tested with 20 columns, we were consistently
> > > > > getting 200 Mbps of writes. But with 40 columns we are getting only
> > > > > 90 Mbps of throughput on the same setup.
> > > > >
> > > > > Thanks and Regards
> > > > > Pankaj Misra
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > From: Ted Yu [yuzhihong@gmail.com]
> > > > > Sent: Tuesday, March 26, 2013 1:09 AM
> > > > > To: user@hbase.apache.org
> > > > > Subject: Re: HBase Writes With Large Number of Columns
> > > > >
> > > > > bq. These records are being written using batch mutation with
> > > > > thrift API
> > > > >
> > > > > This is important information, I think.
> > > > >
> > > > > Batch mutation through the Java API would incur lower overhead.
> > > > >
> > > > > On Mon, Mar 25, 2013 at 11:40 AM, Pankaj Misra
> > > > > <pankaj.misra@impetus.co.in> wrote:
> > > > >
> > > > >> Firstly, thanks a lot Jean and Ted for your extended help, very
> > > > >> much appreciated.
> > > > >>
> > > > >> Yes Ted, I am writing to all 40 columns, and the 1.5 KB of record
> > > > >> data is distributed across these columns.
> > > > >>
> > > > >> Jean, some columns store as little as a single byte, while a few
> > > > >> of the columns store as much as 80-125 bytes of data. The overall
> > > > >> record size is 1.5 KB. These records are being written using batch
> > > > >> mutation with the Thrift API, where we write 100 records per batch
> > > > >> mutation.
> > > > >>
> > > > >> Thanks and Regards
> > > > >> Pankaj Misra
> > > > >>
> > > > >>
> > > > >> ________________________________________
> > > > >> From: Jean-Marc Spaggiari [jean-marc@spaggiari.org]
> > > > >> Sent: Monday, March 25, 2013 11:57 PM
> > > > >> To: user@hbase.apache.org
> > > > >> Subject: Re: HBase Writes With Large Number of Columns
> > > > >>
> > > > >> I just ran some LoadTest to see if I can reproduce that.
> > > > >>
> > > > >> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 4:512:100
> > > > >> -num_keys 1000000
> > > > >> 13/03/25 14:18:25 INFO util.MultiThreadedAction: [W:100] Keys=997172,
> > > > >> cols=3,8m, time=00:03:55 Overall: [keys/s= 4242, latency=23 ms]
> > > > >> Current: [keys/s=4413, latency=22 ms], insertedUpTo=-1
> > > > >>
> > > > >> bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 100:512:100
> > > > >> -num_keys 1000000
> > > > >>
> > > > >> This one crashed because I don't have enough disk space, so I'm
> > > > >> re-running it, but just before it crashed it was showing about 24.5x
> > > > >> slower, which is coherent since it's writing 25x more columns.
> > > > >>
> > > > >> What size of data do you have? Big cells? Small cells? I will retry
> > > > >> the test above with more lines and keep you posted.
> > > > >>
> > > > >> 2013/3/25 Pankaj Misra <pankaj.misra@impetus.co.in>:
> > > > >> > Yes Ted, you are right, we have the table regions pre-split, and
> > > > >> > we see that both regions are almost evenly filled in both tests.
> > > > >> >
> > > > >> > This does not seem to be a regression though, since we were
> > > > >> > getting good write rates when we had a smaller number of columns.
> > > > >> >
> > > > >> > Thanks and Regards
> > > > >> > Pankaj Misra
> > > > >> >
> > > > >> >
> > > > >> > ________________________________________
> > > > >> > From: Ted Yu [yuzhihong@gmail.com]
> > > > >> > Sent: Monday, March 25, 2013 11:15 PM
> > > > >> > To: user@hbase.apache.org
> > > > >> > Cc: ankitjaincs06@gmail.com
> > > > >> > Subject: Re: HBase Writes With Large Number of Columns
> > > > >> >
> > > > >> > Copying Ankit, who raised the same question soon after Pankaj's
> > > > >> > initial question.
> > > > >> >
> > > > >> > On one hand I wonder if this was a regression in 0.94.5 (though
> > > > >> > unlikely).
> > > > >> >
> > > > >> > Did the region servers receive (relatively) the same write load
> > > > >> > for the second test case? I assume you have pre-split your tables
> > > > >> > in both cases.
> > > > >> >
> > > > >> > Cheers
> > > > >> >
> > > > >> > On Mon, Mar 25, 2013 at 10:18 AM, Pankaj Misra
> > > > >> > <pankaj.misra@impetus.co.in>wrote:
> > > > >> >
> > > > >> >> Hi Ted,
> > > > >> >>
> > > > >> >> Sorry for missing that detail, we are using HBase version 0.94.5
> > > > >> >>
> > > > >> >> Regards
> > > > >> >> Pankaj Misra
> > > > >> >>
> > > > >> >>
> > > > >> >> ________________________________________
> > > > >> >> From: Ted Yu [yuzhihong@gmail.com]
> > > > >> >> Sent: Monday, March 25, 2013 10:29 PM
> > > > >> >> To: user@hbase.apache.org
> > > > >> >> Subject: Re: HBase Writes With Large Number of Columns
> > > > >> >>
> > > > >> >> If you give us the version of HBase you're using, that would
> > > > >> >> give us some more information to help you.
> > > > >> >>
> > > > >> >> Cheers
> > > > >> >>
> > > > >> >> On Mon, Mar 25, 2013 at 9:55 AM, Pankaj Misra
> > > > >> >> <pankaj.misra@impetus.co.in> wrote:
> > > > >> >>
> > > > >> >> > Hi,
> > > > >> >> >
> > > > >> >> > The issue that I am facing is the performance drop of HBase
> > > > >> >> > when going from 20 columns in a column family to 40 columns
> > > > >> >> > in a column family. The number of columns has doubled and the
> > > > >> >> > ingestion/write speed has also dropped by half. I am writing
> > > > >> >> > 1.5 KB of data per row across 40 columns.
> > > > >> >> >
> > > > >> >> > Are there any settings that I should look into for tweaking
> > > > >> >> > HBase to write a higher number of columns faster?
> > > > >> >> >
> > > > >> >> > I would request the community's help to let me know how I can
> > > > >> >> > write to a column family with a large number of columns
> > > > >> >> > efficiently.
> > > > >> >> >
> > > > >> >> > Would greatly appreciate any help / clues around this issue.
> > > > >> >> >
> > > > >> >> > Thanks and Regards
> > > > >> >> > Pankaj Misra
> > > > >> >> >
> > > > >> >> > ________________________________
> > > > >> >> >
> > > > >> >> > NOTE: This message may contain information that is
> > > > >> >> > confidential, proprietary, privileged or otherwise protected
> > > > >> >> > by law. The message is intended solely for the named
> > > > >> >> > addressee. If received in error, please destroy and notify
> > > > >> >> > the sender. Any use of this email is prohibited when received
> > > > >> >> > in error. Impetus does not represent, warrant and/or
> > > > >> >> > guarantee that the integrity of this communication has been
> > > > >> >> > maintained nor that the communication is free of errors,
> > > > >> >> > virus, interception or interference.
> > > > >> >>
> > > > >> >
> > > > >
> > > >
> > >
> >
>
