hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject RE: subtopic for coprocessor hackathon: CCIndex
Date Sun, 12 Dec 2010 19:40:28 GMT
My interest here is in thinking about how HBase Coprocessors can support their use case, and
in making people aware that this team at ICT CAS is interested in contributing their work.

However, the ICT team is not on the list, so I took this question to them directly. Below is their response:

In the micro benchmark the throughput is almost 47,000 records/s; the row size is 1KB and there
are 3 nodes, so the per-node throughput is 15.3MB/s.
In the synthetic application benchmark the row size is 118 bytes and the cluster throughput is
300,000 records/s, so if we calculate the I/O rate by record count it is 2.11MB/s per node.
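The per-node figures above follow directly from record rate x record size / node count. A quick sketch of the arithmetic, assuming binary units (1KB = 1024 bytes, 1MB = 2^20 bytes) and the 16-node cluster size cited later in this thread, which together reproduce the quoted numbers:

```python
# Back-of-the-envelope check of the throughput figures quoted above.
# Node counts and record sizes come from the thread; the 16-node count
# for the synthetic benchmark is taken from Vladimir's reply below.

def per_node_mb_per_sec(records_per_sec, record_bytes, nodes):
    """Aggregate byte rate divided across nodes, in MiB/s."""
    return records_per_sec * record_bytes / nodes / (1024 * 1024)

# Micro benchmark: 47,000 records/s of 1KB rows on 3 nodes.
micro = per_node_mb_per_sec(47_000, 1024, 3)

# Synthetic benchmark: 300,000 records/s of 118-byte rows on 16 nodes.
synthetic = per_node_mb_per_sec(300_000, 118, 16)

print(f"micro: {micro:.1f} MB/s per node")          # -> 15.3, as quoted
print(f"synthetic: {synthetic:.2f} MB/s per node")  # -> 2.11, as quoted
```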

In our opinion, there are two reasons:

In our tests we found that HBase's performance depends on the number of records. For example,
in experiments on the 1GB data set (micro benchmark), when we split the 1GB into 0.1M records
of 10KB each, get, put, and scan all performed much better than with 1M records of 1KB each.

On the other hand, the synthetic application benchmark uses multi-dimensional range queries
to fetch data. For example, the query in our paper looks like:
"select * from ServiceTime where (primaryKey > k1 and primaryKey < k2) and (time >
k3 and time < k4) and (service = 'CPU Load')".
If we choose the CCIT indexed by "time" to scan the data, records whose "primaryKey"
and "service" values do not satisfy the predicates are filtered out and not counted in this test.
So we cannot calculate the I/O rate from the record count. That is the primary reason.
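The point about index scans reading more rows than they return can be illustrated with a small sketch. The record layout and the `scan_by_time_index` helper below are made up for illustration and are not CCIndex's actual API; the sketch only shows why counting returned records understates I/O:

```python
# Illustrative sketch of the multi-dimensional query above: scan the
# index (CCIT) on "time" for the range predicate, then apply the
# remaining predicates (primaryKey range, service equality) as
# residual filters on the scanned rows.

records = [
    {"primaryKey": 10, "time": 5,  "service": "CPU Load"},
    {"primaryKey": 50, "time": 6,  "service": "CPU Load"},
    {"primaryKey": 50, "time": 7,  "service": "Memory"},
    {"primaryKey": 99, "time": 20, "service": "CPU Load"},
]

def scan_by_time_index(rows, k3, k4):
    """Simulates scanning the table indexed by 'time': every row in the
    time range is read, including rows the other predicates reject."""
    return [r for r in rows if k3 < r["time"] < k4]

# Rows actually read via the time index...
scanned = scan_by_time_index(records, 4, 10)

# ...versus rows that survive the residual primaryKey/service filters.
result = [r for r in scanned
          if 20 < r["primaryKey"] < 90 and r["service"] == "CPU Load"]

print(len(scanned), len(result))  # prints "3 1": 3 rows read, 1 returned
```

Because I/O is proportional to the rows scanned, not the rows returned, a records-returned-per-second figure cannot be converted directly into a per-node disk rate.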

Best regards,

    - Andy

--- On Thu, 12/9/10, Vladimir Rodionov <vrodionov@carrieriq.com> wrote:

> From: Vladimir Rodionov <vrodionov@carrieriq.com>
> Subject: RE: subtopic for coprocessor hackathon: CCIndex
> To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> Date: Thursday, December 9, 2010, 11:14 AM
> 90M records, 118 bytes each ~ 10GB of
> data (w/o compression)
> 16 node cluster
> 300K records per sec = ~35MB/s -> ~2MB per sec per node
> Maybe there is something I missed here, but these numbers
> really do not impress. A good old brute-force scan M/R job on a
> 16 node grid should be much faster. 
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
> ________________________________________
> From: Andrew Purtell [apurtell@apache.org]
> Sent: Thursday, December 09, 2010 10:06 AM
> To: dev@hbase.apache.org
> Subject: subtopic for coprocessor hackathon: CCIndex
> While in Beijing I met with a group at the Institute of
> Computing at the Chinese Academy of Sciences who are
> interested in contributing a secondary indexing scheme for
> HBase. It is my understanding this is the same group that
> contributed RCFile to Hive. See at the links below a slide
> deck and technical report describing what they have done,
> called CCIndex.
> Slides: https://iridiant.s3.amazonaws.com/ccindex_v1.pdf
> Paper: https://iridiant.s3.amazonaws.com/CCIndex.pdf
> We discussed initially posting their code -- based on
> 0.20.1 -- up on GitHub and this was agreed. This should be
> happening soon.
> We also discussed a possible path for contributing this
> work in maintainable/distributable form as a coprocessor-based
> reimplementation: adding support in the framework
> for what CCIndex needs at a low level (I/O concerns), and
> splitting out the rest into a coprocessor. I've heard other
> talk of implementing secondary indexing on a coprocessor
> foundation. I think CCIndex is one option on the table, a
> starting point for discussion.
> Best regards,
>     - Andy

