hbase-dev mailing list archives

From Vladimir Rodionov <vrodio...@carrieriq.com>
Subject RE: subtopic for coprocessor hackathon: CCIndex
Date Thu, 09 Dec 2010 19:14:55 GMT
90M records, 118 bytes each ~ 10GB of data (w/o compression)
16 node cluster
300K records per sec = ~35MB per sec -> ~2MB per sec per node
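
For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. It assumes decimal units (1 MB = 10^6 bytes), that the 300K records per sec figure is the aggregate rate for the whole cluster, and that load is spread evenly over the 16 nodes. The function and variable names are made up for illustration.

```python
# Back-of-envelope check of the figures quoted above.
# Assumptions (not stated in the original numbers): decimal units,
# aggregate cluster rate, and an even spread of load across nodes.

MB = 1_000_000
GB = 1_000_000_000


def throughput_summary(records, record_bytes, nodes, records_per_sec):
    """Return data set size and per-cluster / per-node byte throughput."""
    total_bytes = records * record_bytes
    cluster_bytes_per_sec = records_per_sec * record_bytes
    return {
        "total_gb": total_bytes / GB,
        "cluster_mb_per_sec": cluster_bytes_per_sec / MB,
        "node_mb_per_sec": cluster_bytes_per_sec / nodes / MB,
    }


summary = throughput_summary(
    records=90_000_000,
    record_bytes=118,
    nodes=16,
    records_per_sec=300_000,
)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions this gives about 10.6 GB in total, about 35.4 MB per sec across the cluster, and about 2.2 MB per sec per node, which matches the rounded figures above.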

Maybe I am missing something here, but these numbers really are not impressive.
A good old brute-force scan M/R job on a 16-node grid should be much faster.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

From: Andrew Purtell [apurtell@apache.org]
Sent: Thursday, December 09, 2010 10:06 AM
To: dev@hbase.apache.org
Subject: subtopic for coprocessor hackathon: CCIndex

While in Beijing I met with a group at the Institute of Computing at the Chinese Academy of
Sciences who are interested in contributing a secondary indexing scheme for HBase. It is my
understanding that this is the same group that contributed RCFile to Hive. See the links below
for a slide deck and technical report describing what they have done, called CCIndex.

Slides: https://iridiant.s3.amazonaws.com/ccindex_v1.pdf
Paper: https://iridiant.s3.amazonaws.com/CCIndex.pdf

We discussed initially posting their code -- based on 0.20.1 -- up on GitHub and this was
agreed. This should be happening soon.

We also discussed a possible path for contribution of this work in maintainable/distributable
form as a coprocessor-based reimplementation, considering support in the framework for what
CCIndex needs at a low level (I/O concerns), and splitting out the rest into a coprocessor.
I've heard other talk of implementing secondary indexing using a coprocessor foundation. I
think CCIndex is one option on the table, a starting point for discussion.

Best regards,

    - Andy
