hbase-user mailing list archives

From onmstester onmstester <onmstes...@zoho.com>
Subject Re: Migrating from Apache Cassandra to Hbase
Date Wed, 12 Sep 2018 04:26:48 GMT
Thank you Josh and Allan. Sorry for the rush; this question has been on my mind for some
months, but I thought I should first get familiar enough with one side of the "vs". I've been
struggling with Cassandra ever since and almost forgot that there was a "vs" in my mind!

One main feature of Cassandra is that, given one key (the partition key), it can retrieve
thousands of rows with a few IOPS, because all rows belonging to a partition are stored in
almost the same place on disk. This is why having 8 partition keys means storing each row in
8 places. Logically, I cannot think of a faster mechanism to load this amount of data than
keeping it in the same place on disk.

I wonder how an indexing mechanism (like HBase's) could match Cassandra's performance,
architecture-wise, when retrieving thousands of rows related to a single partition key. It
would have to load the rows through some foreign key (the index) with multiple accesses,
which means many more IOPS and a much slower read.

I am going to read the HBase documentation (technical and user manuals), launch a test
cluster of more than 10 nodes running my application logic on HBase, and try to tune its
performance, as I have done for Apache Cassandra (that will raise too many questions to ask
in this forum). But I can't wait that long to get an answer to these questions.

Sent using Zoho Mail

---- On Wed, 12 Sep
2018 07:12:05 +0430 Allan Yang <allan163@apache.org> wrote ----

You can use Phoenix + HBase and use indexes in Phoenix. But since you need 8 different kinds
of query, you may need to create 8 different indices and thus 8 index tables. Unlike
Cassandra, however, you do not have to store all the column data redundantly in all tables:
you can use a non-covered index, which is a simple mapping between the indexed column and
the rowkey. So there won't be 8x space.

For the 2nd question: in HBase there won't be a node join-remove problem, since the storage
layer (HDFS) and the computing layer are completely separated. You don't have to move data
when an HBase node joins or leaves.

For the 3rd question, please refer to Josh Elser in the previous reply; it is just
'marketing trash'. HBase is a high-performance, low-latency ONLINE storage system that is
already used massively in many real-time production systems.

Best Regards
Allan Yang

Josh Elser <elserj@apache.org> wrote on Tue, Sep 11, 2018 at 9:26 PM:

> Please be patient in getting a response to questions you post to this
> list as we're all volunteers.
>
> On 9/8/18 2:16 AM, onmstester onmstester
> wrote:
> > Hi, Currently I'm using Apache Cassandra as the backend for my RESTful application.
> > With a cluster of 30 nodes (each with 12 cores, 64 GB RAM and a 6 TB disk, 50% of which
> > is in use), write and read throughput is more than satisfactory for us. The input is a
> > fixed set of long and int columns, and we need to query on every column, so with 8
> > columns there should be 8 tables according to Cassandra's query-planning recommendations.
> > The Cassandra keyspace schema would be something like this:
> >
> >   Table 1 (timebucket, col1, ..., col8, primary key (timebucket, col1))
> >     to handle: select * from input where timebucket = X and col1 = Y
> >   ....
> >   Table 8 (timebucket, col1, ..., col8, primary key (timebucket, col8))
> >
> > So for each input row there would be 8X inserts in Cassandra (not considering RF), and
> > with a TTL of 12 months the production cluster should keep about 2 petabytes of data.
> > With the recommended node density for a Cassandra cluster (2 TB per node), I need a
> > cluster with more than 1000 nodes (which I cannot afford). So, long story short: I'm
> > looking for an alternative to Apache Cassandra for this application. How would HBase
> > solve these problems:
> >
> > 1. 8X data redundancy due to needed queries
>
> HBase provides one intrinsic "index" over the data in your table and
> that is the "rowkey". If you need to access the same data 8 different
> ways, you would need to come up with 8 indexes.
>
> FWIW, this is not what I commonly see. Usually there are 2 or 3 lookups
> that need to happen in the "fast path", not 8. Perhaps you need to take
> another look at your application needs?
>
> > 2. Nodes with large data density (30 TB of data on each node if No. 1 could not be
> > solved in HBase): how would HBase handle compaction and node join-remove problems
> > when there are only 5 * 6 TB 7200 RPM SATA disks available on each node? How much
> > empty space does HBase need for the temporary files of compaction?
>
> HBase uses a distributed filesystem to ensure that data is available to
> be read by any RegionServer. Obviously, that filesystem needs to have
> sufficient capacity to write a new file which is approximately the sum
> of the file sizes being compacted.
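
As a rough illustration of the capacity point above, the free space one compaction needs can be estimated as the sum of the sizes of the files being rewritten. This is only a minimal sketch: the function name, the file sizes and the `replication` parameter are hypothetical, and no HBase or HDFS API is involved.

```python
def compaction_headroom_bytes(file_sizes, replication=1):
    """Upper-bound estimate of the extra space needed to compact the given files.

    The output file is at most about the sum of the inputs (it can be smaller
    when deleted or expired cells are dropped). `replication` is an assumed
    filesystem replication factor; 1 means logical bytes only.
    """
    if replication < 1:
        raise ValueError("replication must be >= 1")
    return sum(file_sizes) * replication


GIB = 1024 ** 3
# Hypothetical store files selected for one compaction.
selected = [4 * GIB, 2 * GIB, 1 * GIB]
print(compaction_headroom_bytes(selected) // GIB)                 # 7
print(compaction_headroom_bytes(selected, replication=3) // GIB)  # 21
```

The estimate is per compaction, so the headroom to plan for depends on how large the biggest set of files compacted together can get, not on the total data per node.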
>
> > 3. Also, I read in some documents (including DataStax's) that HBase is more of an
> > offline & data-lake backend that is better not used as a web application backend
> > which needs a response-time QoS of less than a few seconds.
> >
> > Thanks in advance
> >
> > Sent using Zoho Mail
>
> Sounds like marketing trash to me. The entire premise around HBase's
> architecture is:
>
> * Low latency random writes/updates
> * Low latency random reads
> * High throughput writes via batch tools (e.g. Bulk loading)
>
> IIRC, many early adopters of HBase were using it in the critical-path
> for web applications.
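
To make the index discussion in this thread more concrete, here is a minimal sketch in plain Python (no HBase or Phoenix API is used) of a non-covered index: an index whose sorted keys are (timebucket, column value, rowkey) and which stores no other column data. A query becomes one contiguous range scan over the index followed by point lookups in the main table. All names (`main_table`, `index_col1`, `query_by_col1`) and the data are hypothetical, and the sorted list only stands in for rowkey ordering.

```python
import bisect

# Main table: rowkey -> full row, stored once. Hypothetical data.
main_table = {
    "r1": {"timebucket": 1, "col1": 10, "col2": 7},
    "r2": {"timebucket": 1, "col1": 10, "col2": 8},
    "r3": {"timebucket": 1, "col1": 11, "col2": 7},
    "r4": {"timebucket": 2, "col1": 10, "col2": 9},
}

# Non-covered index on col1: sorted keys (timebucket, col1, rowkey), no row data.
index_col1 = sorted(
    (row["timebucket"], row["col1"], key) for key, row in main_table.items()
)


def query_by_col1(timebucket, value):
    """Emulate: select * from input where timebucket = X and col1 = Y."""
    # Range scan: matching index entries are adjacent because the keys are sorted.
    start = bisect.bisect_left(index_col1, (timebucket, value, ""))
    rowkeys = []
    for tb, v, rowkey in index_col1[start:]:
        if (tb, v) != (timebucket, value):
            break
        rowkeys.append(rowkey)
    # Non-covered index: each hit still needs a lookup in the main table.
    return [main_table[k] for k in rowkeys]


print([r["col2"] for r in query_by_col1(1, 10)])  # [7, 8]
```

The index holds only keys, so each indexed column adds one small entry per row rather than a full copy of the row. The price is the extra lookups into the main table, which is the IOPS concern raised at the top of this message; a covered index would avoid them by copying the needed columns into the index, at the cost of more space.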