hbase-dev mailing list archives

From Atri Sharma <atri.j...@gmail.com>
Subject Benchmarking and improvement of HBase's performance for a common bulk data workload
Date Sat, 27 Apr 2013 05:20:46 GMT
Hi all,

I have been discussing the following style of workload with Priyank
sir, and whether we can improve HBase's performance in this area.
The use case is as follows:

1) Bulk load data.
2) Query the data multiple times (mostly read access, with no real-time writes).
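
As a rough illustration of the two phases above, here is a minimal timing
harness in Python. It is only a sketch: the dict named `store` is a
hypothetical stand-in for a table and makes no HBase calls, and the function
names (`bulk_load`, `read_pass`, `run_benchmark`) are made up for this example.
A real benchmark would replace the bodies of `bulk_load` and `read_pass` with
calls to the actual cluster.

```python
import time


def time_phase(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def bulk_load(store, rows):
    """Phase 1: load every row in one batch (stand-in for a real bulk load)."""
    store.update(rows)
    return len(store)


def read_pass(store, keys):
    """Phase 2: read every key once and count how many were found."""
    return sum(1 for key in keys if store.get(key) is not None)


def run_benchmark(rows, passes=3):
    """Time one bulk load followed by several read-only passes."""
    store = {}  # hypothetical in-memory stand-in, not an HBase table
    loaded, load_secs = time_phase(bulk_load, store, rows)
    keys = list(rows)
    found = 0
    read_secs = []
    for _ in range(passes):
        found, secs = time_phase(read_pass, store, keys)
        read_secs.append(secs)
    return {
        "loaded": loaded,
        "found": found,
        "load_secs": load_secs,
        "read_secs": read_secs,
    }


rows = {f"row-{i:05d}": f"value-{i}" for i in range(1000)}
report = run_benchmark(rows)
```

Keeping the load time separate from the per-pass read times makes it easier
to see which phase a given change actually affects.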

This is a common workload, and I am very interested in benchmarking
HBase's performance in this area, as well as in improving it further.

Please advise me on how I can proceed with benchmarking. Specifically,
how will I need to set up an HBase cluster, and will the cluster have
any specific requirements for this type of testing?

I worked on a patch to improve performance for a similar use case in
PostgreSQL. The workload there was a bulk load of data, followed by a
large number of mostly read-only queries, and then deletion of the data.

The optimization I targeted was the cost of writes to disk.
Specifically, setting the flags (hint bits) that track the commit
status of the inserting/deleting transaction was causing write overhead.
I tried to mitigate this with a cache that holds the transaction ID
for the workload described above, thereby reducing the cost of
writes.
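
To make the idea concrete, here is a toy sketch in Python. It is not the
actual PostgreSQL patch: the class `CommitStatusCache`, the `commit_log` dict,
and the representation of a row as just its inserting transaction ID are all
assumptions made for illustration. The point it shows is that when a bulk load
uses only a few transactions, a small cache keyed by transaction ID can answer
most commit-status checks without touching (and dirtying) the stored rows.

```python
from collections import OrderedDict


class CommitStatusCache:
    """Toy LRU cache mapping transaction ID -> committed flag."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, xid, slow_lookup):
        """Return the commit status of xid, calling slow_lookup only on a miss."""
        if xid in self._entries:
            self._entries.move_to_end(xid)  # mark as most recently used
            self.hits += 1
            return self._entries[xid]
        self.misses += 1
        status = slow_lookup(xid)
        self._entries[xid] = status
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return status


def count_visible(rows, cache, commit_log):
    """Count rows whose inserting transaction committed, without modifying rows."""
    return sum(1 for xid in rows if cache.lookup(xid, commit_log.__getitem__))


# Hypothetical data: two transactions, one committed and one aborted.
commit_log = {100: True, 101: False}
rows = [100] * 5 + [101] * 3  # bulk load: few transactions, many rows
cache = CommitStatusCache()
visible = count_visible(rows, cache, commit_log)
```

In this run only two of the eight checks fall through to `commit_log`; the
other six are served from the cache, and no row is ever written back.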

I will start on the benchmarking once I have the system set up, and
then work out the tests. Once I have an outline in mind, I shall post
it to the list.

I will need a lot of guidance from the community on this.

Thoughts/Comments/Advice please?




