hbase-user mailing list archives

From Ulrich Staudinger <ustaudin...@activequant.com>
Subject Re: Flushing to HDFS sooner
Date Sun, 19 Feb 2012 12:58:20 GMT
Hey there,

On Sun, Feb 19, 2012 at 1:44 PM, Manuel de Ferran <manuel.deferran@gmail.com> wrote:

> Greetings,
> on a testing platform (running HBase-0.90.3 on top of Hadoop-0.20-append),
> we did the following :
> - create a dummy table
> - put a single row
> - get this row from the shell
> - wait a few minutes
> - kill -9 the datanodes
> Because regionservers could not connect to datanodes, they shutdown.
> On restart, the row has vanished. But if we do the same and "flush 'dummy'"
> from the Shell before killing the datanodes, the row is still there.
> Is it related to WAL ? MemStores ? What happened ?
> What are the recommended settings so rows are auto-flushed or at least
> flushed more frequently ?

I can only speak for myself, but I flush manually at sensible intervals,
depending on the amount of data I put in.

I typically store time series data in HBase; in my case, financial time
series means intraday market data. I ran some performance tests and found
that flushing after every row insert kills write performance. The same is
true if I write many thousands of rows before committing. I found a good
balance (data-specific, I assume) in inserting 1000 rows and then flushing,
then the next 1000 rows and flushing again, with one final flush at the end
of processing. By doing so, I have never had any problems with lost data so
far.
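
The batching pattern described above can be sketched roughly as follows. This is only an illustration, not the code I actually run: `BatchedWriter` and its `Sink` interface are hypothetical names invented for the example, and the batch size of 1000 is just the value that worked for my data. With the HBase 0.90 client, the sink would presumably wrap `HTable.put` and `HTable.flushCommits` with auto-flush turned off via `setAutoFlush(false)`; that mapping is an assumption about which kind of flush is meant.

```java
/**
 * Minimal sketch of "write N rows, then flush" batching.
 * BatchedWriter and Sink are hypothetical names, not part of HBase.
 */
public class BatchedWriter<T> {

    /** Hypothetical stand-in for the storage client (e.g. a table handle). */
    public interface Sink<T> {
        void write(T row);   // buffer or send a single row
        void flush();        // push everything written so far
    }

    private final Sink<T> sink;
    private final int batchSize;
    private int pending = 0; // rows written since the last flush

    public BatchedWriter(Sink<T> sink, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Writes one row and flushes once batchSize rows have accumulated. */
    public void add(T row) {
        sink.write(row);
        pending++;
        if (pending >= batchSize) {
            sink.flush();
            pending = 0;
        }
    }

    /** Final flush at the end of processing, if anything is still pending. */
    public void close() {
        if (pending > 0) {
            sink.flush();
            pending = 0;
        }
    }
}
```

Writing 2500 rows with a batch size of 1000 would flush twice during the run and once more in `close()`. The batch size is the knob to tune: too small and the per-flush overhead dominates, too large and more unflushed rows are at risk if something fails.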


Ulrich Staudinger

Connect online: https://www.xing.com/profile/Ulrich_Staudinger
