hbase-user mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject Re: importing dataset, some problems and performance issues
Date Fri, 18 Mar 2011 12:46:33 GMT
On second thought, removing the obsolete regionfolders was easily done 
by hand. This way I can merge regions with the merge tool.

However, I'm still bothered by the (performance) issues I ran into. Any 
advice would be helpful.

On 03/18/2011 11:06 AM, Ferdy Galema wrote:
> After exporting a table of about 30M rows (each row has about 500 
> columns, totalling 400GB of data), there were several issues when 
> trying to import it again into an empty HBase. (HBase version is 
> 0.90.1-CDH3B4, deployed on 15 nodes. LZO is enabled.)
>
> The reason for this export/import is to both reduce the number of 
> regions and clean up regionfolders in the table that are no longer 
> referred to. (I can see this because of the dfs timestamps). Btw, I'm 
> aware of the Merge tool, which can only solve the merging part. The 
> max region size is set to 1GB, which is not an uncommon number judging 
> by other posts about big data sets.
>
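
As a rough sanity check on the figures quoted above, the arithmetic can be sketched as follows. This is only an estimate under stated assumptions: it treats the 400GB as decimal gigabytes and as the size that regions are measured against, which the message does not say (with LZO enabled the on-disk size may well be smaller). The class and method names are made up for illustration.

```java
// Back-of-the-envelope sizing from the numbers in the message.
// Assumptions (not stated in the message): decimal GB, and the 400GB
// figure is the size that counts against the 1GB max region size.
final class ImportSizing {
    static final long ROWS = 30_000_000L;            // "about 30M rows"
    static final long COLUMNS_PER_ROW = 500L;        // "about 500 columns"
    static final long TOTAL_BYTES = 400L * 1_000_000_000L;
    static final long MAX_REGION_BYTES = 1_000_000_000L;
    static final int NODES = 15;

    // Average bytes per row (integer division, rounded down).
    static long bytesPerRow() {
        return TOTAL_BYTES / ROWS;
    }

    // Average bytes per cell (integer division, rounded down).
    static long bytesPerCell() {
        return TOTAL_BYTES / (ROWS * COLUMNS_PER_ROW);
    }

    // Regions needed if every region were filled to the maximum size.
    static long regionCount() {
        return (TOTAL_BYTES + MAX_REGION_BYTES - 1) / MAX_REGION_BYTES;
    }

    // Regions per node if they were spread evenly, rounded up.
    static long regionsPerNode() {
        return (regionCount() + NODES - 1) / NODES;
    }

    public static void main(String[] args) {
        System.out.println("bytes per row:    " + bytesPerRow());
        System.out.println("bytes per cell:   " + bytesPerCell());
        System.out.println("regions:          " + regionCount());
        System.out.println("regions per node: " + regionsPerNode());
    }
}
```

Under these assumptions the table would end up with roughly 400 full regions, or about 27 per node. Regions that have just split are only about half full, so the real count after an import could be noticeably higher.
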
> To eliminate some of the write bottlenecks, I already disabled writing 
> to the WAL by modifying the import tool. (I assume writing to the WAL 
> is not necessary during import as long as no regionservers crash. If one 
> does, I can simply recreate an empty HBase and start over.)
>
> Also, I temporarily set hbase.hstore.compactionThreshold and 
> hbase.hstore.blockingStoreFiles excessively high in order to disable 
> minor compactions during the time of the import. With these changes it 
> still takes about 100 hours to import the data, as opposed to the 6 hours 
> it took to read it. The import starts with a single region on one 
> node, which is split when the maximum size is exceeded. The resulting regions 
> are spread out over the other nodes, so that is not a problem. The first 
> tasks result in regionservers sometimes blocking updates because they 
> are flushing memstores. After a while (around 10% completion of the job) 
> the logs mostly show the "LRU Stats", and sometimes "Updating" / 
> "Opening" statements. Although I presumably disabled minor compactions 
> and no major compaction should be running yet, sometimes I also see 
> Compacting statements. Why is that so? In other words, what does 
> "because Region has references on open" mean?
>
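
The import described above starts with a single region and relies on splits to spread the load. One commonly suggested alternative is to create the table with its regions already split. The sketch below only computes candidate split keys and is not taken from the message. It assumes row keys begin with uniformly distributed bytes (for example a hashed prefix), which the message does not state. The class name, method names and the choice of a fixed-width prefix are hypothetical.

```java
import java.math.BigInteger;

// Computes evenly spaced split keys over a fixed-width, unsigned byte prefix.
// Assumption: row keys are uniformly distributed over that prefix space.
final class PreSplitKeys {

    // Returns numRegions - 1 boundaries that divide the space of all
    // keyBytes-wide prefixes into numRegions near-equal ranges.
    static byte[][] evenSplits(int numRegions, int keyBytes) {
        if (numRegions < 2 || keyBytes < 1) {
            throw new IllegalArgumentException("need numRegions >= 2 and keyBytes >= 1");
        }
        BigInteger space = BigInteger.ONE.shiftLeft(8 * keyBytes); // 256^keyBytes
        BigInteger regions = BigInteger.valueOf(numRegions);
        if (regions.compareTo(space) > 0) {
            throw new IllegalArgumentException("more regions than distinct prefixes");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // i-th boundary = floor(space * i / numRegions)
            BigInteger boundary = space.multiply(BigInteger.valueOf(i)).divide(regions);
            splits[i - 1] = toFixedWidth(boundary, keyBytes);
        }
        return splits;
    }

    // Encodes a non-negative value below 256^width as exactly width
    // big-endian bytes, dropping BigInteger's optional leading sign byte.
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    static String toHex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Four regions over a one-byte prefix: boundaries at 0x40, 0x80, 0xc0.
        for (byte[] key : evenSplits(4, 1)) {
            System.out.println(toHex(key));
        }
    }
}
```

The resulting keys would be handed to whatever table-creation call accepts explicit split keys in the deployed version (for example an `HBaseAdmin.createTable` overload taking a `byte[][]`, if that version has one). If the row keys are not uniformly distributed, evenly spaced boundaries will not balance the load, and the boundaries would have to be sampled from the exported data instead.
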
> Aside from these performance issues, tasks are failing with region 
> offline errors. These are always regions that were just split. The 
> map/reduce framework tolerates these errors, but I thought the splitting 
> process was transparent to the user.
>
> Please correct me if I'm wrong in any of my assumptions.
>
> Ferdy.
