hbase-user mailing list archives

From Ferdy Galema <ferdy.gal...@kalooga.com>
Subject Re: importing dataset, some problems and performance issues
Date Tue, 22 Mar 2011 00:22:32 GMT
These methods will certainly be helpful whenever I need to do a heavy
import. For now I got away with manually cleaning my regions/stores and
merging the data. I thought importing/exporting was the easy way to do
that, but I guess that's not (yet) true.

On 03/21/2011 09:48 PM, Jean-Daniel Cryans wrote:
> What you are describing is usually solved by either:
> - pre-creating the regions so that you don't have to go through the
> "growing pains" of a new, virgin table. Use this sort of method:
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
> - using the bulk loader: http://hbase.apache.org/bulk-loads.html
> J-D
> On Fri, Mar 18, 2011 at 5:46 AM, Ferdy Galema<ferdy.galema@kalooga.com>  wrote:
>> On second thought, removing the obsolete region folders was easily done by
>> hand. This way I can merge regions with the merge tool.
>> However, I'm still bothered by the (performance) issues I ran into. Any
>> advice would be helpful.
>> On 03/18/2011 11:06 AM, Ferdy Galema wrote:
>>> After exporting a table of about 30M rows (each row has about 500 columns,
>>> totalling 400GB of data), there were several issues when trying to import it
>>> again on an empty HBase. (HBase version is 0.90.1-CDH3B4, deployed on 15
>>> nodes. LZO is enabled.)
>>> The reason for this export/import is to both reduce the number of regions
>>> and clean up region folders in the table that are no longer referred to. (I
>>> can see this from the dfs timestamps.) Btw, I'm aware of the Merge
>>> tool, which can only solve the merging part. The max region size is set to
>>> 1GB, which is not an uncommon number judging by other posts concerning
>>> big data sets.
>>> To eliminate some of the write bottlenecks, I already disabled writing to
>>> the WAL by modifying the import tool. (I assume writing to the WAL is not
>>> necessary during import as long as no regionservers crash. If one does, I can
>>> simply recreate an empty hbase and start over.)
>>> Also, I temporarily set hbase.hstore.compactionThreshold and
>>> hbase.hstore.blockingStoreFiles excessively high in order to disable minor
>>> compactions during the time of the import. With these changes it still takes
>>> about 100 hours to import the data, as opposed to the 6 hours it took to read it.
>>> The import starts with a single region on one node, which is split when the
>>> size is exceeded. The resulting regions are spread out over the other nodes,
>>> so that's not a problem. The first tasks result in regionservers sometimes
>>> blocking updates because they're flushing memstores. After a while (around 10%
>>> completion of the job) the logs mostly show the "LRU Stats", and sometimes
>>> "Updating" / "Opening" statements. Although I presumably disabled minor
>>> compactions and no major compaction should be running yet, sometimes I also see
>>> "Compacting" statements. Why is that so? In other words, what does "because
>>> Region has references on open" mean?
>>> Aside from these performance issues, tasks are failing with region offline
>>> errors. These are always regions that were just split. The map/reduce
>>> framework tolerates these errors, but I thought the splitting process was
>>> transparent to the user.
>>> Please correct me if I'm wrong in any of my assumptions.
>>> Ferdy.
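
As a rough illustration of the pre-creation advice above: createTable(HTableDescriptor, byte[][]) takes the region boundaries as an array of split keys, so N regions need N - 1 keys. The sketch below only computes such keys and does not talk to HBase. It assumes row keys are spread evenly over a fixed-width binary prefix, which the thread does not state, and the names split_keys and key_width are made up for this example. The figure of roughly 400 regions is simply the 400GB of data divided by the 1GB max region size mentioned in the thread.

```python
from typing import List


def split_keys(num_regions: int, key_width: int = 4) -> List[bytes]:
    """Return num_regions - 1 evenly spaced split keys, each key_width bytes long.

    Assumption (not from the thread): row keys are uniformly distributed
    over the first key_width bytes of the key space.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    key_space = 256 ** key_width
    if num_regions > key_space:
        raise ValueError("key_width is too small for this many regions")
    # Boundary i sits at i/num_regions of the key space, encoded big-endian
    # so that byte-wise ordering matches numeric ordering.
    return [
        (i * key_space // num_regions).to_bytes(key_width, "big")
        for i in range(1, num_regions)
    ]


# Roughly 400GB of data at a 1GB max region size suggests about 400 regions.
keys = split_keys(400)
print(len(keys), keys[0].hex(), keys[-1].hex())
```

The resulting byte strings would then be passed, in order, as the byte[][] argument when creating the empty table before starting the import. If the real keys are not evenly distributed, the boundaries would have to be sampled from the exported data instead.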
