hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Best practices for loading data into hbase
Date Fri, 31 May 2013 20:23:37 GMT
You cannot use the local job tracker (that is, the one that gets
started if you don't have one running) with the TotalOrderPartitioner.

You'll need to fully install Hadoop on that VMware node.

Google that error to find other relevant comments.
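
For context on why the partitions file matters: a total-order partitioner routes each key to a reducer by looking it up in a sorted list of split points, and that list is what the partitions file holds. The following is a minimal sketch of the lookup only, written with plain Java. It is not Hadoop's TotalOrderPartitioner code; the class name TotalOrderSketch, its methods, and the sample split points are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: the idea behind total-order partitioning, not Hadoop's class. */
public class TotalOrderSketch {
    // Sorted split points (hypothetical values). N split points give N + 1 partitions.
    private final byte[][] splitPoints;

    public TotalOrderSketch(byte[][] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints;
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Returns the index of the partition whose key range contains the key. */
    public int partitionFor(byte[] key) {
        int pos = Arrays.binarySearch(splitPoints, key, TotalOrderSketch::compare);
        // Exact match: the key starts the partition to the right of that split point.
        // No match: binarySearch returns -(insertionPoint) - 1, so recover insertionPoint.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public int numPartitions() {
        return splitPoints.length + 1;
    }

    public static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        TotalOrderSketch p = new TotalOrderSketch(
                new byte[][] { bytes("g"), bytes("n"), bytes("t") });
        for (String k : new String[] { "apple", "g", "kiwi", "zebra" }) {
            System.out.println(k + " -> partition " + p.partitionFor(bytes(k)));
        }
    }
}
```

Without the split points there is nothing to search, which is consistent with the "Can't read partitions file" failure being raised from setConf before any record is partitioned.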


On Fri, May 31, 2013 at 1:19 PM, David Poisson
<David.Poisson@ca.fujitsu.com> wrote:
> Hi,
> We are still very new to all of this HBase/Hadoop/MapReduce stuff. We
> are looking for the best practices that will fit our requirements. We
> are currently using the latest Cloudera VMware image (single node) for
> our development tests.
> The problem is as follows:
> We have multiple sources in different formats (XML, CSV, etc.), which
> are dumps of existing systems. As one might expect, there will be an
> initial "import" of the data into HBase, and afterwards the systems
> would most likely dump whatever data they have accumulated since the
> initial import or since the last data dump. We would also need an
> intermediate step, so that we can ensure all of a source's data can be
> processed successfully, something like:
> XML data file --(MR JOB)--> Intermediate (HBase table or HFile?) --(MR JOB)--> production tables in HBase
> We're guessing we can't use something like a transaction in HBase, so
> we thought about using an intermediate step. Is that how things are
> normally done?
> As we import data into HBase, we will be populating several tables
> that link data parts together (account X in System 1 == account Y in
> System 2) as tuples in 3 tables. Currently, this is done by a
> MapReduce job which reads the XML source and uses
> MultiTableOutputFormat to "put" data into those 3 HBase tables. This
> method isn't that fast on our test sample (2 minutes for 5 MB), so we
> are looking at optimizing the loading of data.
> We have been researching bulk loading, but we are unsure of a couple
> of things:
> Once we process an XML file and populate our 3 "production" HBase
> tables, could we bulk load another XML file and append this new data
> to our 3 tables, or would it overwrite what was written before?
> In order to bulk load, we need to output a file using
> HFileOutputFormat. Since MultiHFileOutputFormat doesn't seem to
> officially exist yet (still in the works, right?), should we process
> our input XML file with 3 MapReduce jobs instead of 1 and output an
> HFile for each, which could then become our intermediate step (if all
> 3 HFiles were created without errors, then the process was successful:
> bulk load into HBase)? Can you experiment with bulk loading on a
> VMware VM? We're experiencing problems with the partitions file not
> being found, with the following exception:
> java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
> Caused by: java.lang.IllegalArgumentException: Can't read partitions file
>         at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)
> We also tried another idea to speed things up: what if, instead of
> doing individual puts, we passed a list of puts to put() (e.g.
> htable.put(putList))? Internally in HBase, would there be less
> overhead than with multiple calls to put()? It seems to be faster;
> however, since we're not using context.write, I'm guessing this will
> lead to problems later on, right?
> Turning off the WAL on puts to speed things up isn't an option, since
> data loss would be unacceptable, even if the chances of a failure
> occurring are slim.
> Thanks, David
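
The list-of-puts idea from the quoted message amounts to client-side batching: collect writes in a buffer and hand them over in groups. The sketch below shows the pattern in plain Java, including the part that is easy to miss when bypassing context.write, namely that whatever is still buffered at the end must be flushed explicitly (in a mapper, that would presumably happen in cleanup()). BufferedWriterSketch and its sink parameter are hypothetical; the sink merely stands in for a call such as htable.put(putList), and nothing here reflects how HBase buffers writes internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative only: a generic write buffer, not an HBase class. */
public class BufferedWriterSketch<T> implements AutoCloseable {
    private final int batchSize;
    // Hypothetical sink; stands in for a batched call such as htable.put(putList).
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    public BufferedWriterSketch(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and sends a batch once the buffer is full. */
    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered, if anything, as one batch. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy so clearing is safe
        buffer.clear();
    }

    /** Flushes the remainder so a partial final batch is not lost. */
    @Override
    public void close() {
        flush();
    }

    public static void main(String[] args) {
        List<List<String>> calls = new ArrayList<>();
        try (BufferedWriterSketch<String> w = new BufferedWriterSketch<>(2, calls::add)) {
            for (String row : new String[] { "r1", "r2", "r3", "r4", "r5" }) {
                w.add(row);
            }
            System.out.println("batches before close: " + calls.size());
        }
        System.out.println("batches after close: " + calls.size());
    }
}
```

With a batch size of 2 and five rows, two full batches go out during the loop and the fifth row is only delivered when close() runs, which is the failure mode to watch for if the flush is forgotten.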
