From: David Poisson
To: user@hbase.apache.org
Date: Fri, 31 May 2013 16:19:56 -0400
Subject: Best practices for loading data into hbase

Hi,

We are still very new at all of this hbase/hadoop/mapreduce stuff. We are looking for the best practices that fit our requirements. We are currently using the latest Cloudera VMware image (single node) for our development tests.

The problem is as follows: we have multiple sources in different formats (XML, CSV, etc.), which are dumps of existing systems. As one might expect, there will be an initial "import" of the data into hbase, and afterwards the systems would most likely dump whatever data they have accumulated since the initial import or since the last data dump.
Another thing: we require an intermediary step, so that we can ensure that all of a source's data has been successfully processed before it reaches production. The flow would look like:

XML data file --(MR job)--> intermediate (hbase table or hfile?) --(MR job)--> production tables in hbase

We're guessing we can't use something like a transaction in hbase, so we thought about using an intermediate step: is that how things are normally done?

As we import data into hbase, we will be populating several tables that link data parts together (account X in System 1 == account Y in System 2) as tuples in 3 tables. Currently this is done by a mapreduce job which reads the XML source and uses MultiTableOutputFormat to "put" data into those 3 hbase tables (a simplified sketch of the mapper is in the P.S. below). This method isn't that fast on our test sample (2 minutes for 5 MB), so we are looking at optimizing the loading of data.

We have been researching bulk loading, but we are unsure of a couple of things:

Once we have processed an XML file and populated our 3 "production" hbase tables, could we bulk load another XML file and append the new data to those tables, or would it overwrite what was written before?

In order to bulk load, we need to output a file using HFileOutputFormat. Since MultiHFileOutputFormat doesn't seem to officially exist yet (still in the works, right?), should we process our input XML file with 3 MapReduce jobs instead of 1 and output an hfile for each? Those hfiles could then become our intermediate step: if all 3 hfiles were created without errors, the run was successful and we bulk load them into hbase. (The second sketch in the P.S. shows the per-table flow we have in mind.)

Can you experiment with bulk loading inside a VMware image? We're experiencing problems with the partitions file not being found, with the following exception:

java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)

We also tried another idea to speed things up: what if, instead of doing individual puts, we passed a list of puts to put() (e.g. htable.put(putList))? Internally in hbase, would there be less overhead compared to multiple calls to put()? It seems to be faster; however, since we're not going through context.write, I'm guessing this will lead to problems later on, right? (The third sketch in the P.S. shows what we tried.)

Turning off WAL on puts to speed things up isn't an option, since data loss would be unacceptable, even if the chances of a failure occurring are slim.

Thanks,
David
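
P.S. In case it helps to see what we are describing, here are some simplified sketches. First, our current mapper feeding MultiTableOutputFormat. The table, family and field names are made up for illustration, and the split() stands in for our real XML parsing; with MultiTableOutputFormat, the output key names the table each Put goes to:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlToTablesMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  // Output keys name the target table; MultiTableOutputFormat routes each
  // Put to the table named by its key.
  private static final ImmutableBytesWritable ACCOUNTS =
      new ImmutableBytesWritable(Bytes.toBytes("accounts"));
  private static final ImmutableBytesWritable LINKS =
      new ImmutableBytesWritable(Bytes.toBytes("links"));

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Stand-in for the real XML parsing: fields[0] is the account id,
    // fields[1] the account name, fields[2] the matching id in System 2.
    String[] fields = value.toString().split(",");
    byte[] row = Bytes.toBytes(fields[0]);

    Put accountPut = new Put(row);
    accountPut.add(Bytes.toBytes("d"), Bytes.toBytes("name"),
        Bytes.toBytes(fields[1]));
    context.write(ACCOUNTS, accountPut);

    Put linkPut = new Put(row);
    linkPut.add(Bytes.toBytes("d"), Bytes.toBytes("sys2id"),
        Bytes.toBytes(fields[2]));
    context.write(LINKS, linkPut);
  }
}

(The real job writes to a third table as well; the two here are enough to show the pattern.)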
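
Second, the per-table bulk-load flow we are considering: one job like this per production table. BulkLoadDriver, XmlToPutsMapper, the paths and the table name are placeholders; XmlToPutsMapper would emit (ImmutableBytesWritable row key, Put) pairs for a single table. If we read the code right, configureIncrementalLoad() is what sets up TotalOrderPartitioner and writes the partitions file that our exception above complains about:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "xml-to-hfiles");
    job.setJarByClass(BulkLoadDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(XmlToPutsMapper.class); // placeholder single-table mapper
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path hfileDir = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, hfileDir);

    HTable table = new HTable(conf, "accounts");
    // Sets the reducer, output format and TotalOrderPartitioner, and writes
    // the partitions file from the table's current region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the finished HFiles into the table's regions. As we understand
      // it, this only adds store files, so an earlier load is not overwritten.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
  }
}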
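
Third, the batched-put experiment: we buffer Puts in the mapper and flush them in chunks with htable.put(putList) instead of going through context.write (table name, family and batch size are again illustrative). Our understanding is that the list form groups puts into fewer RPCs per region server, which would explain the speedup we saw:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchedPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private static final int BATCH_SIZE = 1000; // illustrative batch size

  private HTable table;
  private final List<Put> buffer = new ArrayList<Put>();

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
        "accounts"); // illustrative table name
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException {
    // Stand-in for the real XML parsing, as in the first sketch.
    String[] fields = value.toString().split(",");
    Put put = new Put(Bytes.toBytes(fields[0]));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("name"),
        Bytes.toBytes(fields[1]));
    buffer.add(put);
    if (buffer.size() >= BATCH_SIZE) {
      table.put(buffer); // one batched call instead of BATCH_SIZE calls
      buffer.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!buffer.isEmpty()) {
      table.put(buffer); // flush whatever is left
    }
    table.close();
  }
}

Is this the pattern that will bite us later, given that the job's real output never goes through the framework?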