From hadoopman <hadoop...@gmail.com>
Subject hadoop/hive data loading
Date Tue, 10 May 2011 17:56:19 GMT
When we load data into Hive, we sometimes run into situations where 
the load fails and the logs show a heap out-of-memory error.  If I load 
just a few days (or months) of data, there's no problem.  But if I try 
to load two years of data (for example), I've seen it fail.  Not 
with every feed, but with certain ones.

Sometimes I've been able to split the data and get it to load.  An 
example of one type of feed I'm working on is the Apache web server 
access logs.  Generally it works, but there are times when I need to 
load more than a few months of data and get heap memory errors in 
the task logs.
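
To make "split the data" concrete, here is a minimal sketch of cutting a 
long date range into one-month batches that could be loaded one at a 
time.  The function name month_batches and the choice of a calendar 
month as the batch size are assumptions for illustration only, not a 
description of the actual load process.

```python
from datetime import date, timedelta


def month_batches(start, end):
    """Yield (first_day, last_day) pairs covering start..end, one per calendar month.

    Hypothetical helper: the monthly batch size is an assumption.
    """
    current = start
    while current <= end:
        # Work out the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # A batch ends on the last day of its month, or on `end` if that comes first.
        last_day = min(next_month - timedelta(days=1), end)
        yield current, last_day
        current = next_month


if __name__ == "__main__":
    # Two years of data, split into batches that can be loaded separately.
    for first, last in month_batches(date(2009, 5, 10), date(2011, 5, 10)):
        print(first, last)
```

Each pair could then drive one smaller load instead of a single two-year 
load.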

Generally, how do people load their data into Hive?  We have a process 
where we first copy it to HDFS, then run a staging process to get it 
into Hive.  Once that completes, we perform a UNION ALL and then 
overwrite the table partition.  It's usually during the UNION ALL stage 
that we see these errors appear.
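
For reference, the union-then-overwrite step has roughly the shape 
sketched below, written as a small Python helper that builds the HiveQL 
text for a single partition.  Every table name, column name, and the 
partition column here is a hypothetical placeholder, and the sketch 
assumes the staging table carries the partition value as an ordinary 
column; the real schema may differ.

```python
def build_merge_statement(target, partition_col, partition_val, columns, staging):
    """Return HiveQL that rewrites one partition from existing plus staged rows.

    All names passed in are placeholders; this only assembles a string.
    """
    cols = ", ".join(columns)
    where = f"{partition_col} = '{partition_val}'"
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({where})\n"
        f"SELECT {cols} FROM (\n"
        # Rows already in the target partition.
        f"  SELECT {cols} FROM {target} WHERE {where}\n"
        f"  UNION ALL\n"
        # Newly staged rows for the same partition.
        f"  SELECT {cols} FROM {staging} WHERE {where}\n"
        f") merged"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_merge_statement(
        "access_logs", "month", "2011-04",
        ["host", "request_time", "request", "status"],
        "access_logs_staging",
    ))
```

Building one statement per partition keeps each UNION ALL limited to a 
single partition's worth of data rather than the whole date range.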

Also, is there a log that tells you which log file it fails on?  I can 
see which task/job failed, but I can't find which file it's complaining 
about.  I figure that might help a bit.


