hbase-user mailing list archives

From: Nathan Harkenrider <nathan.harkenri...@gmail.com>
Subject: Re: Data Loss During Bulk Load
Date: Wed, 24 Mar 2010 16:52:04 GMT
Hi St.Ack,

We're fairly sure each row is getting a unique key. We ran a map/reduce job
to generate a large number of keys and verified that we were not generating
duplicates.
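
For concreteness, here is a minimal sketch of the kind of duplicate-key check
described above. The class names and the extractRowKey() step are purely
illustrative stand-ins, not our actual job; the idea is simply to map every
record to its row key and have the reducer emit any key that occurs more than
once, so an empty output means no duplicates.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DuplicateKeyCheck {

  public static class KeyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text rowKey = new Text();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // extractRowKey() stands in for whatever logic derives the HBase row key
      // from an input record (in our case, parsed out of the XML).
      rowKey.set(extractRowKey(record.toString()));
      context.write(rowKey, ONE);
    }

    private String extractRowKey(String record) {
      return record; // placeholder
    }
  }

  public static class DuplicateReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text rowKey, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      if (total > 1) {
        // Only keys seen more than once are written out.
        context.write(rowKey, new LongWritable(total));
      }
    }
  }
}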

We ran an additional job yesterday and observed similar behavior.
Essentially, two nodes are attempting to perform a compaction at the same
time. I observed the following sequence of events in the namenode logs
leading up to the compaction/split failure.

2010-03-23 13:15:04,838 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=root,root,bin,daemon,sys,adm,disk,wheel
ip=/172.16.30.110 cmd=mkdirs src=/hbase/content/compaction.dir/1396343335
dst=null perm=root:supergroup:rwxr-xr-x

2010-03-23 13:15:04,846 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=root,root,bin,daemon,sys,adm,disk,wheel
ip=/172.16.30.110 cmd=create
src=/hbase/content/compaction.dir/1396343335/4274696091024735055 dst=null
perm=root:supergroup:rw-r--r--

2010-03-23 13:15:04,853 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/hbase/content/compaction.dir/1396343335/4274696091024735055.
blk_369387297326832004_532076

2010-03-23 13:15:05,645 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: failed to complete
/hbase/content/compaction.dir/1396343335/2724981063004373124 because
dir.getFileBlocks() is null and pendingFile is null

2010-03-23 13:15:05,645 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 8020, call
complete(/hbase/content/compaction.dir/1396343335/2724981063004373124,
DFSClient_-1458814020) from 172.16.30.102:45672: error: java.io.IOException:
Could not complete write to file
/hbase/content/compaction.dir/1396343335/2724981063004373124 by
DFSClient_-1458814020
org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497)
sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

java.lang.reflect.Method.invoke(Method.java:597)
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:396)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

This sequence of events appears to be related to
https://issues.apache.org/jira/browse/HBASE-2231 (hat tip to Todd Lipcon for
directing us to this JIRA issue). It's still unclear to me why we are tripping
over this particular sequence of events so frequently. Any thoughts? Any
suggestions for mitigating the problem?

Thanks,

Nathan

On Mon, Mar 22, 2010 at 1:37 PM, Stack <stack@duboce.net> wrote:

> For sure each record in the input data is being uploaded with a unique
> key?  For example, if you use the same rowid and column and ask the
> regionserver to supply the timestamp, two cells added with the same
> row+column coordinates can end up with the same
> row/family/qualifier/timestamp key.  When you do your count, you'll
> only see the last instance added.
>
> St.Ack
>
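[A side note for readers of the archive: below is a small sketch of the
overwrite Stack describes, written against the 0.20-era client API. The table
name ("demo"), column coordinates and timestamp value are made up for
illustration. Two Puts to the same row/family/qualifier with the same
timestamp leave only one visible cell, so two inserted records count as one
row.]

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampCollision {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "demo");   // hypothetical table

    byte[] row = Bytes.toBytes("row1");
    byte[] family = Bytes.toBytes("f");
    byte[] qualifier = Bytes.toBytes("q");
    long ts = 1269442504000L;                  // same explicit timestamp for both cells

    Put first = new Put(row);
    first.add(family, qualifier, ts, Bytes.toBytes("value-1"));
    table.put(first);

    Put second = new Put(row);
    second.add(family, qualifier, ts, Bytes.toBytes("value-2"));
    table.put(second);

    // A scan or rowcounter now sees a single cell for row1/f:q -- the second
    // value shadows the first, so two inserts contribute only one counted row.
    table.flushCommits();
  }
}
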
> On Mon, Mar 22, 2010 at 8:15 AM, Nathan Harkenrider
> <nathan.harkenrider@gmail.com> wrote:
> > Thanks Ryan.
> >
> > We currently have the xceiver count set to 16k (not sure if this is too
> > high) and the file handle max (ulimit -n) at 32k, and we are still seeing
> > the data loss issue.
> >
> > I'll dig through the datanode logs for errors and report back.
> >
> > Regards,
> >
> > Nathan
> >
> > On Sun, Mar 21, 2010 at 7:11 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> >
> >> Maybe you are having HDFS capacity issues?  Check your datanode logs
> >> for any exceptions.  While you are at it, double check the xceiver
> >> count is set high (2048 is a good value) and the ulimit -n (fh max) is
> >> also reasonably high - 32k should do it.
> >>
> >> I recently ran an import of 36 hours and perfectly imported 24 billion
> >> rows into 2 tables and the row counts between the tables lined up
> >> exactly.
> >>
> >> PS: one other thing, in your close() method of your map reduce, you
> >> call HTable#flushCommits() right? right?
> >>
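[For anyone skimming the archive: the buffered-write pattern Ryan is asking
about looks roughly like the sketch below, using the old mapred API since he
refers to close(). The mapper class, column family/qualifier and row-key logic
are illustrative; only the "content" table name comes from paths mentioned in
this thread. If autoFlush is off and close() never calls flushCommits(),
whatever is still sitting in the client-side write buffer when the task exits
never reaches the region server.]

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoadMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  public void configure(JobConf job) {
    try {
      table = new HTable(new HBaseConfiguration(job), "content");
      table.setAutoFlush(false); // buffer Puts client-side for throughput
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text record,
      OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    // Row key and cell contents here are placeholders for the real parsing logic.
    Put put = new Put(Bytes.toBytes(record.toString()));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(record.toString()));
    table.put(put);
  }

  @Override
  public void close() throws IOException {
    // The step Ryan is asking about: push whatever is still buffered before
    // the task exits, otherwise those rows never reach the region server.
    table.flushCommits();
  }
}
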
> >> On Sun, Mar 21, 2010 at 3:50 PM, Nathan Harkenrider
> >> <nathan.harkenrider@gmail.com> wrote:
> >> > Hi All,
> >> >
> >> > I'm currently running into data loss issues when bulk loading data into
> >> > HBase. I'm loading data via a Map/Reduce job that is parsing XML and
> >> > inserting rows into 2 HBase tables. The job is currently configured to
> >> > run 30 mappers concurrently (3 per node) and is inserting at a rate of
> >> > approximately 6000 rows/sec. The Map/Reduce job appears to run
> >> > correctly; however, when I run the HBase rowcounter job on the tables
> >> > afterwards the row count is less than expected. The data loss is small
> >> > percentage-wise (~200,000 rows out of 80,000,000) but concerning
> >> > nevertheless.
> >> >
> >> > I managed to locate the following errors in the regionserver logs
> >> > related to failed compactions and/or splits.
> >> > http://pastebin.com/5WjDpS9F
> >> >
> >> > I'm running HBase 0.20.3 and Cloudera CDH2, on CentOS 5.4. The cluster
> >> > is comprised of 11 machines, 1 master and 10 region servers. Each
> >> > machine is 8 cores, 8GB RAM.
> >> >
> >> > Any advice is appreciated. Thanks,
> >> >
> >> > Nathan Harkenrider
> >> > nathan.harkenrider@gmail.com
> >> >
> >>
> >
>
