hadoop-common-user mailing list archives

From "Nick Bailey" <ni...@mailtrust.com>
Subject Re: Hadoop dfs usage and actual size discrepancy
Date Thu, 10 Dec 2009 00:15:36 GMT
That seems to make sense.  Our collection process for putting data into Hadoop fairly
regularly sees a 'Hadoop exited with: 65280' error, where it fails to put a file and
queues it to be retried.
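
For what it's worth, 65280 is 0xFF00, which looks more like a raw wait() status than a
plain exit code; if that guess is right, the underlying command actually exited with 255:

  $ echo $((65280 >> 8))    # strip the low status byte; prints 255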

Our logs don't seem to show that happening often enough to account for an extra 25TB of data,
but we've had a few drive/machine failures recently that could possibly explain it.

Thanks for the info!

-Nick


-----Original Message-----
From: "Todd Lipcon" <todd@cloudera.com>
Sent: Wednesday, December 9, 2009 7:02pm
To: common-user@hadoop.apache.org
Cc: core-user@hadoop.apache.org
Subject: Re: Hadoop dfs usage and actual size discrepancy

Hi Nick,

My guess is that the tmp/ directories of the DNs were rather full. I've
occasionally seen this on clusters where writes have been failing.

There should be some kind of thread that garbage collects partial blocks
from the DNs' tmp dirs, but it's not implemented, as far as I know. This
comment is in FSDataset.java:

  // REMIND - mjc - eventually we should have a timeout system
  // in place to clean up block files left by abandoned clients.
  // We should have some timer in place, so that if a blockfile
  // is created but non-valid, and has been idle for >48 hours,
  // we can GC it safely.

This comment is from April 2007 ;-)

I'll file a JIRA to consider implementing this.
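
In the meantime, a rough way to check whether stale partial blocks are what's eating
the space is to look for old files under each datanode's tmp directory. A sketch only:
the /data/hadoop/dfs/data path below is a stand-in for whatever dfs.data.dir is set to
on your nodes, and the cutoff mirrors the 48 hours in the comment above. Inspect the
output before deleting anything.

  # list files under the DN tmp dir that haven't been modified in ~48+ hours
  find /data/hadoop/dfs/data/tmp -type f -mtime +1 -exec ls -lh {} \;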

Thanks
-Todd

On Wed, Dec 9, 2009 at 3:57 PM, Nick Bailey <nickb@mailtrust.com> wrote:

> Actually looks like restarting has helped.  DFS used has gone down to 43TB
> from 50TB and appears to still be going down.
>
> Don't know what was wrong with the DataNode process.  Possibly a Cloudera
> problem.  Thanks for the help, Brian.
>
> -Nick
>
>
>
> -----Original Message-----
> From: "Nick Bailey" <nickb@mailtrust.com>
> Sent: Wednesday, December 9, 2009 5:55pm
> To: common-user@hadoop.apache.org
> Cc: common-user@hadoop.apache.org, common-user@hadoop.apache.org,
> core-user@hadoop.apache.org
> Subject: Re: Hadoop dfs usage and actual size discrepancy
>
> One interesting thing is the output of the command to restart the datanode.
>
> $ sudo service hadoop-datanode restart
> Stopping Hadoop datanode daemon (hadoop-datanode): no datanode to stop
>                                                           [  OK  ]
> Starting Hadoop datanode daemon (hadoop-datanode): starting datanode,
> logging to /log/location
>                                                           [  OK  ]
>
> Notice that when stopping the datanode it says 'no datanode to stop'.  It says
> this even though the datanode is definitely running.  Also, there is only one
> datanode process, and it isn't getting stopped by this command, so I didn't
> actually restart anything.  I checked, and at least a few of the other nodes
> are also exhibiting this behavior.
>
> I don't know if it's related, but after killing the process and actually
> restarting the datanode, it still doesn't appear to be clearing out any
> extra data.  I'll manually restart the datanodes by killing the processes for
> now and see if that helps.
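>
> In case it helps anyone else, the manual restart is roughly the following (a sketch;
> jps ships with the JDK, <pid> is whatever it reports for the DataNode, and the service
> name may differ depending on how the Cloudera packages are set up). The init script
> presumably goes by a pid file, which would explain 'no datanode to stop' if that file
> is stale or missing.
>
>   $ jps | grep DataNode                   # find the DataNode JVM's pid
>   $ kill <pid>                            # stop it by hand
>   $ sudo service hadoop-datanode start    # then bring it back up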
>
> -Nick
>
>
> -----Original Message-----
> From: "Nick Bailey" <nickb@mailtrust.com>
> Sent: Wednesday, December 9, 2009 5:44pm
> To: common-user@hadoop.apache.org
> Cc: common-user@hadoop.apache.org, core-user@hadoop.apache.org
> Subject: Re: Hadoop dfs usage and actual size discrepancy
>
> Well, for that specific machine du pretty much matches the report.  Not all
> of our nodes are at 4.11TB; that one is actually overloaded, and we are
> currently running the balancer to fix it.
>
> Restarting the datanode on that machine didn't seem to clear out any data.
> I'll probably go ahead and restart all the datanodes, but I'm not hopeful
> that it will clear out all the extra data.
>
> Thanks for helping out though. Any other ideas out there?
>
> -Nick
>
> -----Original Message-----
> From: "Brian Bockelman" <bbockelm@cse.unl.edu>
> Sent: Wednesday, December 9, 2009 4:57pm
> To: common-user@hadoop.apache.org
> Cc: core-user@hadoop.apache.org
> Subject: Re: Hadoop dfs usage and actual size discrepancy
>
> Hey Nick,
>
> Non-DFS Used must be something new in 19.x, I guess.
>
> What happens if you do "du -hs" on the datanode directory?  Are they all
> approximately 4.11TB?  What happens after you restart a datanode?  Does it
> clean out a bunch of data?
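>
> Something like this on one of the nodes, summed over whatever dfs.data.dir points at
> (the path here is just a placeholder):
>
>   du -hs /data/hadoop/dfs/data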
>
> Never seen this locally, and we beat the bejesus out of our cluster...
>
> Brian
>
> On Dec 9, 2009, at 10:54 PM, Nick Bailey wrote:
>
> > Brian,
> >
> > Hadoop version 18.3, more specifically Cloudera's version.  Our dfsadmin
> > -report doesn't contain any lines with "Non DFS Used", so that grep won't
> > work. Here is an example of the report for one of the nodes:
> >
> >
> > Name: XXXXXXXXXXXXX
> > State          : In Service
> > Total raw bytes: 4919829360640 (4.47 TB)
> > Remaining raw bytes: 108009550121 (100.59 GB)
> > Used raw bytes: 4520811248473 (4.11 TB)
> > % used: 91.89%
> > Last contact: Wed Dec 09 16:50:10 EST 2009
> >
> > Besides what I already posted, the rest of the report is just a repeat of
> > that for every node.
> >
> > Nick
> >
> > -----Original Message-----
> > From: "Brian Bockelman" <bbockelm@cse.unl.edu>
> > Sent: Wednesday, December 9, 2009 4:48pm
> > To: common-user@hadoop.apache.org
> > Cc: core-user@hadoop.apache.org
> > Subject: Re: Hadoop dfs usage and actual size discrepancy
> >
> > Hey Nick,
> >
> > What's the output of this:
> >
> > hadoop dfsadmin -report | grep "Non DFS Used" | grep -v "0 KB" | awk
> > '{sum += $4} END {print sum}'
> >
> > What version of Hadoop is this?
> >
> > Brian
> >
> > On Dec 9, 2009, at 10:25 PM, Nick Bailey wrote:
> >
> >> Output from bottom of fsck report:
> >>
> >> Total size:    8711239576255 B (Total open files size: 3571494 B)
> >> Total dirs:    391731
> >> Total files:   2612976 (Files currently being written: 3)
> >> Total blocks (validated):      2274747 (avg. block size 3829542 B)
> >> (Total open file blocks (not validated): 1)
> >> Minimally replicated blocks:   2274747 (100.0 %)
> >> Over-replicated blocks:        75491 (3.3186548 %)
> >> Under-replicated blocks:       36945 (1.6241367 %)
> >> Mis-replicated blocks:         0 (0.0 %)
> >> Default replication factor:    3
> >> Average block replication:     3.017153
> >> Corrupt blocks:                0
> >> Missing replicas:              36945 (0.53830105 %)
> >> Number of data-nodes:          25
> >> Number of racks:               1
> >>
> >>
> >>
> >> Output from top of dfsadmin -report:
> >>
> >> Total raw bytes: 110689488793600 (100.67 TB)
> >> Remaining raw bytes: 46994184353977 (42.74 TB)
> >> Used raw bytes: 55511654282643 (50.49 TB)
> >> % used: 50.15%
> >>
> >> Total effective bytes: 0 (0 KB)
> >> Effective replication multiplier: Infinity
> >>
> >>
> >> Not sure what the last two lines of the dfsadmin report mean, but we
> >> have a negligible amount of over-replicated blocks according to fsck.  The
> >> rest of the dfsadmin report confirms what the web interface says, in that the
> >> nodes have way more data than 8.6TB * 3.
> >>
> >> Thoughts?
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: "Brian Bockelman" <bbockelm@cse.unl.edu>
> >> Sent: Wednesday, December 9, 2009 3:35pm
> >> To: common-user@hadoop.apache.org
> >> Cc: core-user@hadoop.apache.org
> >> Subject: Re: Hadoop dfs usage and actual size discrepancy
> >>
> >> Hey Nick,
> >>
> >> Try:
> >>
> >> hadoop fsck /
> >> hadoop dfsadmin -report
> >>
> >> Should give you information about, for example, the non-HDFS data and
> >> the average replication factor.
> >>
> >> Or is this how you determined you had a replication factor of 3?
> >>
> >> Brian
> >>
> >> On Dec 9, 2009, at 9:33 PM, Nick Bailey wrote:
> >>
> >>> We have a Hadoop cluster with a 100TB capacity, and according to the
> >>> dfs web interface we are using 50% of our capacity (50TB).  However, doing
> >>> 'hadoop fs -dus /' says the total size of everything is about 8.6TB.
> >>> Everything has a replication factor of 3, so we should only be using around
> >>> 26TB of our cluster.
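> >>> (Roughly: 8.6TB * 3 = ~26TB of expected raw usage, versus the ~50TB the interface
> >>> reports, which leaves about 24TB unaccounted for.)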
> >>>
> >>> I've verified the replication factors, and I've also checked the
> >>> datanode machines to see if something non-Hadoop-related is accidentally
> >>> being stored on the drives Hadoop is using for storage, but nothing is.
> >>>
> >>> Has anyone had a similar problem and have any debugging suggestions?
> >>>
> >>> Thanks,
> >>> Nick Bailey
> >>>
> >>
> >>
> >
> >
>


