hive-user mailing list archives

From Abdelrahman Shettia <ashet...@hortonworks.com>
Subject Re: HDFS file system size issue
Date Mon, 14 Apr 2014 21:27:50 GMT
Hi Biswa, 

Are you sure that the replication factor of the files is three? Please run ‘hadoop fsck / -blocks -files -locations’ and check the replication factor of each file. Also, post the configured value of <name>dfs.datanode.du.reserved</name>, and check the real space reported by a DataNode by running ‘du -h’.
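
For reference, a minimal sequence along those lines (the hdfs-site.xml path and the data directory below are just examples; substitute your own config location and dfs.data.dir):

    # Per-file replication factor, block list and block locations
    hadoop fsck / -blocks -files -locations

    # Reserved non-DFS space per volume, as configured in hdfs-site.xml
    grep -A1 'dfs.datanode.du.reserved' /etc/hadoop/conf/hdfs-site.xml

    # Real on-disk usage of a DataNode data directory
    du -sh /data/dfs/data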

Thanks,
Rahman

On Apr 14, 2014, at 2:07 PM, Saumitra <saumitra.official@gmail.com> wrote:

> Hello,
> 
> Biswajit, it looks like we have a confusion in the calculation: 1TB equals 1024GB, not 114GB.
> 
> 
> Sandeep, I checked the log directory sizes as well. The log directories take up only a few GB; I have configured the log4j properties so that the logs won’t grow too large.
> 
> On our slave machines, we have a 450GB disk partition for hadoop logs and DFS. There, the logs directory is < 10GB and the rest of the space is occupied by DFS. A separate 10GB partition is for /.
> 
> Let me quote my confusion point once again:
> 
>> Basically I wanted to point out the discrepancy between the name node status page and hadoop dfs -dus. In my case, the former reports DFS usage as 1TB and the latter reports it as 35GB. What factors can cause this difference? And why is just 35GB of data causing DFS to hit its limits?
> 
> 
> 
> I am talking about the name node status page on port 50070. Here is a screenshot of my name node status page:
> 
> <Screen Shot 2014-04-15 at 2.07.19 am.png>
> 
> As I understand it, ‘DFS used’ is the space taken by DFS, and ‘non-DFS used’ is the space taken by non-DFS data such as logs or other local files from users. The namenode shows DFS used as ~1TB, but hadoop dfs -dus shows it as ~38GB.
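> 
> To put the two numbers side by side from the command line, something like this should work (a minimal sketch; dfsadmin may need to be run as the HDFS superuser):
> 
>     # Logical size of the namespace, before replication (what -dus reports)
>     hadoop dfs -dus /
> 
>     # Cluster-wide and per-node DFS Used / Non DFS Used --
>     # the same figures the port-50070 status page shows
>     hadoop dfsadmin -report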
> 
> 
> 
> On 14-Apr-2014, at 12:33 pm, Sandeep Nemuri <nhsandeep6@gmail.com> wrote:
> 
>> Please check your logs directory usage.
>> 
>> 
>> 
>> On Mon, Apr 14, 2014 at 12:08 PM, Biswajit Nayak <biswajit.nayak@inmobi.com> wrote:
>> What's the replication factor you have? I believe it should be 3. hadoop dfs -dus shows the disk usage without replication, while the name node UI page shows it with replication.
>> 
>> 38GB * 3 = 114GB ~ 1TB
>> 
>> ~Biswa
>> -----oThe important thing is not to stop questioning o-----
>> 
>> 
>> On Mon, Apr 14, 2014 at 9:38 AM, Saumitra <saumitra.official@gmail.com> wrote:
>> Hi Biswajeet,
>> 
>> Non-DFS usage is ~100GB across the cluster, but the numbers are still nowhere near 1TB.
>> 
>> Basically I wanted to point out the discrepancy between the name node status page and hadoop dfs -dus. In my case, the former reports DFS usage as 1TB and the latter reports it as 35GB. What factors can cause this difference? And why is just 35GB of data causing DFS to hit its limits?
>> 
>> 
>> 
>> 
>> On 14-Apr-2014, at 8:31 am, Biswajit Nayak <biswajit.nayak@inmobi.com> wrote:
>> 
>>> Hi Saumitra,
>>> 
>>> Could you please check the non-DFS usage? It also contributes to filling up the disk space.
>>> 
>>> 
>>> 
>>> ~Biswa
>>> -----oThe important thing is not to stop questioning o-----
>>> 
>>> 
>>> On Mon, Apr 14, 2014 at 1:24 AM, Saumitra <saumitra.official@gmail.com> wrote:
>>> Hello,
>>> 
>>> We are running HDFS on a 9-node hadoop cluster; the hadoop version is 1.2.1. We are using the default HDFS block size.
>>> 
>>> We have noticed that the slaves’ disks are almost full. From the name node’s status page (namenode:50070), we can see that the disks of the live nodes are 90% full and that DFS Used on the cluster summary page is ~1TB.
>>> 
>>> However, hadoop dfs -dus / shows that the file system size is merely 38GB. The 38GB number looks correct, because we keep only a few Hive tables and hadoop’s /tmp (distributed cache and job outputs) in HDFS; all other data is cleaned up. I cross-checked this with hadoop dfs -ls. I also think there is no internal fragmentation, because the files in our Hive tables are well-chopped into ~50MB chunks. Here are the last few lines of hadoop fsck / -files -blocks:
>>> 
>>> Status: HEALTHY
>>>  Total size:	38086441332 B
>>>  Total dirs:	232
>>>  Total files:	802
>>>  Total blocks (validated):	796 (avg. block size 47847288 B)
>>>  Minimally replicated blocks:	796 (100.0 %)
>>>  Over-replicated blocks:	0 (0.0 %)
>>>  Under-replicated blocks:	6 (0.75376886 %)
>>>  Mis-replicated blocks:		0 (0.0 %)
>>>  Default replication factor:	2
>>>  Average block replication:	3.0439699
>>>  Corrupt blocks:		0
>>>  Missing replicas:		6 (0.24762692 %)
>>>  Number of data-nodes:		9
>>>  Number of racks:		1
>>> FSCK ended at Sun Apr 13 19:49:23 UTC 2014 in 135 milliseconds
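>>> 
>>> As a sanity check, multiplying the total size by the average block replication from this report gives the raw disk space the namespace should occupy (assuming bc is available):
>>> 
>>>     # logical bytes * average replication, in GB -- roughly 108GB of raw disk
>>>     echo "38086441332 * 3.0439699 / 1024^3" | bc -l
>>> 
>>> which is still nowhere near the ~1TB the status page reports.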
>>> 
>>> 
>>> My question is: why are the slaves’ disks getting full even though there are only a few files in DFS?
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> --Regards
>>   Sandeep Nemuri
> 


