hadoop-hdfs-user mailing list archives

From Hayati Gonultas <hayati.gonul...@gmail.com>
Subject Re: HDFS2 vs MaprFS
Date Sun, 05 Jun 2016 15:05:08 GMT
Another correction about the terminology needs to be made.

I said 1 GB of memory = 1 million blocks. Pay attention to the term
"block": it is not a file. A file may contain more than one block. The
default block size is 64 MB, so a 640 MB file will hold 10 blocks. Each
file has its name, permissions, path, creation date, etc. This metadata is
held in memory for all files but not for blocks. So it is good to have
files with many blocks.
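
As a rough illustration of the block arithmetic above, here is a minimal
Python sketch. The function name blocks_for_file is made up for this
example, and it assumes a file occupies a whole number of blocks, with the
last block possibly only partly filled.

```python
def blocks_for_file(file_size_mb: int, block_size_mb: int = 64) -> int:
    """Number of blocks needed for a file of the given size (ceiling division).

    block_size_mb defaults to the 64 MB block size mentioned in this thread.
    """
    if file_size_mb <= 0:
        return 0  # assumption: an empty file needs no blocks
    # -(-a // b) rounds up without going through floating point
    return -(-file_size_mb // block_size_mb)


print(blocks_for_file(640))  # 10, the example from this message
print(blocks_for_file(650))  # 11, the last block is only partly filled
```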

So in terms of file count, the worst-case scenario is that each file is
contained in only one block, which results in my 1 GB = 1 million files.
Typically files have many blocks, and this count may increase.
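
To put numbers on the rule of thumb, the sketch below applies the "1 GB is
roughly 1 million blocks" figure from this thread. The heap size is a
hypothetical input and the function names are made up for illustration.
Real NameNode memory use depends on more than the block count, so treat the
result as an order-of-magnitude guess only.

```python
# Rule of thumb quoted in this thread, not a measured value.
BLOCKS_PER_GB_HEAP = 1_000_000


def estimate_blocks(heap_gb: float) -> int:
    """Blocks that fit in the given heap under the rule of thumb.

    In the worst case (one block per file) this is also the file count.
    """
    return int(heap_gb * BLOCKS_PER_GB_HEAP)


def addressable_data_tb(heap_gb: float, block_size_mb: int = 64) -> float:
    """Data volume in TB if every block is full (1 TB = 1,000,000 MB)."""
    return estimate_blocks(heap_gb) * block_size_mb / 1_000_000


heap_gb = 128  # hypothetical NameNode heap size in GB
print(estimate_blocks(heap_gb))      # block count under the rule of thumb
print(addressable_data_tb(heap_gb))  # TB of data if every block is a full 64 MB
```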
On 5 Jun 2016 17:33, "Hayati Gonultas" <hayati.gonultas@gmail.com>
wrote:

> My previous post said 128 000 000 million. That was incorrect
> (a million million).
>
> What I meant is 128 million.
>
> 1 GB is roughly 1 million.
> On 5 Jun 2016 16:58, "Ascot Moss" <ascot.moss@gmail.com> wrote:
>
>> Is the HDFS2 "Limit to 50-200 million files" really true, as MapR
>> says?
>>
>> On Sun, Jun 5, 2016 at 7:55 PM, Hayati Gonultas <
>> hayati.gonultas@gmail.com> wrote:
>>
>>> I forgot to mention the file system limit.
>>>
>>> Yes, HDFS has a limit: for performance reasons the HDFS filesystem
>>> image is read from disk into RAM, and the rest of the work is done in
>>> RAM. So RAM should be big enough to fit the filesystem image. But HDFS
>>> has options such as HAR files (Hadoop Archive) to work around these
>>> limitations.
>>>
>>> On Sun, Jun 5, 2016 at 11:14 AM, Ascot Moss <ascot.moss@gmail.com>
>>> wrote:
>>>
>>>> Will the common pool of datanodes and namenode federation be a more
>>>> effective alternative in HDFS2 than multiple clusters?
>>>>
>>>> On Sun, Jun 5, 2016 at 12:19 PM, daemeon reiydelle <daemeonr@gmail.com>
>>>> wrote:
>>>>
>>>>> There are indeed many tuning points here. If the name nodes and
>>>>> journal nodes can be larger, perhaps even bonding multiple 10gbyte nics,
>>>>> one can easily scale. I did have one client where the file counts forced
>>>>> multiple clusters. But we were able to differentiate by airframe types
...
>>>>> eg fixed wing in one, rotary subsonic in another, etc.
>>>>>
>>>>> sent from my mobile
>>>>> Daemeon C.M. Reiydelle
>>>>> USA 415.501.0198
>>>>> London +44.0.20.8144.9872
>>>>> On Jun 4, 2016 2:23 PM, "Gavin Yue" <yue.yuanyuan@gmail.com> wrote:
>>>>>
>>>>>> Here is what I found on Horton website.
>>>>>>
>>>>>>
>>>>>> *Namespace scalability*
>>>>>>
>>>>>> While HDFS cluster storage scales horizontally with the addition of
>>>>>> datanodes, the namespace does not. Currently the namespace can only be
>>>>>> vertically scaled on a single namenode. The namenode stores the entire
>>>>>> file system metadata in memory. This limits the number of blocks,
>>>>>> files, and directories supported on the file system to what can be
>>>>>> accommodated in the memory of a single namenode. A typical large
>>>>>> deployment at Yahoo! includes an HDFS cluster with 2700-4200 datanodes
>>>>>> with 180 million files and blocks, and address ~25 PB of storage. At
>>>>>> Facebook, HDFS has around 2600 nodes, 300 million files and blocks,
>>>>>> addressing up to 60PB of storage. While these are very large systems
>>>>>> and good enough for majority of Hadoop users, a few deployments that
>>>>>> might want to grow even larger could find the namespace scalability
>>>>>> limiting.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jun 4, 2016, at 04:43, Ascot Moss <ascot.moss@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I read some (old?) articles on the Internet about Mapr-FS vs HDFS.
>>>>>>
>>>>>> https://www.mapr.com/products/m5-features/no-namenode-architecture
>>>>>>
>>>>>> It states that HDFS Federation has
>>>>>>
>>>>>> a) "Multiple Single Points of Failure", is it really true?
>>>>>> Why does MapR use HDFS rather than HDFS2 in its comparison? This would
>>>>>> lead to an unfair (or even misleading) comparison. (HDFS was from
>>>>>> Hadoop 1.x, the old generation.) HDFS2 has been available since
>>>>>> 2013-10-15, and there is no Single Point of Failure in HDFS2.
>>>>>>
>>>>>> b) "Limit to 50-200 million files", is it really true?
>>>>>> I have seen so many real-world Hadoop clusters with over 10PB of data,
>>>>>> some even with 150PB. If "Limit to 50-200 million files" were true in
>>>>>> HDFS2, why are there so many production Hadoop clusters in the real
>>>>>> world? How can they manage the issue of "Limit to 50-200 million
>>>>>> files"? For instance, Facebook's "Like" implementation runs on HBase
>>>>>> at Web Scale. I can imagine HBase generates a huge number of files in
>>>>>> Facebook's Hadoop cluster; the number of files in Facebook's Hadoop
>>>>>> cluster should be much, much bigger than 50-200 million.
>>>>>>
>>>>>> From my point of view, in contrast, MaprFS should have a true
>>>>>> limitation of up to 1T files while HDFS2 can handle truly unlimited
>>>>>> files; please do correct me if I am wrong.
>>>>>>
>>>>>> c) "Performance Bottleneck", again, is it really true?
>>>>>> MaprFS does not have a namenode in order to gain file system
>>>>>> performance. Without a namenode, MaprFS would lose Data Locality,
>>>>>> which is one of the beauties of Hadoop. If Data Locality is no longer
>>>>>> available, any big data application running on MaprFS might gain some
>>>>>> file system performance, but it would totally lose the true
>>>>>> performance gain from Data Locality provided by Hadoop's namenode
>>>>>> (gain small, lose big).
>>>>>> d) "Commercial NAS required"
>>>>>> Is there any wiki/blog/discussion about Commercial NAS on Hadoop
>>>>>> Federation?
>>>>>>
>>>>>> regards
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>> Hayati Gonultas
>>>
>>
>>
