hadoop-hdfs-user mailing list archives

From Hayati Gonultas <hayati.gonul...@gmail.com>
Subject Re: HDFS2 vs MaprFS
Date Sun, 05 Jun 2016 11:50:57 GMT

In most cases I think one cluster is enough. HDFS is a file system, and with
federation you can have multiple namenodes serving different mount points. For
example, you could mount /images/facebook on namenode1 and /images/instagram
on namenode2, similar to Linux file system mounts. Done that way, you hardly
ever need a second cluster. I do not know much about inter-namenode read/write
requests, by the way.
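
For what it's worth, here is a rough sketch of how such a mount table could be
wired up with ViewFS. It sets the standard fs.viewfs.mounttable.* properties
programmatically just for illustration; the cluster name "cluster1" and the
hosts namenode1/namenode2 are placeholders, and in practice these properties
would live in core-site.xml:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Clients address the federated namespace through the viewfs:// scheme.
    conf.set("fs.defaultFS", "viewfs://cluster1");
    // Each link maps a client-visible path to the namespace owned by one namenode.
    conf.set("fs.viewfs.mounttable.cluster1.link./images/facebook",
        "hdfs://namenode1:8020/images/facebook");
    conf.set("fs.viewfs.mounttable.cluster1.link./images/instagram",
        "hdfs://namenode2:8020/images/instagram");

    // The client sees a single namespace; ViewFS routes each path to its namenode.
    FileSystem fs = FileSystem.get(URI.create("viewfs://cluster1/"), conf);
    System.out.println(fs.exists(new Path("/images/facebook")));
  }
}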

Additionally, having a namenode is good for performance. HDFS 2.x supports
SSDs and other storage types, which can be combined with caching, and many
other configuration options arrived with HDFS 2.x as well.
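
As a rough illustration of the storage-type and caching features (an untested
sketch: it assumes an HDFS 2.6+ cluster whose datanode volumes are tagged as
SSD, and the pool name "hot-images" is made up), both can be driven from the
Java client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class StorageTierSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // Pin hot data onto SSD-backed volumes via a storage policy.
    dfs.setStoragePolicy(new Path("/images/hot"), "ALL_SSD");

    // Keep frequently read files in the datanodes' off-heap cache
    // (centralized cache management).
    dfs.addCachePool(new CachePoolInfo("hot-images"));
    dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
        .setPath(new Path("/images/hot"))
        .setPool("hot-images")
        .build());
  }
}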

Last but not least, the namenode hardware should be a redundant server:
redundant power supplies, RAID and other redundancy options are good for
namenode hardware, in contrast to datanodes, whose hardware is typically
commodity and has no RAID. So NAS is not required for HDFS. Only the
filesystem image and the edit log are stored on disk; the rest of the
namenode's work happens in RAM. It is also recommended to store a backup of
the filesystem image in a safe location (for example an NFS mount), which can
also be configured. So using NAS for reliability (to store the filesystem
image/edit logs) does not make much sense, because in the end the rest of the
work is done in RAM, and if you back up your filesystem image and your
hardware is reliable enough (RAID, redundant power supplies, multiple NICs
etc.), then SAN/NAS is not required at all, unless your filesystem image is
too big to fit on a single server. (The filesystem image is similar to the
system tables of traditional ext3/ext4/fat32/ntfs filesystems; it holds
metadata, so it should fit on a single "good enough" server in most cases.)
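
The "can also be configured" part above is just the namenode directory list.
Here is a minimal sketch with hypothetical paths (the NFS mount being the
off-box copy), written as code for illustration although these properties
normally live in hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;

public class NameDirSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The namenode writes the fsimage to every directory in this comma-separated list.
    conf.set("dfs.namenode.name.dir",
        "file:///data/1/dfs/nn,file:///mnt/nfs/dfs/nn");
    // Edit log directories can be made redundant the same way
    // (or handed to JournalNodes when HA is enabled).
    conf.set("dfs.namenode.edits.dir",
        "file:///data/1/dfs/nn/edits,file:///mnt/nfs/dfs/nn/edits");
    System.out.println(conf.get("dfs.namenode.name.dir"));
  }
}

A commonly cited rule of thumb is roughly 1 GB of namenode heap per million
blocks, so the metadata really does fit on one well-provisioned server in most
cases.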

On Sun, Jun 5, 2016 at 11:14 AM, Ascot Moss <ascot.moss@gmail.com> wrote:

> Will the common pool of datanodes and namenode federation be a more
> effective alternative in HDFS2 than multiple clusters?
> On Sun, Jun 5, 2016 at 12:19 PM, daemeon reiydelle <daemeonr@gmail.com>
> wrote:
>> There are indeed many tuning points here. If the name nodes and journal
>> nodes can be larger, perhaps even bonding multiple 10 GbE NICs, one can
>> easily scale. I did have one client where the file counts forced multiple
>> clusters. But we were able to differentiate by airframe types ... e.g. fixed
>> wing in one, rotary subsonic in another, etc.
>> sent from my mobile
>> Daemeon C.M. Reiydelle
>> USA 415.501.0198
>> London +
>> On Jun 4, 2016 2:23 PM, "Gavin Yue" <yue.yuanyuan@gmail.com> wrote:
>>> Here is what I found on the Hortonworks website.
>>> *Namespace scalability*
>>> While HDFS cluster storage scales horizontally with the addition of
>>> datanodes, the namespace does not. Currently the namespace can only be
>>> vertically scaled on a single namenode.  The namenode stores the entire
>>> file system metadata in memory. This limits the number of blocks, files,
>>> and directories supported on the file system to what can be accommodated in
>>> the memory of a single namenode. A typical large deployment at Yahoo!
>>> includes an HDFS cluster with 2700-4200 datanodes with 180 million
>>> files and blocks, and addresses ~25 PB of storage.  At Facebook, HDFS has
>>> around 2600 nodes, 300 million files and blocks, addressing up to 60PB of
>>> storage. While these are very large systems and good enough for the majority of
>>> Hadoop users, a few deployments that might want to grow even larger could
>>> find the namespace scalability limiting.
>>> On Jun 4, 2016, at 04:43, Ascot Moss <ascot.moss@gmail.com> wrote:
>>> Hi,
>>> I read some (old?) articles from the Internet about Mapr-FS vs HDFS.
>>> https://www.mapr.com/products/m5-features/no-namenode-architecture
>>> It states that HDFS Federation has
>>> a) "Multiple Single Points of Failure", is it really true?
>>> Why does MapR use HDFS rather than HDFS2 in its comparison? This makes
>>> for an unfair (or even misleading) comparison. (HDFS was from Hadoop 1.x,
>>> the old generation.) HDFS2 has been available since 2013-10-15, and there
>>> is no Single Point of Failure in HDFS2.
>>> b) "Limit to 50-200 million files", is it really true?
>>> I have seen so many real-world Hadoop clusters with over 10 PB of data,
>>> some even with 150 PB.  If "Limit to 50-200 million files" were true in
>>> HDFS2, why are there so many production Hadoop clusters in the real world?
>>> How do they manage the supposed "Limit to 50-200 million files"? For
>>> instance, Facebook's "Like" implementation runs on HBase at web scale; I
>>> can imagine HBase generates a huge number of files in Facebook's Hadoop
>>> cluster, so the number of files there should be much, much bigger than
>>> 50-200 million.
>>> From my point of view it is the other way around: MaprFS has a hard limit
>>> of up to 1T files, while HDFS2 can handle a practically unlimited number of
>>> files; please do correct me if I am wrong.
>>> c) "Performance Bottleneck", again, is it really true?
>>> MaprFS drops the namenode in order to gain file system performance.
>>> Without a namenode, though, MaprFS would lose Data Locality, which is one
>>> of the beauties of Hadoop. If Data Locality is no longer available, a big
>>> data application running on MaprFS might gain some file system performance,
>>> but it would lose the much larger performance gain from the Data Locality
>>> provided by Hadoop's namenode (gain small, lose big).
>>> d) "Commercial NAS required"
>>> Is there any wiki/blog/discussion about Commercial NAS on Hadoop
>>> Federation?
>>> regards

Hayati Gonultas
