hadoop-user mailing list archives

From Aaron Eng <a...@maprtech.com>
Subject Re: HDFS2 vs MaprFS
Date Mon, 06 Jun 2016 16:45:10 GMT
As others have answered, the number of blocks/files/directories that can be
addressed by a NameNode is limited by the amount of heap space available to
the NameNode JVM.  If you need more background on this topic, I'd suggest
reviewing materials from the Hadoop JIRA and from the vendors that supply and
support HDFS.

For instance, this JIRA:
https://issues.apache.org/jira/browse/HADOOP-1687

Or, for instance, Cloudera discusses this topic:
http://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html

I don't intend to speak for Cloudera (obviously), but you can see on that
page:

> Cloudera recommends 1 GB of NameNode heap space per million blocks to
> account for the namespace objects
>

So, do you have >200GB of memory to give to the NameNode JVM? And do you
want to do that?  If yes, then you could probably address more than 200
million blocks.
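
To put rough numbers on that rule of thumb, here is a back-of-the-envelope
sketch (Java, purely illustrative; the block count and the 1 GB per million
blocks figure are assumptions taken from the discussion above, not measured
values):

    // Estimate NameNode heap from block count using the rule of thumb above.
    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long blocks = 200_000_000L;       // hypothetical block count
            double gbPerMillionBlocks = 1.0;  // vendor rule of thumb quoted above
            double heapGb = (blocks / 1_000_000.0) * gbPerMillionBlocks;
            // The resulting value (~200 GB) would typically be applied as -Xmx
            // via HADOOP_NAMENODE_OPTS in hadoop-env.sh.
            System.out.printf("Suggested NameNode heap: ~%.0f GB%n", heapGb);
        }
    }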

On Mon, Jun 6, 2016 at 9:35 AM, Ascot Moss <ascot.moss@gmail.com> wrote:

> Hi Aaron, from the MapR site, [now HDFS2] "Limit to 50-200 million files":
> is it really true?
>
> On Tue, Jun 7, 2016 at 12:09 AM, Aaron Eng <aeng@maprtech.com> wrote:
>
>> As I said, MapRFS has topologies.  You assign a volume (which is mounted
>> at a directory path) to a topology and in turn all the data for the volume
>> (e.g. under the directory) is stored on the storage hardware assigned to
>> the topology.
>>
>> These topological labels provide the same benefits as dfs.storage.policy
>> as well as enabling additional types of use cases.
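>>
>> For reference, the HDFS side of that comparison would look roughly like the
>> sketch below (a minimal, illustrative example; the cluster URI and the
>> /archive path are hypothetical, and setStoragePolicy corresponds to the
>> `hdfs storagepolicies -setStoragePolicy` CLI command):
>>
>>     import java.net.URI;
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.fs.FileSystem;
>>     import org.apache.hadoop.fs.Path;
>>     import org.apache.hadoop.hdfs.DistributedFileSystem;
>>
>>     public class ColdPolicyExample {
>>         public static void main(String[] args) throws Exception {
>>             Configuration conf = new Configuration();
>>             // Hypothetical HA nameservice URI; adjust for your cluster.
>>             DistributedFileSystem dfs = (DistributedFileSystem)
>>                     FileSystem.get(URI.create("hdfs://nameservice1"), conf);
>>             // Apply the built-in COLD policy to a directory (Hadoop 2.6+).
>>             dfs.setStoragePolicy(new Path("/archive"), "COLD");
>>             dfs.close();
>>         }
>>     }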
>>
>> On Mon, Jun 6, 2016 at 9:02 AM, Ascot Moss <ascot.moss@gmail.com> wrote:
>>
>>> In HDFS2, I can find "dfs.storage.policy"; for instance, HDFS2 allows you
>>> to *apply the COLD storage policy to a directory*.
>>> Where are these features in MapR-FS?
>>>
>>> On Mon, Jun 6, 2016 at 11:43 PM, Aaron Eng <aeng@maprtech.com> wrote:
>>>
>>>> >Since MapR  is proprietary, I find that it has many compatibility
>>>> issues in Apache open source projects
>>>>
>>>> This is faulty logic. And rather than saying it has "many compatibility
>>>> issues", perhaps you can describe one.
>>>>
>>>> Both MapRFS and HDFS are accessible through the same API.  The backend
>>>> implementations are what differs.
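>>>>
>>>> As a minimal illustration of "same API" (the URIs below are hypothetical,
>>>> and the maprfs:// scheme needs the MapR client libraries on the classpath):
>>>>
>>>>     import java.net.URI;
>>>>     import org.apache.hadoop.conf.Configuration;
>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>     import org.apache.hadoop.fs.Path;
>>>>
>>>>     public class SameApiExample {
>>>>         public static void main(String[] args) throws Exception {
>>>>             Configuration conf = new Configuration();
>>>>             // Identical client code; only the file system URI differs.
>>>>             FileSystem hdfs = FileSystem.get(URI.create("hdfs://nameservice1/"), conf);
>>>>             FileSystem mfs  = FileSystem.get(URI.create("maprfs:///"), conf);
>>>>             System.out.println(hdfs.getFileStatus(new Path("/")).isDirectory());
>>>>             System.out.println(mfs.getFileStatus(new Path("/")).isDirectory());
>>>>         }
>>>>     }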
>>>>
>>>> >Hadoop has a built-in storage policy named COLD, where is it in
>>>> Mapr-FS?
>>>>
>>>> Long before HDFS had storage policies, MapRFS had topologies.  You can
>>>> restrict particular types of storage to a topology and then assign a volume
>>>> (subset of data stored in MapRFS) to the topology, and hence the data in
>>>> that subset would be served by whatever hardware was mapped into the
>>>> topology.
>>>>
>>>> >not to mention that MapR-FS loses Data-Locality.
>>>>
>>>> This statement is false.
>>>>
>>>>
>>>>
>>>> On Mon, Jun 6, 2016 at 8:32 AM, Ascot Moss <ascot.moss@gmail.com>
>>>> wrote:
>>>>
>>>>> Since MapR is proprietary, I find that it has many compatibility
>>>>> issues in Apache open source projects, or, even worse, it loses Hadoop's
>>>>> features.  For instance, Hadoop has a built-in storage policy named COLD;
>>>>> where is it in MapR-FS? Not to mention that MapR-FS loses Data-Locality.
>>>>>
>>>>> On Mon, Jun 6, 2016 at 11:26 PM, Ascot Moss <ascot.moss@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I don't think HDFS2 needs a SAN; using the QuorumJournal approach is
>>>>>> much better than using the shared-edits-directory SAN approach.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Monday, June 6, 2016, Peyman Mohajerian <mohajeri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It is very common practice to back up the metadata to some SAN store,
>>>>>>> so complete loss of all the metadata is preventable. You could lose a
>>>>>>> day's worth of data if, for example, you back up the metadata once a
>>>>>>> day, but you could do it more frequently. I'm not saying S3 or Azure
>>>>>>> Blob are bad ideas.
>>>>>>>
>>>>>>> On Sun, Jun 5, 2016 at 8:19 AM, Marcin Tustin <mtustin@handybook.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The namenode architecture is a source of fragility in HDFS. While a
>>>>>>>> high-availability deployment (with two namenodes and a failover
>>>>>>>> mechanism) means you're unlikely to see a service interruption, it is
>>>>>>>> still possible to have a complete loss of filesystem metadata with the
>>>>>>>> loss of two machines.
>>>>>>>>
>>>>>>>> Secondly, because HDFS identifies datanodes by their hostname/IP,
>>>>>>>> DNS changes can cause havoc with HDFS (see my war story on this here:
>>>>>>>> https://medium.com/handy-tech/renaming-hdfs-datanodes-considered-terribly-harmful-2bc2f37aabab
>>>>>>>> ).
>>>>>>>>
>>>>>>>> Also, the namenode/datanode architecture probably does contribute to
>>>>>>>> the small files problem being a problem. That said, there are a lot of
>>>>>>>> practical solutions for the small files problem.
>>>>>>>>
>>>>>>>> If you're just setting up a data infrastructure, I would say consider
>>>>>>>> alternatives before you pick HDFS. If you run in AWS, S3 is a good
>>>>>>>> alternative. If you run in some other cloud, it's probably worth
>>>>>>>> considering whatever their equivalent storage system is.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Jun 4, 2016 at 7:43 AM, Ascot Moss <ascot.moss@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I read some (old?) articles from the Internet about MapR-FS vs HDFS.
>>>>>>>>>
>>>>>>>>> https://www.mapr.com/products/m5-features/no-namenode-architecture
>>>>>>>>>
>>>>>>>>> It states that HDFS Federation has
>>>>>>>>>
>>>>>>>>> a) "Multiple Single Points of Failure", is it really
true?
>>>>>>>>> Why MapR uses HDFS but not HDFS2 in its comparison as
this would
>>>>>>>>> lead to an unfair comparison (or even misleading comparison)?
 (HDFS was
>>>>>>>>> from Hadoop 1.x, the old generation) HDFS2 is available
since 2013-10-15,
>>>>>>>>> there is no any Single Points of  Failure in HDFS2.
>>>>>>>>>
>>>>>>>>> b) "Limit to 50-200 million files", is it really true?
>>>>>>>>> I have seen so many real world Hadoop Clusters with over
10PB
>>>>>>>>> data, some even with 150PB data.  If "Limit to 50 -200
millions files" were
>>>>>>>>> true in HDFS2, why are there so many production Hadoop
clusters in real
>>>>>>>>> world? how can they mange well the issue of  "Limit to
50-200 million
>>>>>>>>> files"? For instances,  the Facebook's "Like" implementation
runs on HBase
>>>>>>>>> at Web Scale, I can image HBase generates huge number
of files in Facbook's
>>>>>>>>> Hadoop cluster, the number of files in Facebook's Hadoop
cluster should be
>>>>>>>>> much much bigger than 50-200 million.
>>>>>>>>>
>>>>>>>>> From my point of view, in contrast, it is MapR-FS that has a true
>>>>>>>>> limit (up to 1T files), while HDFS2 can handle an unlimited number of
>>>>>>>>> files. Please do correct me if I am wrong.
>>>>>>>>>
>>>>>>>>> c) "Performance Bottleneck", again, is it really true?
>>>>>>>>> MaprFS does not have namenode in order to gain file system
>>>>>>>>> performance. If without Namenode, MaprFS would lose Data
Locality which is
>>>>>>>>> one of the beauties of Hadoop  If Data Locality is no
longer available, any
>>>>>>>>> big data application running on MaprFS might gain some
file system
>>>>>>>>> performance but it would totally lose the true gain of
performance from
>>>>>>>>> Data Locality provided by Hadoop's namenode (gain small
lose big)
>>>>>>>>>
>>>>>>>>> d) "Commercial NAS required"
>>>>>>>>> Is there any wiki/blog/discussion about commercial NAS with HDFS
>>>>>>>>> Federation?
>>>>>>>>>
>>>>>>>>> regards
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
