hadoop-common-user mailing list archives

From jagaran das <jagaran_...@yahoo.co.in>
Subject Re: Namenode Scalability
Date Wed, 10 Aug 2011 23:15:58 GMT
What would cause the name node to have a GC issue?

- I am opening at most 5000 connections and writing continuously through those 5000
connections to 5000 files at a time.
      - The volume of data written through the 5000 connections cannot be controlled,
as it depends on the upstream applications that publish data.

Now if heap usage nears the full heap size (let's say M GB) and a major GC cycle kicks
in, the NameNode could stop responding for some time.
This "stop the world" time should be directly proportional to the heap size.
This may cause data to be backlogged in the streaming application's memory.
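
To put a rough number on that backlog, here is a minimal sketch. It assumes the "1GBPS" publishing rate mentioned in this thread means 1 gigabit per second (it could also mean gigabytes), and the pause lengths are purely hypothetical, not measured NameNode values.

```python
# Rough estimate of how much data piles up in the streaming application
# while the NameNode is unresponsive. All inputs are assumptions.

def backlog_bytes(publish_rate_bytes_per_s: float, pause_s: float) -> float:
    """Bytes that accumulate upstream during a pause of pause_s seconds."""
    return publish_rate_bytes_per_s * pause_s

# Assumption: "1GBPS" read as 1 gigabit per second = 125,000,000 bytes/s.
PUBLISH_RATE = 1e9 / 8

if __name__ == "__main__":
    for pause in (1, 10, 60):  # hypothetical stop-the-world pauses, in seconds
        mb = backlog_bytes(PUBLISH_RATE, pause) / 1e6
        print(f"{pause:>3} s pause -> {mb:,.0f} MB buffered upstream")
```

The backlog grows linearly with the pause length, so the streaming application needs either enough memory for the longest pause it expects or a way to slow down the producers.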

As for our architecture:

We have a cluster of JMS queues and a multithreaded application that picks messages
from the queue and streams them to the NameNode of a 20-node cluster
using the FileSystem API.
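
One way to keep a stall from exhausting memory in such a consumer is a bounded buffer between the queue readers and the writers. The sketch below is only an illustration in plain Python: `run_pipeline`, `sink`, and the buffer size are hypothetical names and values, and `sink` stands in for whatever actually writes to the cluster. It is not our application code and does not use the Hadoop API.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


def run_pipeline(messages, sink, buffer_size=4, workers=2):
    """Feed messages to sink through a bounded buffer.

    When the writers fall behind (for example during a long pause on the
    storage side), buf.put() blocks, so the producer slows down instead of
    growing an unbounded in-memory backlog.
    """
    buf = queue.Queue(maxsize=buffer_size)

    def worker():
        while True:
            item = buf.get()
            if item is _STOP:
                break
            sink(item)  # stand-in for the real write call

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for m in messages:
        buf.put(m)  # blocks while the buffer is full (back-pressure)
    for _ in threads:
        buf.put(_STOP)  # one sentinel per worker
    for t in threads:
        t.join()


if __name__ == "__main__":
    written = []
    run_pipeline(range(20), written.append)
    print(len(written), "messages written")
```

With a JMS source, blocking the reader like this leaves unread messages on the broker rather than in the application heap, which is usually the safer place for them.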

BTW, in the real world, if you have a fast car you can race and win against a slow train; it all
depends on which reference frame you are in :)


From: Michel Segel <michael_segel@hotmail.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Cc: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>; jagaran das <jagaran_das@yahoo.co.in>
Sent: Wednesday, 10 August 2011 11:26 AM
Subject: Re: Namenode Scalability

So many questions, why stop there?

First question... What would cause the name node to have a GC issue?
Second question... You're streaming 1PB a day. Is this a single stream of data?
Are you writing this to one file before processing, or are you processing the data directly
on the ingestion stream?
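
As a rough cross-check of the two figures quoted in this thread (1 PB per day and 1GBPS), the arithmetic can be sketched as follows. It assumes decimal units (1 PB = 10^15 bytes) and shows both readings of "GBPS", since the thread does not say whether it means gigabits or gigabytes per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
PB = 10**15                     # assumption: decimal petabyte


def sustained_rate(bytes_per_day: float) -> float:
    """Average bytes per second needed to move bytes_per_day in one day."""
    return bytes_per_day / SECONDS_PER_DAY


def bytes_per_day_at(bits_per_s: float) -> float:
    """Bytes moved in one day at a constant rate of bits_per_s."""
    return bits_per_s / 8 * SECONDS_PER_DAY


if __name__ == "__main__":
    rate = sustained_rate(PB)
    print(f"1 PB/day needs about {rate / 1e9:.2f} GB/s "
          f"({rate * 8 / 1e9:.1f} Gbit/s) sustained")
    print(f"1 Gbit/s for a day moves {bytes_per_day_at(1e9) / 1e12:.1f} TB")
    print(f"1 GByte/s for a day moves {bytes_per_day_at(8e9) / 1e12:.1f} TB")
```

Under either reading, a single 1GBPS publisher falls well short of 1 PB per day, so the two figures presumably describe different things (for example, a per-application rate versus a total volume).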

Are you also filtering the data so that you are not saving all of the data?

This sounds more like a homework assignment than a real world problem.

I guess people don't race cars against trains or have two trains traveling in different directions
anymore... :-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Aug 10, 2011, at 12:07 PM, jagaran das <jagaran_das@yahoo.co.in> wrote:

> To be precise, the projected data is around 1 PB.
> But the publishing rate is also around 1GBPS.
> Please suggest.
> ________________________________
> From: jagaran das <jagaran_das@yahoo.co.in>
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Sent: Wednesday, 10 August 2011 12:58 AM
> Subject: Namenode Scalability
> In my current project we are planning to stream data to the Namenode (20 Node Cluster).
> Data Volume would be around 1 PB per day.
> But there are applications which can publish data at 1GBPS.
> Few queries:
> 1. Can a single Namenode handle such high speed writes? Or does it become unresponsive when
> the GC cycle kicks in?
> 2. Can we have multiple federated Name nodes sharing the same slaves, so that we can
> distribute the writes accordingly?
> 3. Can multiple region servers of HBase help us?
> Please suggest how we can design the streaming part to handle such scale of data.
> Regards,
> Jagaran Das 