hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: long write operations and data recovery
Date Fri, 29 Feb 2008 21:54:35 GMT
I would agree with Ted. You should easily be able to get 100MBps write
throughput on a standard Netapp box (with read bandwidth left over -
since the peak write throughput rating is more than twice that). Even
at an average write throughput rate of 50MBps - the daily data volume
would be (drumroll ..) 4+TB! 
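For anyone who wants to check the figure, here is a minimal sketch of the arithmetic. It assumes decimal units (1 TB = 1,000,000 MB) and a full 86,400-second day; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Data written in one day at a sustained rate, in decimal terabytes."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


print(daily_volume_tb(50))   # 4.32
print(daily_volume_tb(100))  # 8.64
```

So a sustained 50MBps average works out to a bit over 4TB per day, which is where the "4+TB" above comes from.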

So buffer to a decent box and copy stuff over ..

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Friday, February 29, 2008 11:33 AM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery

Unless your volume is MUCH higher than ours, I think you can get by with
a relatively small farm of log consolidators that collect and concatenate

If each log line is 100 bytes after compression (that is huge really) and
you have 10,000 events per second (also pretty danged high) then you are
only writing 1MB/s.  If you need a day of buffering (~100,000 seconds),
you need 100GB of buffer storage.  These are very, very moderate
requirements for your ingestion point.
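The sizing estimate above can be sketched in a few lines. This is a rough illustration only: the helper names are hypothetical, units are decimal (1 MB = 1,000,000 bytes, 1 GB = 1,000 MB), and the 100,000-second figure is a day rounded up from 86,400 seconds.

```python
def ingest_rate_mb_per_s(bytes_per_event: float, events_per_s: float) -> float:
    """Sustained write rate in MB/s for a given event size and event rate."""
    return bytes_per_event * events_per_s / 1_000_000


def buffer_size_gb(rate_mb_per_s: float, buffer_seconds: float) -> float:
    """Storage needed to hold buffer_seconds of writes at the given rate, in GB."""
    return rate_mb_per_s * buffer_seconds / 1_000


rate = ingest_rate_mb_per_s(bytes_per_event=100, events_per_s=10_000)
print(rate)                           # 1.0 MB/s
print(buffer_size_gb(rate, 100_000))  # 100.0 GB, using the rounded-up day
print(buffer_size_gb(rate, 86_400))   # 86.4 GB, using an exact day
```

Plugging in your own event size and rate shows how far the ingestion point is from needing anything exotic.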

On 2/29/08 11:18 AM, "Steve Sapovits" <ssapovits@invitemedia.com> wrote:

> Ted Dunning wrote:
>> In our case, we looked at the problem and decided that Hadoop wasn't
>> feasible for our real-time needs in any case.  There were several
>> issues,
>> - first of all, map-reduce itself didn't seem very plausible for
>> real-time applications.  That left hbase and hdfs as the capabilities
>> offered by hadoop (for real-time stuff)
> We'll be using map-reduce batch mode, so we're okay there.
>> The upshot is that we use hadoop extensively for batch operations
>> where it really shines.  The other nice effect is that we don't have
>> to worry all that much about HA (at least not real-time HA) since we
>> don't do real-time with hadoop.
> What I'm struggling with is the write side of things.  We'll have a
> amount of data to write that's essentially a log format.  It would
> seem that writing that outside of HDFS then trying to batch import it would
> be a losing battle -- that you would need the distributed nature of
> to do very large volume writes directly and wouldn't easily be able to
> some other flat storage model and feed it in as a secondary step
> having the HDFS side start to lag behind.
> The realization is that Name Node could go down so we'll have to have
> backup store that might be used during temporary outages, but that
> most of the writes would be direct HDFS updates.
> The alternative would seem to be to end up with a set of distributed
> without some unifying distributed file system (e.g., like lots of
> web logs on many many individual boxes) and then have to come up with
> some way to funnel those back into HDFS.
