hadoop-common-user mailing list archives

From: Ted Dunning <tdunn...@veoh.com>
Subject: Re: long write operations and data recovery
Date: Fri, 29 Feb 2008 19:33:03 GMT

Unless your volume is MUCH higher than ours, I think you can get by with a
relatively small farm of log consolidators that collect and concatenate
files.
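
(Purely as an illustration -- the class name, paths, and the
concatenate-into-one-big-file approach are assumptions for the sketch, not
code we actually run -- the inner loop of such a consolidator can be little
more than the stock FileSystem API:)

  import java.io.FileInputStream;
  import java.io.InputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  // Hypothetical consolidator: concatenate a batch of local log files into
  // one large HDFS file, so the namenode sees a few big files rather than
  // millions of tiny ones.
  public class LogConsolidator {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path dest = new Path(args[0]);            // e.g. /logs/2008-02-29/batch-0001
      FSDataOutputStream out = fs.create(dest);
      try {
        for (int i = 1; i < args.length; i++) { // remaining args: local log files
          InputStream in = new FileInputStream(args[i]);
          try {
            IOUtils.copyBytes(in, out, conf, false);  // append this file's bytes
          } finally {
            in.close();
          }
        }
      } finally {
        out.close();
      }
    }
  }

A handful of boxes running something like that is all I mean by a small farm.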

If each log line is 100 bytes after compression (that is huge, really) and
you have 10,000 events per second (also pretty danged high), then you are
only writing 1 MB/s.  If you need a day of buffering (roughly 100,000
seconds), then you need 100 GB of buffer storage.  These are very, very
moderate requirements for your ingestion point.
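
(Spelled out, in case it helps to sanity-check your own numbers -- the
constants are just the assumptions above, and a day is really 86,400 seconds,
rounded up here:)

  // Back-of-envelope sizing for the ingestion point, using the figures above.
  public class IngestSizing {
    public static void main(String[] args) {
      long bytesPerEvent   = 100;      // compressed log line (generous)
      long eventsPerSecond = 10000;    // peak event rate (also generous)
      long bufferSeconds   = 100000;   // ~1 day (86,400 s), rounded up

      long bytesPerSecond = bytesPerEvent * eventsPerSecond; // 1,000,000 B/s ~ 1 MB/s
      long bufferBytes    = bytesPerSecond * bufferSeconds;  // 1e11 B ~ 100 GB

      System.out.println("ingest rate  : ~" + bytesPerSecond / 1000000 + " MB/s");
      System.out.println("day of buffer: ~" + bufferBytes / 1000000000L + " GB");
    }
  }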


On 2/29/08 11:18 AM, "Steve Sapovits" <ssapovits@invitemedia.com> wrote:

> Ted Dunning wrote:
> 
>> In our case, we looked at the problem and decided that Hadoop wasn't
>> feasible for our real-time needs in any case.  There were several
>> issues:
>> 
>> - first of all, map-reduce itself didn't seem very plausible for
>> real-time applications.  That left HBase and HDFS as the capabilities
>> offered by Hadoop (for real-time stuff).
> 
> We'll be using map-reduce in batch mode, so we're okay there.
> 
>> The upshot is that we use hadoop extensively for batch operations
>> where it really shines.  The other nice effect is that we don't have
>> to worry all that much about HA (at least not real-time HA) since we
>> don't do real-time with hadoop.
> 
> What I'm struggling with is the write side of things.  We'll have a huge
> amount of data to write that's essentially in a log format.  It would seem
> that writing it outside of HDFS and then trying to batch-import it would
> be a losing battle -- that you would need the distributed nature of HDFS
> to do very large-volume writes directly, and wouldn't easily be able to
> take some other flat storage model and feed it in as a secondary step
> without having the HDFS side start to lag behind.
> 
> The realization is that the NameNode could go down, so we'll have to have
> a backup store that can be used during temporary outages, but most of the
> writes would be direct HDFS updates.
> 
> The alternative would seem to be ending up with a set of distributed files
> without a unifying distributed file system (e.g., lots of Apache web logs
> on many, many individual boxes) and then having to come up with some way
> to funnel those back into HDFS.
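
On that last point, the funneling step itself doesn't have to be elaborate.
As a minimal sketch (the class name and path handling are made up for
illustration), pushing already-rotated local log files into HDFS as a batch
can be as simple as:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Hypothetical "funnel" step: ship rotated local log files into HDFS as a
  // batch instead of writing to HDFS on the hot path.
  public class LogFunnel {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path hdfsDir = new Path(args[0]);         // e.g. /logs/incoming/host-01
      for (int i = 1; i < args.length; i++) {   // remaining args: rotated local logs
        // delSrc=false so a failed copy never costs you the local file
        fs.copyFromLocalFile(false, new Path(args[i]), hdfsDir);
      }
    }
  }

The interesting part is the bookkeeping around it (what has shipped, what to
do while the namenode is down), not the copy itself.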

