hadoop-common-user mailing list archives

From TCK <moonwatcher32...@yahoo.com>
Subject Batch processing with Hadoop -- does HDFS scale for parallel reads?
Date Wed, 04 Feb 2009 17:51:28 GMT

Hey guys, 

We have been using Hadoop for batch processing of logs. The logs are written to and stored
on a NAS. Our Hadoop cluster periodically copies a batch of new logs from the NAS, via NFS,
into HDFS, processes them, and copies the output back to the NAS. HDFS is cleaned
up at the end of each batch (i.e., everything in it is deleted).
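
To make the batch cycle concrete, here is a rough sketch of the four steps as `hadoop` command lines built from Python. All paths, the jar name, and the job class are hypothetical placeholders, and it assumes the job takes its input and output directories as its two arguments; it is an illustration of the workflow, not the actual script.

```python
import subprocess


def batch_cycle_commands(nas_in, nas_out, hdfs_in, hdfs_out, job_jar, job_class):
    """Return the commands for one batch, in the order they would run.

    Every argument is a hypothetical placeholder supplied by the caller.
    """
    return [
        # 1. Copy the new logs from the NFS-mounted NAS into HDFS.
        ["hadoop", "fs", "-put", nas_in, hdfs_in],
        # 2. Run the processing job (assumed to take input and output dirs).
        ["hadoop", "jar", job_jar, job_class, hdfs_in, hdfs_out],
        # 3. Copy the job output back to the NAS.
        ["hadoop", "fs", "-get", hdfs_out, nas_out],
        # 4. Clean up HDFS so the next batch starts empty.
        ["hadoop", "fs", "-rmr", hdfs_in],
        ["hadoop", "fs", "-rmr", hdfs_out],
    ]


def run_batch(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_batch(batch_cycle_commands(...))` with the default `dry_run=True` only prints the commands, which is a safe way to check the sequence before pointing it at a real cluster.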

The problem is that reads off the NAS via NFS don't scale, even when we add more threads
to the copying process to read in parallel.
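
For illustration, the multi-threaded copy is along the lines of the following minimal sketch. The function name, the directory arguments, and the thread count are assumptions made for the example, and it copies to a local directory rather than into HDFS to stay self-contained.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_logs_parallel(src_dir, dst_dir, num_threads=8):
    """Copy every regular file in src_dir to dst_dir using a thread pool.

    Returns (number of files copied, total bytes copied).
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.iterdir() if p.is_file())

    def copy_one(path):
        # Each worker thread reads one file from the source mount.
        target = dst / path.name
        shutil.copyfile(path, target)
        return target.stat().st_size

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sizes = list(pool.map(copy_one, files))
    return len(files), sum(sizes)
```

With `src_dir` on an NFS mount, every thread still reads through the same NAS server, so raising `num_threads` only helps until that single server or its network link saturates.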

If we stored the log files on an HDFS cluster instead of the NAS, it seems the
reads would scale, since the data could be read from multiple data nodes at the same time without
any contention (except for network I/O, which shouldn't be a problem).

I would appreciate it if anyone could share similar experience with doing parallel
reads from an HDFS cluster used for storage.

Also, is it a good idea to have one HDFS cluster for storage and a separate one for the batch processing?

Best Regards,
