hadoop-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Improving MR job disk IO
Date Tue, 15 Oct 2013 03:17:04 GMT
There are a few reasons to use map/reduce, or just map-only or 
reduce-only jobs.
1) You want to run parallel algorithms where data from multiple machines 
has to be cross-checked. Map/Reduce allows this.
2) You want to run several instances of a job. Hadoop does this reliably 
by monitoring all instances, restarting failed ones, etc.
3) You have way too much data to fit on one computer. Same as #2.

You might not need Hadoop if you can run your programs without it.
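
On the quoted question below about files with no newline characters, here is a minimal Python sketch of line-at-a-time matching. It is only an illustration, not the Hadoop Grep code or /bin/grep; the function name grep_lines and the sample strings are made up for the example. It shows why input without EOL markers comes back as a single match covering the whole file.

```python
import re
from typing import List


def grep_lines(data: str, pattern: str) -> List[str]:
    """Return every line of `data` that contains a match for `pattern`."""
    regex = re.compile(pattern)
    # splitlines() breaks the input on EOL markers. If there are none,
    # the whole input is treated as one single "line".
    return [line for line in data.splitlines() if regex.search(line)]


# Hypothetical sample data: the same records with and without newlines.
with_newlines = "alpha error one\nbeta ok\ngamma error two\n"
without_newlines = with_newlines.replace("\n", " ")

# With EOL markers, only the matching records are returned.
print(grep_lines(with_newlines, "error"))

# Without EOL markers, the single "line" that matches is the whole input.
print(len(grep_lines(without_newlines, "error")))
```

If the sequence file values contain no newlines, a line-oriented matcher would behave like the second call.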


On 10/14/2013 08:02 PM, Xuri Nagarin wrote:
> Yes, I tested with smaller data sets and the MR job correctly 
> reads/matches one line at a time.
> On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX <dsuiter@rdx.com> wrote:
>     So, perhaps this has been thought of, but perhaps not.
>     It is my understanding that grep usually processes input one
>     line at a time. I am currently experimenting with Avro, and I am
>     finding that local grep does not handle it well at all, because
>     the file is essentially one long line. Run against a local Avro
>     file, grep does poorly at pattern matching: it just returns the
>     whole file as a match. The file also takes a long time to open
>     in vi, since there are no EOL markers.
>     If you have modified for sequence file, are you reading a sequence
>     file that has newline characters? If not, perhaps the file is
>     being read as one whole line, causing some unexpected effects.
>     Thanks,
>     *Devin Suiter*
>     Jr. Data Solutions Software Engineer
>     100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>     Google Voice: 412-256-8556 | www.rdx.com
>     On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <secsubs@gmail.com> wrote:
>         On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota
>         <pradeepg26@gmail.com> wrote:
>             I don't think it necessarily means that the job is a bad
>             candidate for MR. It's a different type of a workload.
>             Hortonworks has a great article on the different types of
>             workloads you might see and how that affects your
>             provisioning choices at
>             http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
>         One statement that stood out to me in the link above is "For
>         these reasons, Hortonworks recommends that you either use the
>         Balanced workload configuration or invest in a pilot Hadoop
>         cluster and plan to evolve as you analyze the workload
>         patterns in your environment."
>         Now, this is not a critique/concern of HW but rather of
>         hadoop. Well, what if my workloads can be both CPU and IO
>         intensive? Do I take the approach of
>         throw-enough-excess-hardware-just-in-case?
>             I have not looked at the Grep code so I'm not sure why
>             it's behaving the way it is. Still curious that streaming
>             has a higher IO throughput and lower CPU usage. It may
>             have to do with the fact that /bin/grep is a native
>             implementation and Grep (Hadoop) is probably using Java
>             Pattern/Matcher api.
>         The Grep code is from the bundled examples in CDH. I made one
>         line modification for it to read Sequence files. The streaming
>         job probably does not have lower CPU utilization but I see
>         that it does even out the CPU utilization among all the
>         available processors. I guess the native grep binary threads
>         better than the java MR job?
>         Which brings me to ask - If you have the mapper/reducer
>         functionality built into a platform specific binary, then
>         won't it always be more efficient than a java MR job? And, in
>         such cases, am I better off with streaming than Java MR?
>         Thanks for your responses.
>             On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin
>             <secsubs@gmail.com> wrote:
>                 Thanks Pradeep. Does it mean this job is a bad
>                 candidate for MR?
>                 Interestingly, running the cmdline '/bin/grep' under a
>                 streaming job provides (1) Much better disk throughput
>                 and, (2) CPU load is almost evenly spread across all
>                 cores/threads (no CPU gets pegged to 100%).
>                 On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota
>                 <pradeepg26@gmail.com> wrote:
>                     Actually... I believe that is expected behavior.
>                     Since your CPU is pegged at 100% you're not going
>                     to be IO bound. Typically jobs tend to be CPU
>                     bound or IO bound. If you're CPU bound you expect
>                     to see low IO throughput. If you're IO bound, you
>                     expect to see low CPU usage.
>                     On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin
>                     <secsubs@gmail.com> wrote:
>                         Hi,
>                         I have a simple Grep job (from bundled
>                         examples) that I am running on an 11-node
>                         cluster. Each node has 2 x 8-core Intel Xeons
>                         (shows 32 CPUs with HT on), 64GB RAM and 8 x
>                         1TB disks. I have mappers set to 20 per node.
>                         When I run the Grep job, I notice that CPU
>                         gets pegged to 100% on multiple cores but disk
>                         throughput remains a dismal 1-2 Mbytes/sec on
>                         a single disk on each node. So I guess the
>                         cluster is performing poorly in terms of disk
>                         IO. Running Terasort, I see each disk puts out
>                         25-35 Mbytes/sec with a total cluster
>                         throughput of above 1.5 Gbytes/sec.
>                         How do I go about re-configuring or re-writing
>                         the job to utilize maximum disk IO?
>                         TIA,
>                         Xuri
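
For scale, the figures in the original question can be put side by side with some rough arithmetic. This is only a back-of-the-envelope sketch under stated assumptions: all 8 disks on every node are busy at once during Terasort, the Grep figure is read as one active disk per node, and 1 Gbyte is taken as 1000 Mbytes. The helper name cluster_mb_per_sec is made up for the example.

```python
# Cluster shape taken from the original question.
NODES = 11
DISKS_PER_NODE = 8


def cluster_mb_per_sec(per_disk_mb_s: int, active_disks_per_node: int) -> int:
    """Aggregate throughput if every active disk sustains per_disk_mb_s."""
    return NODES * active_disks_per_node * per_disk_mb_s


# Terasort: 25-35 Mbytes/sec per disk, assuming all 8 disks are busy.
terasort = (cluster_mb_per_sec(25, DISKS_PER_NODE),
            cluster_mb_per_sec(35, DISKS_PER_NODE))

# Grep: 1-2 Mbytes/sec, assuming a single active disk on each node.
grep = (cluster_mb_per_sec(1, 1), cluster_mb_per_sec(2, 1))

print("Terasort upper bound (Mbytes/sec):", terasort)
print("Grep estimate (Mbytes/sec):", grep)
```

Under these assumptions the Terasort ceiling is 2200 to 3080 Mbytes/sec, which is consistent with the reported "above 1.5 Gbytes/sec", while the Grep job moves roughly 11 to 22 Mbytes/sec across the whole cluster.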
