hadoop-mapreduce-user mailing list archives

From Xuri Nagarin <secs...@gmail.com>
Subject Re: Improving MR job disk IO
Date Tue, 15 Oct 2013 03:02:03 GMT
Yes, I tested with smaller data sets and the MR job correctly reads/matches
one line at a time.
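
For anyone following the thread, here is a minimal sketch of why record boundaries matter for a grep-style job. It is not the Hadoop Grep example or the streaming job discussed here; `grep_lines` is a hypothetical helper, and the sample strings are made up. It only illustrates that a line-oriented matcher returns the whole input when the input has no EOL markers.

```python
import io
import re


def grep_lines(stream, pattern):
    """Return the records in `stream` that contain a match for `pattern`.

    A "record" here is whatever the stream yields on iteration, i.e. one
    newline-delimited line (hypothetical helper, for illustration only).
    """
    regex = re.compile(pattern)
    return [line.rstrip("\n") for line in stream if regex.search(line)]


# Newline-delimited input: only the matching line comes back.
with_newlines = io.StringIO("alpha\nbeta error\ngamma\n")
print(grep_lines(with_newlines, "error"))     # ['beta error']

# Same words without EOL markers: the whole blob is one record,
# so a single hit returns everything.
without_newlines = io.StringIO("alpha beta error gamma")
print(grep_lines(without_newlines, "error"))  # ['alpha beta error gamma']
```

A streaming mapper built around a line-oriented tool sees its input the same way, one line per record on stdin, so a similar check on a small sample is a quick way to confirm that records are being split as expected.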




On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX <dsuiter@rdx.com> wrote:

> So, perhaps this has been thought of, but perhaps not.
>
> It is my understanding that grep usually matches one line at a time. As I
> am currently experimenting with Avro, I am finding that the local grep
> does not handle it well at all, because the file is essentially one long
> line. Working on a local Avro file, grep does poorly at pattern matching:
> it just returns the whole file as a match. The file also takes a long time
> to open in the vi editor, since there are no EOL markers.
>
> If you have modified for sequence file, are you reading a sequence file
> that has newline characters? If not, perhaps the file is being read as one
> whole line, causing some unexpected effects.
>
> Thanks,
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <secsubs@gmail.com> wrote:
>
>> On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>>
>>> I don't think it necessarily means that the job is a bad candidate for
>>> MR. It's a different type of a workload. Hortonworks has a great article on
>>> the different types of workloads you might see and how that affects your
>>> provisioning choices at
>>> http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
>>>
>>
>> One statement that stood out to me in the link above is "For these
>> reasons, Hortonworks recommends that you either use the Balanced workload
>> configuration or invest in a pilot Hadoop cluster and plan to evolve as you
>> analyze the workload patterns in your environment."
>>
>> Now, this is not a critique/concern of HW but rather of hadoop. Well,
>> what if my workloads can be both CPU and IO intensive? Do I take the
>> approach of throw-enough-excess-hardware-just-in-case?
>>
>>
>>>
>>> I have not looked at the Grep code so I'm not sure why it's behaving the
>>> way it is. Still curious that streaming has a higher IO throughput and
>>> lower CPU usage. It may have to do with the fact that /bin/grep is a native
>>> implementation and Grep (Hadoop) is probably using Java Pattern/Matcher api.
>>>
>>
>> The Grep code is from the bundled examples in CDH. I made a one-line
>> modification for it to read Sequence files. The streaming job probably does
>> not have lower CPU utilization, but I see that it does even out the CPU
>> utilization among all the available processors. I guess the native grep
>> binary threads better than the Java MR job?
>>
>> Which brings me to ask - If you have the mapper/reducer functionality
>> built into a platform specific binary, then won't it always be more
>> efficient than a java MR job? And, in such cases, am I better off with
>> streaming than Java MR?
>>
>> Thanks for your responses.
>>
>>
>>
>>
>>>
>>>
>>> On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <secsubs@gmail.com> wrote:
>>>
>>>> Thanks Pradeep. Does it mean this job is a bad candidate for MR?
>>>>
>>>> Interestingly, running the cmdline '/bin/grep' under a streaming job
>>>> gives (1) much better disk throughput and (2) CPU load that is almost
>>>> evenly spread across all cores/threads (no CPU gets pegged at 100%).
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <
>>>> pradeepg26@gmail.com> wrote:
>>>>
>>>>> Actually... I believe that is expected behavior. Since your CPU is
>>>>> pegged at 100% you're not going to be IO bound. Typically jobs tend to be
>>>>> CPU bound or IO bound. If you're CPU bound you expect to see low IO
>>>>> throughput. If you're IO bound, you expect to see low CPU usage.
>>>>>
>>>>>
>>>>> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <secsubs@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a simple Grep job (from bundled examples) that I am running on
>>>>>> an 11-node cluster. Each node has 2 x 8-core Intel Xeons (shows 32 CPUs
>>>>>> with HT on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per
>>>>>> node.
>>>>>>
>>>>>> When I run the Grep job, I notice that CPU gets pegged to 100% on
>>>>>> multiple cores but disk throughput remains a dismal 1-2 Mbytes/sec on a
>>>>>> single disk on each node. So I guess the cluster is performing poorly in
>>>>>> terms of disk IO. Running Terasort, I see each disk puts out 25-35
>>>>>> Mbytes/sec with a total cluster throughput of above 1.5 Gbytes/sec.
>>>>>>
>>>>>> How do I go about re-configuring or re-writing the job to utilize
>>>>>> maximum disk IO?
>>>>>>
>>>>>> TIA,
>>>>>>
>>>>>> Xuri
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
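
For reference, a back-of-envelope sketch of the throughput figures from the original post. The node count, disks per node, and per-disk rates are taken from that post; the helper `cluster_mb_per_sec` is hypothetical, and the results are rough bounds under two assumptions: that only one disk per node is active during Grep, and that all eight disks are busy at once during Terasort.

```python
NODES = 11           # from the original post
DISKS_PER_NODE = 8   # from the original post


def cluster_mb_per_sec(per_disk_mb_per_sec, active_disks_per_node):
    """Aggregate throughput if every node drives the given number of disks
    at the given per-disk rate (hypothetical helper, rough estimate only)."""
    return per_disk_mb_per_sec * active_disks_per_node * NODES


# Grep: 1-2 Mbytes/sec on a single disk on each node.
grep_range = (cluster_mb_per_sec(1, 1), cluster_mb_per_sec(2, 1))

# Terasort: 25-35 Mbytes/sec per disk, assuming all disks are busy at once.
terasort_range = (
    cluster_mb_per_sec(25, DISKS_PER_NODE),
    cluster_mb_per_sec(35, DISKS_PER_NODE),
)

print(grep_range)      # (11, 22)
print(terasort_range)  # (2200, 3080)
```

Under those assumptions the Grep job reads on the order of 11-22 Mbytes/sec cluster-wide, against a Terasort ceiling of roughly 2.2-3.1 Gbytes/sec, which is consistent with the reported total of above 1.5 Gbytes/sec.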
