hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3999) Need to add host capabilites / abilities
Date Mon, 13 Oct 2008 13:00:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639040#action_12639040 ]

Steve Loughran commented on HADOOP-3999:
----------------------------------------

1. This would be good if it could be easily extended: rather than a hard-coded set of
values, clients could add other (key, value) info for schedulers to use, such as expected-availability
for cycle-scavenging task-trackers, and other extensions that custom schedulers could use.
It could also integrate with diagnostics.
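
As a rough sketch of what such an extensible (key, value) set might look like: the class
below is hypothetical (NodeCapabilities is not an existing Hadoop class), and the key names
other than expected-availability are made up for illustration. The point is only that a
scheduler can test for keys it understands and ignore the rest.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for free-form node capabilities; not part of Hadoop. */
final class NodeCapabilities {
    private final Map<String, String> values = new HashMap<>();

    /** Any client can publish any key; there is no fixed schema. */
    NodeCapabilities put(String key, String value) {
        values.put(key, value);
        return this;
    }

    /** Returns the reported value, or the fallback if the node never reported the key. */
    String get(String key, String fallback) {
        return values.getOrDefault(key, fallback);
    }

    /** True when the node reports every required (key, value) pair. */
    boolean satisfies(Map<String, String> required) {
        for (Map.Entry<String, String> e : required.entrySet()) {
            if (!e.getValue().equals(values.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** Read-only view, e.g. for dumping into diagnostics output. */
    Map<String, String> asMap() {
        return Collections.unmodifiableMap(values);
    }
}
```

A custom scheduler could then call something like
node.get("expected-availability", "1.0") without the core needing to know that key exists.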

2. There's a danger here in trying to do a full grid scheduler. Why danger? It is hard to get
right, and there are other tools and products that can do a lot of this. Hadoop likes to push
work near the data and works best if the work is all Java.

3. Developers are surprisingly bad at estimating workload, especially if you have a few
layers between you and the MR jobs. The best metric for how long/CPU-intensive/IO-intensive
a job will be is "what it was like last time".
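
To make the "last time" idea concrete, here is a minimal sketch that remembers the most
recent runtime per job and returns it as the estimate. LastRunEstimator and the notion of a
jobKey are assumptions for illustration only; nothing here is an existing Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical estimator: the estimate for a job is simply its last observed runtime. */
final class LastRunEstimator {
    private final Map<String, Long> lastRuntimeMillis = new HashMap<>();

    /** Records a completed run, overwriting any earlier observation for the same job. */
    void record(String jobKey, long runtimeMillis) {
        lastRuntimeMillis.put(jobKey, runtimeMillis);
    }

    /** Returns the last observed runtime, or empty if the job has never been seen. */
    OptionalLong estimate(String jobKey) {
        Long last = lastRuntimeMillis.get(jobKey);
        return last == null ? OptionalLong.empty() : OptionalLong.of(last);
    }
}
```

The same shape would work for CPU or IO counters instead of wall-clock time; a job with no
history yields an empty estimate rather than a developer's guess.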

> Need to add host capabilites / abilities
> ----------------------------------------
>
>                 Key: HADOOP-3999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3999
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>         Environment: Any
>            Reporter: Kai Mosebach
>
> The MapReduce paradigm is limited to running MapReduce jobs with the lowest common denominator
> of all nodes in the cluster.
> On the one hand this is wanted (cloud computing: throw simple jobs in, never mind who
> does it).
> On the other hand this limits the possibilities quite a lot. For instance, if you
> had data which could or needs to be fed to a 3rd-party interface like MATLAB, R, or Bioconductor,
> you could solve a lot more jobs via Hadoop.
> Furthermore, it could be interesting to know about the OS, the architecture, and the performance
> of the node in relation to the rest of the cluster (performance ranking).
> E.g. if I knew about a sub-cluster of nodes with very high compute performance or a sub-cluster
> of very fast disk-IO nodes, the job tracker could select these nodes according to a so-called
> job profile (e.g. my job is a heavy computing job / heavy disk-IO job), which can usually
> be estimated by a developer beforehand.
> To achieve this, node capabilities could be introduced and stored in the DFS, giving
> you
> a1.) basic information about each node (OS, ARCH)
> a2.) more sophisticated info (additional software, path to software, version)
> a3.) KPIs collected about the node (disk IO, CPU power, memory)
> a4.) network throughput to neighbor hosts, which might allow generating a network performance
> map of the cluster
> This would allow you to
> b1.) generate jobs that have a profile (compute-intensive, disk-IO-intensive, net-IO-intensive)
> b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with
> MATLAB only)
> b3.) generate a performance map of the cluster (sub-clusters of fast-disk nodes, sub-clusters
> of fast-CPU nodes, network-speed-relation map between nodes)
> From step b3.) you could then even acquire statistical information which could again be
> fed into the DFS Namenode to see if we could store data on fast-disk sub-clusters only (that
> might need to be a tool outside of Hadoop core, though).
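
For reference, the job-profile selection described in the quoted issue could be sketched
roughly as below: given per-node metric scores, rank the nodes for whichever metric the
job profile cares about. ProfileRanker, the metric names and the "higher score is better"
convention are all assumptions for illustration, not anything that exists in Hadoop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical ranking of nodes by one measured metric (e.g. "cpu" or "disk-io"). */
final class ProfileRanker {
    /**
     * Returns node names ordered from highest to lowest score for the given metric.
     * Nodes that never reported the metric are scored 0.0; ties are broken by name.
     */
    static List<String> rank(Map<String, Map<String, Double>> metricsByNode, String metric) {
        List<String> nodes = new ArrayList<>(metricsByNode.keySet());
        nodes.sort(Comparator
            .comparingDouble((String n) -> metricsByNode.get(n).getOrDefault(metric, 0.0))
            .reversed()
            .thenComparing(Comparator.naturalOrder()));
        return nodes;
    }
}
```

A compute-heavy job would ask for rank(metrics, "cpu") and a disk-heavy one for
rank(metrics, "disk-io"); how the scores are measured is left open here.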

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

