hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3973) [Umbrella JIRA] JobHistoryServer performance improvements in YARN+MR
Date Fri, 09 Mar 2012 21:00:58 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226450#comment-13226450 ]

Robert Joseph Evans commented on MAPREDUCE-3973:

I have been thinking about how to speed up the job history server, and also how we are going
to be able to cache all of the jobs that run on some of our 4000+ node clusters,
where we can run in excess of 60000 jobs a day.  It seems to me that the job history server
is already complex and is going to get even more complicated going forward as we try
to add yet another cache (MAPREDUCE-3966) and possibly one more on top of that (MAPREDUCE-3755).
What is more, all of these caches are backed by files in HDFS.  We need to keep all of these
caches consistent with one another, and with the operations that are going to happen to the
files on HDFS, and do it without having a single huge lock (MAPREDUCE-3972). I really think
that we could remove the vast majority of this complexity by changing how we cache this data.

I would like to propose that we abstract away how we cache the data with a pluggable data
access layer API.  The default implementation of this API would store the data in an embedded
version of Derby (http://db.apache.org/derby) or some other embedded SQL database that fits
our licensing requirements.  Because the layer is pluggable, users could replace Derby
with MySQL, Oracle, or even HBase with minimal effort.  I think that this would drastically
reduce the size and complexity of our code.  It would also speed up the code and web service
a lot, and it would open up the potential for us to move to a truly stateless history server.
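
As a rough illustration of the pluggable data access layer proposed above, the
sketch below shows one possible shape for such an API.  Every name in it
(JobSummary, HistoryStore, InMemoryHistoryStore) is hypothetical and is not part
of any existing Hadoop API.  The in-memory backend is only a stand-in so that the
sketch runs without a database; a Derby-backed implementation would sit behind
the same interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical summary of a finished job; the fields are illustrative only.
final class JobSummary {
    final String jobId;
    final String user;
    final long finishTimeMs;

    JobSummary(String jobId, String user, long finishTimeMs) {
        this.jobId = jobId;
        this.user = user;
        this.finishTimeMs = finishTimeMs;
    }
}

// Hypothetical data access layer.  The history server and its web services
// would talk only to this interface, so the backing store can be swapped.
interface HistoryStore {
    void put(JobSummary job);
    Optional<JobSummary> get(String jobId);
    List<JobSummary> findByUser(String user);
    void clear();
}

// Stand-in backend that keeps everything in memory.  A database-backed
// class would implement the same four methods with queries instead.
final class InMemoryHistoryStore implements HistoryStore {
    private final Map<String, JobSummary> jobs = new ConcurrentHashMap<>();

    @Override
    public void put(JobSummary job) {
        jobs.put(job.jobId, job);
    }

    @Override
    public Optional<JobSummary> get(String jobId) {
        return Optional.ofNullable(jobs.get(jobId));
    }

    @Override
    public List<JobSummary> findByUser(String user) {
        List<JobSummary> result = new ArrayList<>();
        for (JobSummary job : jobs.values()) {
            if (job.user.equals(user)) {
                result.add(job);
            }
        }
        return result;
    }

    @Override
    public void clear() {
        jobs.clear();
    }
}
```

With an interface like this, callers never see whether the data lives in Derby,
MySQL, or HBase, which is what would let the caching logic shrink to a single
implementation class per backend.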

I would love to get something like this into 0.23.3 instead of having to do much of the other
work that has been suggested.  I am not tied to this approach nor to this timeframe.  I am
looking for feedback on the idea before I file a JIRA for it.  I know that there are a lot
of potential issues here.  When using a database to store the data we will now need to provide
a mapping between the fields of our objects and database tables, or serialize objects into
blobs if we do not want to query on them. We are also going to have to eventually think about
how we migrate the schema of the database from one version to another, if we want to keep
the data in there long term.  In the short term we can treat the DB as a true cache and blow
it away each time the history server reboots.
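
To make the mapping trade-off above concrete, here is a minimal sketch, using only
the Java standard library, of splitting a job into queryable columns plus a
serialized blob.  The names JobCounters, JobRow, JobRowMapper, and SCHEMA_VERSION
are assumptions made for illustration.  The version check shows one way to treat
the store as a pure cache: a row written under a different schema version is
reported as a miss rather than migrated.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Optional;

// Hypothetical detail object that is never queried on, so it can be a blob.
final class JobCounters implements Serializable {
    private static final long serialVersionUID = 1L;
    final HashMap<String, Long> values = new HashMap<>();
}

// Hypothetical row layout: fields we query on become columns, the rest is
// stored as one opaque byte array alongside the schema version that wrote it.
final class JobRow {
    final int schemaVersion;
    final String jobId;
    final String user;
    final byte[] countersBlob;

    JobRow(int schemaVersion, String jobId, String user, byte[] countersBlob) {
        this.schemaVersion = schemaVersion;
        this.jobId = jobId;
        this.user = user;
        this.countersBlob = countersBlob;
    }
}

final class JobRowMapper {
    // Assumed version number; it would be bumped whenever the layout changes.
    static final int SCHEMA_VERSION = 1;

    // Serialize the non-queryable part into a blob and build the row.
    static JobRow toRow(String jobId, String user, JobCounters counters) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(counters);
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize counters", e);
        }
        return new JobRow(SCHEMA_VERSION, jobId, user, bytes.toByteArray());
    }

    // Decode the blob, or report a cache miss if the row is from another
    // schema version (the short-term "blow it away" approach).
    static Optional<JobCounters> countersOf(JobRow row) {
        if (row.schemaVersion != SCHEMA_VERSION) {
            return Optional.empty();
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(row.countersBlob))) {
            return Optional.of((JobCounters) in.readObject());
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize counters", e);
        }
    }
}
```

Keeping the data long term would instead require a real migration step in place
of the empty result, which is the schema migration cost mentioned above.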
> [Umbrella JIRA] JobHistoryServer performance improvements in YARN+MR
> --------------------------------------------------------------------
>                 Key: MAPREDUCE-3973
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3973
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
> Few parallel efforts are happening w.r.t improving/fixing issues with JobHistoryServer
> in MR over YARN. This is the umbrella ticket so we have the complete picture.
