hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-489) Separating user logs from system logs in map reduce
Date Wed, 06 Sep 2006 16:34:16 GMT
+1 on most points

General:  yes, simple things should be configurable

4:  (Copying log files from task trackers to HDFS)
      -- yes, it should be configurable
      -- the default should be to copy to a default subdirectory of the
job user's home directory
      -- it would be nice to copy logs to HDFS periodically -- e.g.
whenever a task ends or every NNN seconds, whichever interval is longer.
(HDFS files have to be closed before they are readable, so copying logs
to HDFS needs to be done a whole file at a time; this may be improved
later.)
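A minimal sketch of such a periodic copier, assuming the Hadoop
FileSystem API (the class and path names here are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical periodic copier: ships the whole local task log to
    // HDFS every intervalMillis, overwriting the previous copy. Whole
    // files are copied because an HDFS file only becomes readable once
    // it has been closed.
    public class UserLogCopier extends Thread {
      private final FileSystem fs;
      private final Path localLog;   // e.g. local userlogs/job_0001.log
      private final Path dfsTarget;  // e.g. /user/alice/_logs/job_0001.log
      private final long intervalMillis;

      public UserLogCopier(Configuration conf, Path localLog,
                           Path dfsTarget, long intervalMillis)
          throws java.io.IOException {
        this.fs = FileSystem.get(conf);
        this.localLog = localLog;
        this.dfsTarget = dfsTarget;
        this.intervalMillis = intervalMillis;
        setDaemon(true);
      }

      public void run() {
        try {
          while (true) {
            // delSrc=false, overwrite=true: replace the stale HDFS copy.
            fs.copyFromLocalFile(false, true, localLog, dfsTarget);
            Thread.sleep(intervalMillis);
          }
        } catch (InterruptedException e) {
          // task ended: fall through and make one final whole-file copy
        } catch (java.io.IOException e) {
          e.printStackTrace();
          return;
        }
        try {
          fs.copyFromLocalFile(false, true, localLog, dfsTarget);
        } catch (java.io.IOException e) {
          e.printStackTrace();
        }
      }
    }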

Implementing this proposal will make users' lives (my life included) much easier.

On Sep 5, 2006, at 7:15 PM, Michel Tourn (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-489?page=comments#action_12432715 ]
> Michel Tourn commented on HADOOP-489:
> -------------------------------------
> 1) One log file per job per task tracker: sounds good.
> To avoid tasks writing simultaneously to the shared job log,
> you could write to a per-task temp file, then atomically concatenate
> it onto the job file at the end.
> Ways to enforce atomicity:
> a) The TaskTracker, rather than the TaskRunner, is responsible for the
> concatenation. That way you can assume there is only one such server
> running (one per machine, one per config set or per HADOOP_IDENT_STRING).
> b) Use an interprocess locking mechanism.
> The standard way in Java is java.nio.channels.FileLock.
> Just as with pid files, you can encode the pid in the lock file's name
> to help detect orphaned lock files.
> HadoopStreaming used to have code that does b).
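> For reference, a minimal sketch of option b) (the file names and the
> pid-in-lock-file-name convention are illustrative):
>
>     import java.io.File;
>     import java.io.IOException;
>     import java.io.RandomAccessFile;
>     import java.nio.channels.FileLock;
>
>     // Illustrative only: append a finished task's temp log onto the
>     // shared per-job log under an exclusive FileLock, so concurrent
>     // TaskRunner processes cannot interleave their output.
>     public class JobLogAppender {
>       public static void appendTaskLog(File jobLog, byte[] taskLogBytes)
>           throws IOException {
>         RandomAccessFile raf = new RandomAccessFile(jobLog, "rw");
>         try {
>           FileLock lock = raf.getChannel().lock(); // blocks until exclusive
>           try {
>             raf.seek(raf.length());                // append at the end
>             raf.write(taskLogBytes);
>           } finally {
>             lock.release();
>           }
>         } finally {
>           raf.close();
>         }
>       }
>     }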
> Aside: did you mention that there is a need for an index into the
> per-machine job log?
> When the servlet API serves one task's log content, it needs to
> retrieve a *range* of the job log.
> Associated with a job-log file, there is a list of {taskid, begin
> offset, length}.
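> A minimal sketch of such an index (the names are hypothetical): the
> TaskTracker records each task's slice of the shared job log, so the
> servlet can seek straight to it.
>
>     import java.util.LinkedHashMap;
>     import java.util.Map;
>
>     // Hypothetical index over one tasktracker's per-job log file:
>     // taskid -> {begin offset, length} of that task's slice.
>     public class JobLogIndex {
>       public static class Entry {
>         public final long offset;
>         public final long length;
>         public Entry(long offset, long length) {
>           this.offset = offset;
>           this.length = length;
>         }
>       }
>
>       private final Map<String, Entry> entries =
>           new LinkedHashMap<String, Entry>();
>
>       // Recorded by the TaskTracker right after it appends a task's log.
>       public void add(String taskId, long offset, long length) {
>         entries.put(taskId, new Entry(offset, length));
>       }
>
>       // Used by the servlet to serve just one task's range.
>       public Entry lookup(String taskId) {
>         return entries.get(taskId);
>       }
>     }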
> 2) says "all errors"; 3) says "just one". Which is it?
> I would propose: either one, configurable.
> Note that there is a large variety of jobclient applications:
> every pure-Java user application, plus HadoopStreaming.
> So the canonical client-side log retrieval code should:
> A. be customizable: hooks that let you grab 1 or k or ALL task logs.
> B. should not assume it has complete control over the Java
> application's stdout/stderr: user code normally controls these.
> C. should fit as an extension of the normal job submitter pattern.
> For example, B+C:
>     TaskLogsSubscriber subscriber = new TaskLogsSubscriber(System.err); // NEW
>     while (!running_.isComplete()) {
>       Thread.sleep(POLL_INTERVAL);
>       running_ = jc_.getJob(jobId_);
>       String report = running_.toString();
>       if (!report.equals(lastReport)) {
>         System.out.println(report);
>         lastReport = report;    // only print each report once
>       }
>     }
> A: TaskLogsSubscriber API ideas.
> This API would be used in the sample JobSubmitter examples.
> A goal should be to provide useful out-of-the-box behaviour, but also
> to allow the user to customize everything, starting from the servlet
> request.
> boolean showLog(String full_task_id, boolean failed);
> // default implementation: remember the first failed id and the first
> // non-failed id; return false for all other ids.
> void printTaskLog(String full_task_id)
> // default implementation: open the TaskTracker servlet URL and pass
> // its InputStream to printLogStream.
> void printLogStream(String full_task_id, InputStream in)
> // default implementation: consume the stream and write everything to
> // System.err (of the JobSubmitter process).
> Caveat: "taskid" should be abstracted enough to address both map and
> reduce tasks, and both cluster and local-maprunner tasks.
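> A minimal sketch of that API under the defaults above (the tracker URL
> parameter and the /tasklog endpoint are assumptions added for
> illustration, not existing Hadoop interfaces):
>
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.io.PrintStream;
>     import java.net.URL;
>
>     // Hypothetical TaskLogsSubscriber with the default behaviour
>     // described above; users subclass and override the hooks.
>     public class TaskLogsSubscriber {
>       private final PrintStream out;
>       private String firstFailed, firstOk;
>
>       public TaskLogsSubscriber(PrintStream out) { this.out = out; }
>
>       // Default: show only the first failed and first non-failed task.
>       public boolean showLog(String fullTaskId, boolean failed) {
>         if (failed) {
>           if (firstFailed == null) firstFailed = fullTaskId;
>           return fullTaskId.equals(firstFailed);
>         }
>         if (firstOk == null) firstOk = fullTaskId;
>         return fullTaskId.equals(firstOk);
>       }
>
>       // Default: fetch the task's log from its TaskTracker's servlet.
>       // trackerHttp (e.g. "http://tt17:50060") is an extra parameter
>       // added here so the sketch is self-contained.
>       public void printTaskLog(String fullTaskId, String trackerHttp)
>           throws IOException {
>         URL url = new URL(trackerHttp + "/tasklog?taskid=" + fullTaskId);
>         InputStream in = url.openStream();
>         try {
>           printLogStream(fullTaskId, in);
>         } finally {
>           in.close();
>         }
>       }
>
>       // Default: copy the whole stream to the JobSubmitter's stderr.
>       public void printLogStream(String fullTaskId, InputStream in)
>           throws IOException {
>         byte[] buf = new byte[4096];
>         for (int n; (n = in.read(buf)) != -1; ) {
>           out.write(buf, 0, n);
>         }
>         out.flush();
>       }
>     }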
>> 3) iii) This would entail running a servlet on each of the
>> tasktrackers.
> Yes. And since these processes already run a Jetty instance, the  
> incremental overhead is minimal.
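> To make that concrete, a hypothetical servlet along these lines, using
> the JobLogIndex sketched earlier (the class name, constructor, and URL
> layout are all assumptions):
>
>     import java.io.IOException;
>     import java.io.OutputStream;
>     import java.io.RandomAccessFile;
>     import javax.servlet.http.HttpServlet;
>     import javax.servlet.http.HttpServletRequest;
>     import javax.servlet.http.HttpServletResponse;
>
>     // Hypothetical task-log servlet, mounted in the TaskTracker's
>     // existing Jetty instance. It serves only the requested task's
>     // byte range of the shared per-job log.
>     public class TaskLogServlet extends HttpServlet {
>       private final JobLogIndex index;
>       private final String jobLogPath;
>
>       public TaskLogServlet(JobLogIndex index, String jobLogPath) {
>         this.index = index;
>         this.jobLogPath = jobLogPath;
>       }
>
>       protected void doGet(HttpServletRequest req,
>                            HttpServletResponse resp) throws IOException {
>         JobLogIndex.Entry e = index.lookup(req.getParameter("taskid"));
>         if (e == null) {
>           resp.sendError(HttpServletResponse.SC_NOT_FOUND);
>           return;
>         }
>         resp.setContentType("text/plain");
>         RandomAccessFile raf = new RandomAccessFile(jobLogPath, "r");
>         try {
>           raf.seek(e.offset);
>           OutputStream out = resp.getOutputStream();
>           byte[] buf = new byte[4096];
>           long left = e.length;
>           while (left > 0) {
>             int n = raf.read(buf, 0, (int) Math.min(buf.length, left));
>             if (n < 0) break;
>             out.write(buf, 0, n);
>             left -= n;
>           }
>         } finally {
>           raf.close();
>         }
>       }
>     }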
>> 4) Yes, this gives the best of both worlds (real-time access and HDFS  
>> access)
> Seems that the JobConf alternative is simpler and good enough.
> I suppose you mean something like
> JobConf.setJobLogDirectory(Path dfsPath)
> If you don't call it, the logs are not moved to dfs.
> This also resolves the issue of how the log data may later become a
> map-reduce job input.
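> A hypothetical usage, assuming the proposed setter (setJobLogDirectory
> does not exist in JobConf today; the class name and path are stand-ins):
>
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.mapred.JobConf;
>
>     public class SubmitWithLogs {
>       public static void main(String[] args) {
>         JobConf conf = new JobConf(SubmitWithLogs.class);
>         // Proposed opt-in API: if this is never called, task logs stay
>         // on the local tasktracker disks and are not moved to dfs.
>         conf.setJobLogDirectory(new Path("/user/alice/joblogs"));
>         // Once copied there, the logs can later serve directly as
>         // map-reduce job input.
>       }
>     }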
>> 5) job-log files could be deleted on schedule
> Yes, the same as is done to delete the global TaskTracker logs,
> but with its own configurable deletion delay.
>> Separating user logs from system logs in map reduce
>> ---------------------------------------------------
>>                 Key: HADOOP-489
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-489
>>             Project: Hadoop
>>          Issue Type: Improvement
>>          Components: mapred
>>            Reporter: Mahadev konar
>>         Assigned To: Mahadev konar
>>            Priority: Minor
>> Currently the user logs are a part of the system logs in mapreduce.
>> Anything logged by the user is logged into the tasktracker log files.
>> This creates two issues:
>> 1) The system log files get cluttered with user output. If the user
>> outputs a large amount of logs, the system logs need to be cleaned up
>> pretty often.
>> 2) For the user, it is difficult to get to each of the machines and
>> look for the logs his/her job might have generated.
>> I am proposing three solutions to the problem. All of them have
>> issues of their own:
>> Solution 1.
>> Output the user logs on the user's screen as part of the job
>> submission process.
>> Merits -
>> This will discourage users from printing large amounts of logs, and
>> the user gets runtime feedback on what is wrong with his/her job.
>> Issues -
>> This proposal will consume framework bandwidth while running jobs for
>> the user. The user logs would need to pass from the tasks to the
>> tasktrackers, from the tasktrackers to the jobtracker, and then from
>> the jobtracker to the jobclient, using a lot of framework bandwidth
>> if the user prints out too much data.
>> Solution 2.
>> Output the user logs into a dfs directory and then concatenate these
>> files. Each task can create a file for its output in the log
>> directory for a given user and jobid.
>> Issues -
>> This will create a huge number of small files in DFS, which later
>> have to be concatenated into a single file. There is also the
>> question of who would concatenate these files into a single file.
>> This could be done by the framework (jobtracker) as part of the
>> cleanup for the jobs, but it might stress the jobtracker.
>> Solution 3.
>> Put the user logs into a separate user log file in the log directory
>> on each tasktracker. We can provide some tools to query these local
>> log files, with commands like "for jobid j and taskid t, get me the
>> user log output". These tools could run as a separate map reduce
>> program, with each map grepping the user log files and a single
>> reduce aggregating these logs into a single dfs file. (A sketch of
>> such a tool appears at the end of this message.)
>> Issues -
>> This does sound like more work for the user. Also, the output might
>> not be complete, since a tasktracker might have gone down after it
>> ran the job.
>> Any thoughts?
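For Solution 3, a minimal sketch of the grep-style query tool it
describes (the class and configuration names are hypothetical; running
the job with a single reduce gathers the matches into one dfs file):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative "grep the user logs" mapper: each map scans the
    // local user-log lines for a query string and emits the matches;
    // an identity reducer with numReduceTasks = 1 then writes them out
    // as a single dfs file.
    public class LogGrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private String query;

      public void configure(JobConf job) {
        query = job.get("loggrep.query", "ERROR"); // hypothetical key
      }

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        if (line.toString().contains(query)) {
          out.collect(new Text(query), line);
        }
      }
    }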
