hadoop-mapreduce-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.
Date Thu, 13 Aug 2009 18:41:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742921#action_12742921 ]

Doug Cutting commented on MAPREDUCE-157:

Owen> Of course reading is the reverse. It would be like writing xml files by generating
the necessary DOM objects.

Not sure what you mean.  Jackson has an event-based JSON reading API.


So, to efficiently read things back into structs you might use an enum of field names, e.g.:
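// Needs java.io.IOException plus Jackson's JsonParser and JsonToken.
class Foo { int a; String b; }
enum FooFields { A, B }
void readFoo(JsonParser parser, Foo foo) throws IOException {
  if (parser.nextToken() != JsonToken.START_OBJECT)
    throw new IOException("expected start of object");
  while (parser.nextToken() != JsonToken.END_OBJECT) {
    // The parser is on a field name here; it must match an enum constant exactly.
    FooFields field = Enum.valueOf(FooFields.class, parser.getCurrentName());
    parser.nextToken();  // advance from the field name to its value
    switch (field) {
    case A: foo.a = parser.getIntValue(); break;
    case B: foo.b = parser.getText(); break;
    }
  }
}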

FWIW, Avro supports SAX-like streaming, without object creation.  A significant change if
we used Avro would be that we'd need to store the schema with the data.  We could, for example,
make the first line of log files the schema, or write a side file, but there's not much point
to Avro data without storing a schema.
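
To make the "first line is the schema" layout concrete, here is a rough sketch. It deliberately does not use the Avro API; it only shows a schema line followed by one record per line in plain text. The class name SchemaHeaderLog, its methods, and the Foo schema are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the Avro API, just the
// "schema on the first line" layout applied to a line-oriented log.
class SchemaHeaderLog {
  // Assumed example schema in Avro's JSON schema syntax for the Foo struct.
  static final String SCHEMA =
      "{\"type\":\"record\",\"name\":\"Foo\",\"fields\":["
      + "{\"name\":\"a\",\"type\":\"int\"},"
      + "{\"name\":\"b\",\"type\":\"string\"}]}";

  /** Writes the schema as line one, then one record per line. */
  static String write(List<String> records) {
    StringBuilder out = new StringBuilder(SCHEMA).append('\n');
    for (String record : records) {
      out.append(record).append('\n');
    }
    return out.toString();
  }

  /** Returns the first line of the log, which holds the schema. */
  static String readSchema(String log) {
    int eol = log.indexOf('\n');
    if (eol < 0) throw new IllegalArgumentException("no schema line");
    return log.substring(0, eol);
  }

  /** Returns every line after the schema line. */
  static List<String> readRecords(String log) {
    int eol = log.indexOf('\n');
    if (eol < 0) throw new IllegalArgumentException("no schema line");
    String body = log.substring(eol + 1);
    List<String> records = new ArrayList<>();
    if (body.isEmpty()) return records;
    for (String line : body.split("\n")) {
      records.add(line);
    }
    return records;
  }
}
```

A reader would call readSchema once, then interpret each line from readRecords against it. A side file would work the same way, except that the schema line moves into its own file.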

Is the implicit schema proposed here Map<String,String>?  For example, would integer
values be written as JSON strings, with quotes, or as JSON integers, without quotes?  If the
schema is Map<String,String> and will be for all time, then there's less point to using
Avro.  But if fields are typed it might be nice to record the types in a schema.
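
To illustrate the quoting question, here is a small sketch that renders the same fields both ways. The class TypedVsStringJson and its render method are hypothetical, and the hand-rolled escaping covers only backslashes, double quotes, and newlines; a real writer would use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the two candidate encodings.
class TypedVsStringJson {
  /** Quotes a string, escaping backslashes, double quotes, and newlines. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\")
                   .replace("\"", "\\\"")
                   .replace("\n", "\\n") + "\"";
  }

  /**
   * Renders the fields as one JSON object. With typed == false every
   * value is written as a JSON string (the Map<String,String> reading);
   * with typed == true Integer values are written without quotes.
   */
  static String render(Map<String, Object> fields, boolean typed) {
    StringBuilder sb = new StringBuilder("{");
    for (Map.Entry<String, Object> e : fields.entrySet()) {
      if (sb.length() > 1) sb.append(',');
      sb.append(quote(e.getKey())).append(':');
      Object v = e.getValue();
      if (typed && v instanceof Integer) {
        sb.append(v);
      } else {
        sb.append(quote(String.valueOf(v)));
      }
    }
    return sb.append('}').toString();
  }

  public static void main(String[] args) {
    Map<String, Object> foo = new LinkedHashMap<>();
    foo.put("a", 42);
    foo.put("b", "x");
    System.out.println(render(foo, false)); // {"a":"42","b":"x"}
    System.out.println(render(foo, true));  // {"a":42,"b":"x"}
  }
}
```

With the first form a reader has to know out of band that "a" holds an integer; with the second the type is visible in the data, and a schema could record it explicitly.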

> Job History log file format is not friendly for external tools.
> ---------------------------------------------------------------
>                 Key: MAPREDUCE-157
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-157
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Owen O'Malley
>            Assignee: Jothi Padmanabhan
>
> Currently, parsing the job history logs with external tools is very difficult because
> of the format. The most critical problem is that newlines aren't escaped in the strings. That
> makes using tools like grep, sed, and awk very tricky.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
