avro-dev mailing list archives

From "Yong Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AVRO-1953) ArrayIndexOutOfBoundsException in org.apache.avro.io.parsing.Symbol$Alternative.getSymbol
Date Mon, 07 Nov 2016 14:44:59 GMT
Yong Zhang created AVRO-1953:
--------------------------------

             Summary: ArrayIndexOutOfBoundsException in org.apache.avro.io.parsing.Symbol$Alternative.getSymbol
                 Key: AVRO-1953
                 URL: https://issues.apache.org/jira/browse/AVRO-1953
             Project: Avro
          Issue Type: Bug
    Affects Versions: 1.7.4
            Reporter: Yong Zhang


We are facing an issue where our Avro MapReduce job cannot process the Avro file in the reducer.

Here is the schema of our data:

{
    "namespace" : "our package name",
    "type" : "record",
    "name" : "Lists",
    "fields" : [
        {"name" : "account_id", "type" : "long"},
        {"name" : "list_id", "type" : "string"},
        {"name" : "sequence_id", "type" : ["int", "null"]},
        {"name" : "name", "type" : ["string", "null"]},
        {"name" : "state", "type" : ["string", "null"]},
        {"name" : "description", "type" : ["string", "null"]},
        {"name" : "dynamic_filtered_list", "type" : ["int", "null"]},
        {"name" : "filter_criteria", "type" : ["string", "null"]},
        {"name" : "created_at", "type" : ["long", "null"]},
        {"name" : "updated_at", "type" : ["long", "null"]},
        {"name" : "deleted_at", "type" : ["long", "null"]},
        {"name" : "favorite", "type" : ["int", "null"]},
        {"name" : "delta", "type" : ["boolean", "null"]},
        {
            "name" : "list_memberships", "type" : {
                "type" : "array", "items" : {
                    "name" : "ListMembership", "type" : "record",
                    "fields" : [
                        {"name" : "channel_id", "type" : "string"},
                        {"name" : "created_at", "type" : ["long", "null"]},
                        {"name" : "created_source", "type" : ["string", "null"]},
                        {"name" : "deleted_at", "type" : ["long", "null"]},
                        {"name" : "sequence_id", "type" : ["int", "null"]}
                    ]
                }
            }
        }
    ]
}
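Most of the fields above are two-branch unions with "null" as the second branch, which matters for how the decoder resolves union indices. A minimal stdlib-only sketch that inspects that layout (the schema is trimmed to a few representative fields; the full record and the nested `list_memberships` array are omitted for brevity):

```python
import json

# Trimmed copy of the schema from this report (most fields and the
# nested list_memberships array omitted for brevity)
SCHEMA = json.loads("""
{
    "type": "record",
    "name": "Lists",
    "fields": [
        {"name": "account_id", "type": "long"},
        {"name": "list_id", "type": "string"},
        {"name": "sequence_id", "type": ["int", "null"]},
        {"name": "name", "type": ["string", "null"]},
        {"name": "delta", "type": ["boolean", "null"]}
    ]
}
""")

def union_fields(schema):
    """Return {field name: union branches} for every union-typed field."""
    return {f["name"]: f["type"]
            for f in schema["fields"]
            if isinstance(f["type"], list)}

unions = union_fields(SCHEMA)
# Every optional field lists "null" as the SECOND branch, so a decoder
# reading a union index expects 0 -> concrete type, 1 -> null.
```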

Our MapReduce job computes the delta of the above dataset and uses our merge logic to merge
the latest changes into the dataset.
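The merge step can be sketched as "keep the most recent record per key". The real merge code is not shown in this report, and the field names `list_id` and `updated_at` below are assumptions taken from the schema, so this is only an illustrative stand-in:

```python
from operator import itemgetter

def merge_latest(records, key="list_id", ts="updated_at"):
    """Keep only the most recent record per key -- a stand-in for the
    merge logic described in the report (field names are assumptions)."""
    latest = {}
    for rec in sorted(records, key=itemgetter(ts)):
        latest[rec[key]] = rec  # later timestamps overwrite earlier ones
    return list(latest.values())
```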

The whole MR job runs daily and has worked fine for 18 months. During this time, we have seen
the merge MapReduce job fail twice with the following error. It fails in the reducer stage,
which means the Avro data is read successfully and sent to the reducers, where we sort the
data by key and timestamp so the delta can be merged on the reducer side:

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
    at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
    at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(AccessController.java:366)
    at javax.security.auth.Subject.doAs(Subject.java:572)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
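For context on the trace: in the Avro binary format a union value is encoded as a zig-zag-encoded long index followed by the value of the selected branch, and `ResolvingDecoder.readIndex` resolves that index against the branch list. An index past the end of the list (from corrupted bytes or a writer/reader schema mismatch) produces exactly this kind of out-of-range access. A minimal Python sketch of the same failure mode, not Avro's actual implementation:

```python
def read_long(buf, pos=0):
    """Decode one Avro zig-zag varint long; returns (value, new position)."""
    shift = acc = 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos

branches = ["int", "null"]          # a 2-branch union like those in the schema

idx, _ = read_long(bytes([0x02]))   # zig-zag 0x02 decodes to index 1
assert branches[idx] == "null"      # a well-formed union index resolves fine

idx, _ = read_long(bytes([0x04]))   # zig-zag 0x04 decodes to index 2
# branches[idx] would now raise IndexError -- the Python analogue of the
# ArrayIndexOutOfBoundsException in Symbol$Alternative.getSymbol
```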

The MapReduce job eventually fails in the reducer stage. I don't think our data is corrupted,
since it is read fine in the map stage. Every time we hit this error, we have to fetch the whole
huge dataset from the source, rebuild the Avro files, and restart the daily merge, until after
several months we face this issue again for a reason we don't know yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
