hadoop-pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1285) Allow SingleTupleBag to be serialized
Date Mon, 15 Mar 2010 20:55:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845521#action_12845521 ]

Dmitriy V. Ryaboy commented on PIG-1285:
----------------------------------------

Thanks for the feedback.

Looking at the code, write() and readFields() are actually implemented in DefaultAbstractBag,
and have no dependencies on the memory manager. Is there a good reason not to allow deserialization
of SingleTupleBags? It seems to me that we can simply change SingleTupleBag to extend DefaultAbstractBag
and remove its own write() and readFields() methods, letting the defaults take care of (de)serialization.
Everything else would remain as-is, since SingleTupleBag currently implements the complete
interface and will therefore override anything memory-related that DefaultAbstractBag does.
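
To make the proposal concrete, here is a rough, self-contained sketch of the idea. It is not Pig code: AbstractBagSketch and SingleItemBagSketch are hypothetical stand-ins for DefaultAbstractBag and SingleTupleBag, items are plain Strings instead of Tuples, and the wire format is invented for illustration. The method names write and readFields follow Hadoop's Writable interface. The point is only that a single-item bag with no serialization code of its own can round-trip through defaults inherited from the base class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for DefaultAbstractBag: owns the shared (de)serialization logic.
abstract class AbstractBagSketch implements Iterable<String> {
    public abstract void add(String item);
    public abstract long size();
    public abstract void clear();

    // Default serialization: item count first, then each item.
    // Uses only the abstract methods and the iterator, nothing memory-related.
    public void write(DataOutput out) throws IOException {
        out.writeLong(size());
        for (String item : this) {
            out.writeUTF(item);
        }
    }

    // Default deserialization: reset the bag, then re-add every item read.
    public void readFields(DataInput in) throws IOException {
        clear();
        long count = in.readLong();
        for (long i = 0; i < count; i++) {
            add(in.readUTF());
        }
    }
}

// Hypothetical stand-in for SingleTupleBag: one slot, no write/readFields of its own.
class SingleItemBagSketch extends AbstractBagSketch {
    private String item; // null means the bag is empty

    @Override public void add(String newItem) { item = newItem; }
    @Override public long size() { return item == null ? 0 : 1; }
    @Override public void clear() { item = null; }

    @Override public Iterator<String> iterator() {
        List<String> view = new ArrayList<>();
        if (item != null) {
            view.add(item);
        }
        return view.iterator();
    }
}

class BagRoundTripDemo {
    // Serializes a single-item bag to bytes and reads it back into a fresh bag.
    static String roundTrip(String value) throws IOException {
        SingleItemBagSketch original = new SingleItemBagSketch();
        original.add(value);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        SingleItemBagSketch copy = new SingleItemBagSketch();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy.iterator().next();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello"));
    }
}
```

Whether the real classes line up this cleanly depends on what else DefaultAbstractBag does in its constructor and fields, so treat this as an illustration of the shape of the change rather than the patch itself.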

What do you think?

> Allow SingleTupleBag to be serialized
> -------------------------------------
>
>                 Key: PIG-1285
>                 URL: https://issues.apache.org/jira/browse/PIG-1285
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.7.0
>
>         Attachments: PIG-1285.patch
>
>
> Currently, Pig uses a SingleTupleBag for efficiency when a full-blown spillable bag implementation
> is not needed in the Combiner optimization.
> Unfortunately this can create problems. The below Initial.exec() code fails at run-time
> with the message that a SingleTupleBag cannot be serialized:
> {code}
> @Override
> public Tuple exec(Tuple in) throws IOException {
>   // single record. just copy.
>   if (in == null) return null;
>   try {
>     Tuple resTuple = tupleFactory_.newTuple(in.size());
>     for (int i = 0; i < in.size(); i++) {
>       resTuple.set(i, in.get(i));
>     }
>     return resTuple;
>   } catch (IOException e) {
>     log.warn(e);
>     return null;
>   }
> }
> {code}
> The code below can fix the problem in the UDF, but it seems like something that should
> be handled transparently, not requiring UDF authors to know about SingleTupleBags.
> {code}
> @Override
> public Tuple exec(Tuple in) throws IOException {
>       // single record. just copy.
>       if (in == null) return null;   
>       
>       /*
>        * Unfortunately SingleTupleBags are not serializable. We cache whether a given
>        * index contains a bag in the map below, and copy all bags into DefaultBags
>        * before returning to avoid serialization exceptions.
>        */
>       Map<Integer, Boolean> isBagAtIndex = Maps.newHashMap();
>       
>       try {
>         Tuple resTuple = tupleFactory_.newTuple(in.size());
>         for (int i=0; i< in.size(); i++) {
>           Object obj = in.get(i);
>           if (!isBagAtIndex.containsKey(i)) {
>             isBagAtIndex.put(i, obj instanceof SingleTupleBag);
>           }
>           if (isBagAtIndex.get(i)) {
>             DataBag newBag = bagFactory_.newDefaultBag();
>             newBag.addAll((DataBag)obj);
>             obj = newBag;
>           }
>           resTuple.set(i, obj);
>         }
>         return resTuple;
>       } catch (IOException e) {
>         log.warn(e);
>         return null;
>       }
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

