crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-329) Re-add type info to TupleWritable to make fields sort correctly
Date Thu, 23 Jan 2014 12:22:37 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879886#comment-13879886
] 

Gabriel Reid commented on CRUNCH-329:
-------------------------------------

The general approach of the patch looks good to me, but if I'm reading things correctly I think
there will be some issues with the custom serialization codes that can be added to Writables.WRITABLE_CODES
for custom WritableComparables.

The first issue is that the serialization code for a class depends on the order in which Writables.writables()
was called, so a new program (or even an altered version of the original) might not be able to read a
previously-created PCollection of TupleWritables if the order of the Writables.writables() calls has changed.
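
To illustrate the ordering problem, here is a minimal sketch. It assumes codes are handed out
sequentially in registration order; the class, the register() helper, the starting code 100 and the
com.example class names are all hypothetical stand-ins, not the actual patch code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDependentCodes {
    // Hypothetical stand-in for Writables.WRITABLE_CODES: each class gets the
    // next free integer code, in the order it is registered.
    static Map<Integer, String> register(int firstCustomCode, String... classNames) {
        Map<Integer, String> codes = new LinkedHashMap<>();
        int next = firstCustomCode;
        for (String name : classNames) {
            codes.put(next++, name);
        }
        return codes;
    }

    public static void main(String[] args) {
        // The program that wrote the data registered KeyA first...
        Map<Integer, String> writer = register(100, "com.example.KeyA", "com.example.KeyB");
        // ...while the program reading it back registers KeyB first.
        Map<Integer, String> reader = register(100, "com.example.KeyB", "com.example.KeyA");

        // The same stored code now resolves to a different class.
        System.out.println("code 100 written as: " + writer.get(100));
        System.out.println("code 100 read as:    " + reader.get(100));
    }
}
```

With assignment like this, a field written with code 100 as KeyA would be decoded as KeyB by the second program.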

The other issue I think I see is that the state of Writables.WRITABLE_CODES that is set up
on the machine submitting the job won't be the same as its state when it's running on a remote
JVM on the cluster, so it looks to me like custom WritableComparables won't work at all when
running on a real cluster.

I can think of a couple of ways to get around these issues, but I'm not wild about either
of them:
* don't handle custom WritableComparables as a special case at all, and just don't support
them (possibly throwing an exception if you try to do a secondary sort with them)
* use some kind of hashing algorithm on the class name to generate the serialization codes
for custom WritableComparable classes, and store the mapping from serialization code to class
name in the Configuration. With this one we have to watch out for code collisions, but we could
just fail fast if one of those happens.
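
A rough sketch of the second option, under a few assumptions: a plain Map<String, String> stands in
for the Hadoop Configuration, CRC32 is just one possible choice of hash, and the class name, the
"crunch.writable.code." key prefix and the helper methods are all hypothetical. It also ignores
keeping the hashed codes clear of whatever codes the built-in types use.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

public class HashedWritableCodes {
    // Hypothetical configuration key prefix; not an existing Crunch property.
    static final String KEY_PREFIX = "crunch.writable.code.";

    // Derive the code from the class name alone, so it does not depend on
    // the order in which classes are registered.
    static int codeFor(String className) {
        CRC32 crc = new CRC32();
        crc.update(className.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }

    // Record code -> class name in the configuration stand-in, failing fast
    // if a different class already owns the same code.
    static int register(Map<String, String> conf, String className) {
        int code = codeFor(className);
        String existing = conf.putIfAbsent(KEY_PREFIX + code, className);
        if (existing != null && !existing.equals(className)) {
            throw new IllegalStateException("Serialization code " + code
                + " collides: " + existing + " vs " + className);
        }
        return code;
    }

    // What a task on a remote JVM would do: resolve a code using only the
    // configuration it was shipped, not any static state from the client.
    static String lookup(Map<String, String> conf, int code) {
        return conf.get(KEY_PREFIX + code);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        int a = register(conf, "com.example.KeyA");
        int b = register(conf, "com.example.KeyB");
        System.out.println(a + " -> " + lookup(conf, a));
        System.out.println(b + " -> " + lookup(conf, b));
    }
}
```

Because the mapping travels with the job configuration rather than living in static state on the
submitting machine, something along these lines would cover the remote-JVM issue as well as the
ordering issue.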

> Re-add type info to TupleWritable to make fields sort correctly
> ---------------------------------------------------------------
>
>                 Key: CRUNCH-329
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-329
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.10.0, 0.8.3
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.10.0, 0.8.3
>
>         Attachments: fix-ss-writables.patch
>
>
> Secondary sorts aren't currently working correctly for Writable types after we hacked
> the TupleWritable impl to make all of the fields BytesWritables (e.g., secondary IntWritable
> values will no longer be sorted correctly, even though everything is still grouped correctly.)
> The least-bad way that I came up with to fix this is to use integer codes for each possible
> WritableComparable type in a pipeline that we can use to decode what Writable type each tuple
> field corresponds to. This allows us to keep the various fields sortable while still doing
> a reasonable job of minimizing the serialization required to pass the type information along.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
