crunch-dev mailing list archives

From "Chao Shi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-173) Make WritableTypeFamily more compact for composite types
Date Tue, 26 Nov 2013 02:17:35 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832201#comment-13832201 ]

Chao Shi commented on CRUNCH-173:
---------------------------------

With the patch, my pipeline finished in 5 hours; without it, the same pipeline used to take
more than a day. This speedup is consistent with my previous test at a much smaller scale.

bq. Chao Shi If you can add it here, it would be super-interesting to hear more about your
test case pipeline (i.e. the size of the tuples that you're writing, etc).

We are using Crunch to build index shards for a search service. The most time-consuming stage
is building the posting lists. In one shard (i.e. one reducer), there are ~1 billion small
records. Each record is an entry in a posting list. The sort key is term, then doc no. Term
and doc no are both longs.
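
For illustration only, here is a minimal sketch of how a (term, doc no) key made of two longs
could be packed into 16 raw bytes so that plain byte-wise comparison sorts by term first and
doc no second. This is not taken from the CRUNCH-173 patch or from our pipeline; the class
PostingKey and its methods are hypothetical names, and the sign-bit flip is just one common
way to make byte order agree with signed numeric order.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch or the CRUNCH-173 patch.
public class PostingKey {

    // Pack (term, docNo) into 16 big-endian bytes. Flipping the sign bit of each
    // long makes unsigned byte-wise order agree with signed numeric order.
    static byte[] encode(long term, long docNo) {
        return ByteBuffer.allocate(16)
                .putLong(term ^ Long.MIN_VALUE)
                .putLong(docNo ^ Long.MIN_VALUE)
                .array();
    }

    // Recover the term from the first 8 bytes of an encoded key.
    static long term(byte[] key) {
        return ByteBuffer.wrap(key).getLong(0) ^ Long.MIN_VALUE;
    }

    // Recover the doc no from the last 8 bytes of an encoded key.
    static long docNo(byte[] key) {
        return ByteBuffer.wrap(key).getLong(8) ^ Long.MIN_VALUE;
    }

    // Lexicographic comparison of unsigned bytes, as a raw byte comparator would do.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>();
        keys.add(encode(2L, 0L));
        keys.add(encode(1L, 7L));
        keys.add(encode(1L, 5L));
        keys.sort(PostingKey::compareRaw);
        // Print the keys in sorted order: by term, then by doc no.
        for (byte[] key : keys) {
            System.out.println(term(key) + " " + docNo(key));
        }
    }
}
```

Each key in this sketch is a fixed 16 bytes with no per-field type or length information,
and it can be compared without deserializing the two longs.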

> Make WritableTypeFamily more compact for composite types
> --------------------------------------------------------
>
>                 Key: CRUNCH-173
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-173
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-173.patch, CRUNCH-173b.patch
>
>
> I'm throwing this out as something of a strawman JIRA: it's always bugged me how verbose
> the serialization of TupleWritable et al. are compared to the Avro formats, so I took a crack
> at changing their underlying serialization to be more compact by doing more things in terms
> of BytesWritable and using the wrapping MapFns in order to do more of the de-serialization
> work. Patch is attached, if anyone is interested in this or has an opinion on whether or not
> this is a good idea, I'd love to hear it. The big pro is that Crunch jobs that have to use
> writables will run faster as a result, the downside is that it's not backwards compatible
> and it makes the code more complex.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
