crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-450) Adding ORC file format support in Crunch
Date Mon, 28 Jul 2014 19:20:39 GMT


Josh Wills commented on CRUNCH-450:

Wow- that is a phenomenal amount of work- thanks for sending it along! A couple of high-level questions:

1) What does OrcTypeFamily buy me? We've flirted with expanding the set of TypeFamilies from
Avro and Writable in the past, but have always been cautious about actually doing it b/c the
two-typefamily assumption is baked into so many things in the system. If everything in Orc
is compiled down to a type of Writable, would it still work as a collection of derived PTypes
on top of the WritableTypeFamily?
2) We also try to avoid large and complex external dependencies in crunch-core-- could we
move this into a new submodule, crunch-hive, which would contain all of our Hive dependency
stuff? I think there's more of it that we want to include (e.g., CRUNCH-340) and a few other
things I wouldn't mind having down the line, but I don't want to introduce the dependency
complexity for pipelines that don't actually make use of Hive stuff.

> Adding ORC file format support in Crunch
> ----------------------------------------
>                 Key: CRUNCH-450
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core, IO
>            Reporter: Wang Zhong
>            Assignee: Josh Wills
>         Attachments: CRUNCH-450.patch
> This JIRA adds ORC file format support in Crunch by:
> --
> 1. Adding input source and output target for ORC
> 2. Adding a new type family - OrcTypeFamily to serialize / deserialize objects into OrcStruct
> 3. Supporting column pruning optimization

This message was sent by Atlassian JIRA
