crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-165) Pipelines should automatically use CombineFileInputFormat where input consists of many small files
Date Mon, 25 Nov 2013 04:46:36 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831173#comment-13831173 ]

Josh Wills commented on CRUNCH-165:
-----------------------------------

Shouldn't HFileInputFormat just turn on DISABLE_COMBINE_FILE? That's what we do for e.g. ParquetFileFormat.

> Pipelines should automatically use CombineFileInputFormat where input consists of many small files
> --------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-165
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-165
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-165-jwills.patch, CRUNCH-165-v3.patch, CRUNCH-165-v4.patch, CRUNCH-165.patch
>
>
> Hive had a feature introduced in HIVE-74 whereby CombineFileInputFormat would be used
> if the input data consisted of many small files, making the resulting mapreduce jobs more
> efficient by giving individual mappers more data to process. This would be a nice feature
> for Crunch to have, too.
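
For illustration only, here is a minimal plain-Java sketch of the idea described in the issue: packing many small files into fewer, larger splits so that each mapper gets more data, with a switch to turn the behavior off for formats that cannot be combined. This is not Crunch's or Hadoop's code. The class name CombineSketch, the method planSplits, and the disableCombine parameter are all hypothetical, and the real CombineFileInputFormat also takes block locality (node and rack) into account, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSketch {

    /**
     * Groups file sizes (in bytes) into splits of at most maxSplitBytes each.
     * A single file larger than the limit still gets a split of its own.
     * When disableCombine is true, every file becomes its own split, which
     * mimics opting a format out of combining (hypothetical flag).
     */
    static List<List<Long>> planSplits(List<Long> fileSizes, long maxSplitBytes,
                                       boolean disableCombine) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;

        for (long size : fileSizes) {
            // Close the current split if combining is off or the file would overflow it.
            boolean overflow = currentBytes + size > maxSplitBytes;
            if (!current.isEmpty() && (disableCombine || overflow)) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        // With combining: [10, 20, 30], [40], [50]
        System.out.println(planSplits(sizes, 64L, false));
        // Without combining: one split per file
        System.out.println(planSplits(sizes, 64L, true));
    }
}
```

With a 64-byte limit, the five files above collapse into three splits when combining is on and stay as five when it is off, so fewer mappers are launched for the same input.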



--
This message was sent by Atlassian JIRA
(v6.1#6144)
