crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-331) Change default settings for CombineFileInputFormat
Date Mon, 03 Feb 2014 18:28:09 GMT


Josh Wills commented on CRUNCH-331:

Would love some input on how best to fix this; thoughts include:

1) Switch the default behavior of the control parameter crunch.disable.combine.file to "true",
and override it to false in text/seq/avro format classes.
Pros: probably the least invasive code change, and will only enable combine files when the
source developer knows it's safe to do it.
Cons: I have an allergy to config parameters that default to "true" instead of "false" from
my Google days, which might be worth overlooking in this instance. Also, we would slow down
(but not break) any jobs that were using some other FileInputFormat without being aware of
the config file change. That's not the end of the world (the behavior could be re-instated
with a command-line flag), but it's going to cause some confusion.
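
To make option 1 concrete, here is a minimal sketch of the lookup logic. It assumes a plain string map as a stand-in for the job configuration; the class and method names are hypothetical and are not taken from the Crunch codebase. Only the parameter name crunch.disable.combine.file comes from the discussion above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of option 1; not Crunch's actual code.
// A plain Map stands in for the job configuration.
final class CombineFileDefaults {
    // Parameter name as given in the message.
    static final String DISABLE_COMBINE_FILE = "crunch.disable.combine.file";

    // Option 1: when nothing is set, the parameter defaults to "true",
    // so combining is off unless someone turns it on.
    static boolean combineEnabled(Map<String, String> conf) {
        String value = conf.getOrDefault(DISABLE_COMBINE_FILE, "true");
        return !Boolean.parseBoolean(value);
    }

    // What a text/seq/avro source could do: override to "false",
    // but leave any value the user already set untouched.
    static void optInKnownSafeFormat(Map<String, String> conf) {
        conf.putIfAbsent(DISABLE_COMBINE_FILE, "false");
    }
}
```

With this shape, a custom FileInputFormat that never calls optInKnownSafeFormat simply gets no combining, which is the "slower but not broken" outcome listed under Cons.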

2) Only enable combine file input formats for the FileInputFormat subclasses we know we can
support (text/seq/avro), and leave the config flag as it is to control usage.
Pros: no config changes required, other FileInputFormat extensions work properly.
Cons: Would need some way for other FileInputFormats to signal that they were combine-able
if they're not one of the defaults, which probably means introducing another config parameter.
So we would have two config parameters doing really similar-but-not-quite-identical things,
which also isn't great.
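
A rough sketch of how option 2 might look, again using a plain string map in place of the job configuration. The second parameter name, crunch.combine.file.extra.formats, is invented here purely to illustrate the "another config parameter" mentioned under Cons, and the format labels are illustrative strings rather than real class names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of option 2; not Crunch's actual code.
final class CombineFileAllowList {
    // Existing flag, keeping its current default of "false".
    static final String DISABLE_COMBINE_FILE = "crunch.disable.combine.file";
    // Invented name for the extra opt-in parameter (an assumption).
    static final String EXTRA_COMBINABLE = "crunch.combine.file.extra.formats";

    // Illustrative labels for the formats known to be safe to combine.
    private static final Set<String> KNOWN =
        new HashSet<>(Arrays.asList("text", "seq", "avro"));

    static boolean combineEnabled(Map<String, String> conf, String format) {
        // The existing flag still switches combining off globally.
        if (Boolean.parseBoolean(conf.getOrDefault(DISABLE_COMBINE_FILE, "false"))) {
            return false;
        }
        if (KNOWN.contains(format)) {
            return true;
        }
        // Other formats must signal that they are combine-able.
        for (String name : conf.getOrDefault(EXTRA_COMBINABLE, "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && trimmed.equals(format)) {
                return true;
            }
        }
        return false;
    }
}
```

The two parameters overlap in exactly the way described above: one disables combining everywhere, the other enables it for specific formats.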

> Change default settings for CombineFileInputFormat
> --------------------------------------------------
>                 Key: CRUNCH-331
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.9.0, 0.8.2
>            Reporter: Josh Wills
> Currently, we default to enabling the CombineFileInputFormat settings for any extensions
> of FileSourceImpl b/c it tends to improve performance for common file formats like text, sequence
> files, and Avro files. However, this default has caused problems for formats like Parquet
> and for custom file formats that have complex split logic.
> This JIRA is to track modifying the default combine file settings in at least some contexts,
> such as with From.formattedFile for custom input formats.

This message was sent by Atlassian JIRA
