incubator-crunch-dev mailing list archives

From "Matthias Friedrich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-97) Add helpers for parsing PCollection<String> instances
Date Wed, 17 Oct 2012 16:08:03 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-97?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477986#comment-13477986
] 

Matthias Friedrich commented on CRUNCH-97:
------------------------------------------

Looks cool! Having parsed way too much text myself, there are a few things I'm missing. Right
now there doesn't seem to be much in the way of error and missing-value handling (I noticed
none in the test case, at least). To make this universally applicable (which would be the goal
for o.a.c.lib, as opposed to contrib), we'd need a bit more support for dealing with crappy
data.

At work we increment a separate counter for each field that has an invalid value and a different
counter for records that are completely broken. This helps a lot with monitoring data streams
over time. Also, my experience with Java 5 (I never re-measured this) was that throwing multiple
exceptions per record when dealing with crappy data significantly slows down processing,
even in situations where you'd think I/O should totally dominate. I've seen 600% increases
in runtime in pathological situations (throwing exceptions was fast in Java 5, but creating
the stack traces wasn't).
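
To make the counter idea concrete, here is a rough sketch in plain Java. It is not
the API from the patch: the class name, the three field names, and the counter names
are all made up for illustration, and in a real job the tallies would go to Hadoop
counters rather than a map. Fields are validated up front, so no exception is thrown
(and no stack trace is built) for bad input.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: tallies bad fields and broken records instead of throwing. */
final class LenientLineParser {
    // Assumed layout: three tab-separated integer columns (names are illustrative).
    private static final String[] FIELDS = {"id", "count", "score"};

    // Stand-in for job counters; a real pipeline would report these to the framework.
    private final Map<String, Long> counters = new TreeMap<>();

    private void increment(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    /** True only for an optional minus sign followed by 1 to 9 digits (cannot overflow an int). */
    private static boolean looksLikeInt(String s) {
        int start = s.startsWith("-") ? 1 : 0;
        int digits = s.length() - start;
        if (digits < 1 || digits > 9) {
            return false;
        }
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /**
     * Parses one line. An invalid or missing field becomes null and bumps that
     * field's counter; a record with the wrong number of columns bumps the
     * "records.malformed" counter and yields null.
     */
    public Integer[] parse(String line) {
        String[] parts = line.split("\t", -1); // -1 keeps trailing empty fields
        if (parts.length != FIELDS.length) {
            increment("records.malformed");
            return null;
        }
        Integer[] values = new Integer[FIELDS.length];
        for (int i = 0; i < parts.length; i++) {
            if (looksLikeInt(parts[i])) {
                values[i] = Integer.parseInt(parts[i]); // safe: already validated
            } else {
                increment("fields.invalid." + FIELDS[i]);
            }
        }
        return values;
    }

    public Map<String, Long> counters() {
        return counters;
    }
}
```

The pre-check is deliberately stricter than Integer.parseInt (it rejects a leading
"+" and any value longer than nine digits), which is the price of never hitting the
exception path.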

A few things from the nitpicking category: I'd move the inner classes to their own files to
make things easier to read, and maybe move the implementations to an Extractors class (Guava style);
the private stuff could be made package-private. We could also use a package-info.java file
for the javadocs. Finally, the CRUNCH-97 marker is missing from the commit messages (you can squash
all three commits together using "git rebase -i", which lets you edit the messages, too).

                
> Add helpers for parsing PCollection<String> instances
> -----------------------------------------------------
>
>                 Key: CRUNCH-97
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-97
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.4.0
>
>         Attachments: CRUNCH-97.patch, CRUNCH-97-take2.patch
>
>
> We should make it a bit easier to parse delimited text files into specific data types
> (e.g., ints, floats, etc.) or combinations of types -- e.g., pairs of strings and ints, a Tuple3
> of booleans, etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
