incubator-crunch-dev mailing list archives

From Matthias Friedrich <m...@mafr.de>
Subject Re: Review Request: Latest take on CRUNCH-97, text parsing lib for Crunch
Date Fri, 30 Nov 2012 20:23:21 GMT
Hi Josh,

Sorry for taking so long. I've checked the code and I'm confident that
it works as intended. There are a few things about handling and
reporting missing data that I'd like to look into, but what we have
now is great stuff already.

Based on my own use cases I'd like to discuss a slightly different
solution. I'm not sure if my requirements are esoteric, but perhaps
that's what others need, too.

The data I work with usually comes in CSV format with many rows and
several hundred attributes. I have a schema for the file, so I can
easily map column names to column numbers. When I process the file I
am often interested in only a small subset of columns. I would want
a parsing library to take a row of the file and give me a Tuple with
only the columns I'm interested in. Since we don't have a NamedTuple
abstraction, I don't want the Tuple to contain too much data that
I don't need.

This is possible with the current implementation, but the Scanner
stuff looks a bit unwieldy when it comes to skipping data. Here's how
I would like to specify the extraction process:

  Parse.parse(data, tokenizer,
        xtuple(xstring(0), xint(7), xboolean(3), xdouble(9)))

The int argument specifies the column number to extract the data from.
This approach would work best if we just take the input record and
turn it into a sequence of tokens. We could offer alternative
strategies for tokenizing, like regex for log parsing (we pull out the
groups specified in the pattern) or simple splitting at a static
or regex delimiter. The extractors get the sequence of tokens passed
in and take whatever they need.
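
To make the idea concrete, here is a minimal sketch in plain Java. It is not Crunch code: the Tokenizer and ColumnExtractor interfaces and the splitOn and regexGroups helpers are hypothetical names invented for illustration, the xstring/xint/xboolean/xdouble/xtuple factories only mirror the call proposed above, and a plain List<Object> stands in for Crunch's Tuple. The tokenizer argument is assumed to be applied to each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ColumnExtractSketch {

  /** Hypothetical: turns one input record into a sequence of tokens. */
  interface Tokenizer {
    List<String> tokenize(String record);
  }

  /** Hypothetical: takes whatever it needs from the token sequence. */
  interface ColumnExtractor<T> {
    T extract(List<String> tokens);
  }

  /** Splits at a static delimiter; limit -1 keeps trailing empty columns. */
  static Tokenizer splitOn(String delimiter) {
    String quoted = Pattern.quote(delimiter);
    return record -> Arrays.asList(record.split(quoted, -1));
  }

  /** Regex strategy for log lines: the tokens are the pattern's capture groups. */
  static Tokenizer regexGroups(String regex) {
    Pattern pattern = Pattern.compile(regex);
    return record -> {
      Matcher m = pattern.matcher(record);
      List<String> groups = new ArrayList<>();
      if (m.matches()) {
        for (int i = 1; i <= m.groupCount(); i++) {
          groups.add(m.group(i));
        }
      }
      return groups; // empty if the record does not match
    };
  }

  // Each factory takes the column number to read from.
  static ColumnExtractor<String> xstring(int col) {
    return tokens -> tokens.get(col);
  }

  static ColumnExtractor<Integer> xint(int col) {
    return tokens -> Integer.parseInt(tokens.get(col).trim());
  }

  static ColumnExtractor<Boolean> xboolean(int col) {
    return tokens -> Boolean.parseBoolean(tokens.get(col).trim());
  }

  static ColumnExtractor<Double> xdouble(int col) {
    return tokens -> Double.parseDouble(tokens.get(col).trim());
  }

  /** Combines extractors; the result holds only the requested columns. */
  static ColumnExtractor<List<Object>> xtuple(ColumnExtractor<?>... parts) {
    return tokens -> {
      List<Object> out = new ArrayList<>();
      for (ColumnExtractor<?> part : parts) {
        out.add(part.extract(tokens));
      }
      return out;
    };
  }

  /** Tokenizes the record once, then hands the tokens to the extractor. */
  static <T> T parse(String record, Tokenizer tokenizer, ColumnExtractor<T> extractor) {
    return extractor.extract(tokenizer.tokenize(record));
  }

  public static void main(String[] args) {
    String row = "alice,x,x,true,x,x,x,42,x,3.5";
    List<Object> tuple = parse(row, splitOn(","),
        xtuple(xstring(0), xint(7), xboolean(3), xdouble(9)));
    System.out.println(tuple); // prints [alice, 42, true, 3.5]
  }
}
```

The sketch does no error handling for missing or malformed columns; a real version would need a policy for that, along the lines of the missing-data reporting mentioned earlier.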

I'm a bit busy right now, but I'd be happy to help out with some code if
you want. It would probably take a while until I can make some time, though.

What do you think?

Regards,
  Matthias

On Sunday, 2012-11-25, Josh Wills wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8151/
> -----------------------------------------------------------
> 
> (Updated Nov. 25, 2012, 7:21 p.m.)
> 
> 
> Review request for crunch.
> 
> 
> Changes
> -------
> 
> Incorporated feedback from Matthias and Gabriel; added a bunch of javadoc.
> 
> 
> Description
> -------
> 
> Latest and greatest rev of the extraction library for text parsing. I ended up refactoring
> the approach so that we could support nested parsing (e.g., using different Scanner instances
> for different parts of a line) and collections of items on a single line.
> 
> 
> This addresses bug CRUNCH-97.
>     https://issues.apache.org/jira/browse/CRUNCH-97
> 
> 
> Diffs (updated)
> -----
> 
>   crunch/src/main/java/org/apache/crunch/lib/PTables.java e788656
>   crunch/src/main/java/org/apache/crunch/lib/text/AbstractCompositeExtractor.java PRE-CREATION
>   crunch/src/main/java/org/apache/crunch/lib/text/AbstractSimpleExtractor.java PRE-CREATION
>   crunch/src/main/java/org/apache/crunch/lib/text/Extractor.java PRE-CREATION
>   crunch/src/main/java/org/apache/crunch/lib/text/ExtractorStats.java PRE-CREATION
>   crunch/src/main/java/org/apache/crunch/lib/text/Extractors.java PRE-CREATION
>   crunch/src/main/java/org/apache/crunch/lib/text/Parse.java PRE-CREATION
>   crunch/src/main/java/org/apache/crunch/lib/text/ScannerFactory.java PRE-CREATION
>   crunch/src/test/java/org/apache/crunch/lib/text/ParseTest.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/8151/diff/
> 
> 
> Testing
> -------
> 
> Unit tests so far, still gathering feedback on the approach.
> 
> 
> Thanks,
> 
> Josh Wills
> 
