incubator-crunch-user mailing list archives

From Mike Barretta <mike.barre...@gmail.com>
Subject inconsistent grouping of map jobs
Date Wed, 20 Feb 2013 14:02:04 GMT
I have a number of "tables" in HDFS, represented as folders containing
SequenceFiles of serialized objects.  I'm trying to write a tool that will
reassemble these objects and output each of the tables into its own CSV
file.

The wrinkle is that some of the "tables" hold objects with a list of
related child objects.  Those child objects should get split out into
tables of their own.

Here is essentially what my loop looks like (in Groovy):

//loop through each top-level table
paths.each { path ->
    def source = From.sequenceFile(
        new Path(path),
        Writables.writables(ColumnKey.class),
        Writables.writables(ColumnDataArrayWritable.class)
    )

    //read it in
    def data = crunchPipeline.read(source)

    //write it out
    crunchPipeline.write(
        data.parallelDo(new MyDoFn(path), Writables.strings()),
        To.textFile("$path/csv")
    )

    //handle children using same PTable as parent
    if (path == TABLE_MESSAGE_DATA) {
        messageChildPaths.each { childPath ->
            crunchPipeline.write(
                data.parallelDo(new MyDoFn(childPath), Writables.strings()),
                To.textFile("$childPath/csv")
            )
        }
    }
}
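
For what it's worth, here is a variation I'm considering, on the
(unverified) assumption that Pipeline.run() executes everything planned
so far and so acts as a barrier between tables:

//same reads and writes as above, but forcing execution per table
paths.each { path ->
    //... read, write parent, write children as above ...

    //assumption: run() plans and executes only the writes issued so far,
    //so each table and its children are grouped in isolation
    crunchPipeline.run()
}

I haven't tried this yet, so I don't know whether it actually changes
how the jobs get grouped.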

The parent and child writes generally get grouped into a single map job,
but most of the time only some of the child tables are included, which
is to say, some child tables never get output.  There doesn't seem to be
a pattern: sometimes all of them are included, sometimes only 1 or 2.

Am I missing something? Is there a way to specify which jobs should be
combined?

Thanks,
Mike
