crunch-user mailing list archives

From Jay Vyas <>
Subject crunch : correct way to think about tuple abstractions for aggregations?
Date Sat, 04 Jan 2014 20:34:31 GMT
Hi crunch !

I want to process a list in crunch:

Something like this:

        PCollection<String> lines = MemPipeline.collectionOf(
                "BigPetStore,storeCode_AK,1  lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
                "BigPetStore,storeCode_AZ,1  tom,giles,Sun Dec 28 23:08:45 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_CA,1  brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
                "BigPetStore,storeCode_CA,2  angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_CA,3  angie,coleman,Tue Jan 20 06:24:23 EST 1970,7.5,cat-food",
                "BigPetStore,storeCode_CO,1  sharon,trevino,Mon Jan 12 07:52:10 EST 1970,30.1,antelope snacks",
                "BigPetStore,storeCode_CT,1  kevin,fitzpatrick,Wed Dec 10 05:24:13 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_NY,1  dale,holden,Mon Jan 12 23:02:13 EST 1970,19.75,fish-food",
                "BigPetStore,storeCode_NY,2  dale,holden,Tue Dec 30 12:29:52 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_OK,1  donnie,tucker,Sun Jan 18 04:50:26 EST 1970,7.5,cat-food");

        PCollection<String> coll = lines.parallelDo(
              "split lines into words",
              new DoFn<String, String>() {
                  public void process(String line, Emitter<String> emitter) {
                    // not sure this regex will work but you get the idea:
                    // split by tabs and commas


What is the correct abstraction in crunch for converting raw text into
tuples whose fields can be accessed by index, so that I can then group and
count on one of those fields?
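
To make the question concrete, here is roughly the logic I have in mind,
written as a plain-Java sketch with no crunch in it at all. The class and
method names (TupleCountSketch, toTuple, countByField) are made up for
illustration, and the regex assumes the two-space gap in the sample lines
stands in for a tab. What I am looking for is the idiomatic crunch
equivalent of these two steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class, plain JDK only (no crunch types involved).
class TupleCountSketch {

    // A few of the sample lines from above.
    static final List<String> SAMPLE = Arrays.asList(
            "BigPetStore,storeCode_AK,1  lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
            "BigPetStore,storeCode_AZ,1  tom,giles,Sun Dec 28 23:08:45 EST 1969,10.5,dog-food",
            "BigPetStore,storeCode_CA,1  brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
            "BigPetStore,storeCode_CA,2  angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
            "BigPetStore,storeCode_CA,3  angie,coleman,Tue Jan 20 06:24:23 EST 1970,7.5,cat-food");

    // Step 1: turn one raw line into a positional "tuple" of fields.
    // Splits on a comma, a tab, or a run of two or more whitespace characters
    // (assumption: the two spaces in the sample data were originally a tab).
    static List<String> toTuple(String line) {
        return Arrays.asList(line.split("[,\\t]|\\s{2,}"));
    }

    // Step 2: group the tuples by the field at the given index and count each group.
    // A TreeMap keeps the keys sorted so the printed result is deterministic.
    static Map<String, Long> countByField(List<String> lines, int index) {
        return lines.stream()
                .map(TupleCountSketch::toTuple)
                .collect(Collectors.groupingBy(
                        tuple -> tuple.get(index),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // Index 1 is the store code, index 7 is the product name.
        System.out.println(countByField(SAMPLE, 1));
        // {storeCode_AK=1, storeCode_AZ=1, storeCode_CA=3}
        System.out.println(countByField(SAMPLE, 7));
        // {cat-food=1, dog-food=3, organic-dog-food=1}
    }
}
```

So each line becomes eight fields, and the grouping key is just a position
in that list.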

thanks !

** FYI ** this is for the bigpetstore project. I'd like to show crunch
examples in it if I can get them working, as the API is a nice example of
a lower-level mapreduce paradigm which is more java friendly.

See and for details..
