incubator-crunch-dev mailing list archives

From Rahul <rsha...@xebia.com>
Subject Customized Sorting
Date Wed, 18 Jul 2012 10:28:47 GMT
I am trying to sort some data. The data contains names, and I was trying
to sort them in the following manner.

ORIGINAL DATA        SORTED DATA
Rahul                shekhar
rahul                Sameer
RAHUL                rahul
shekar               Rahul
hans                 RAHul
kasper               kasper
Sameer               hans

This was a somewhat customized sort: I wanted to order the names
lexicographically first and then also take capitalization into
consideration.
Initially I tried the Sort API but was unsuccessful with it. I then
tried a couple of other approaches, explained below:

In the first solution, I emitted each name against its starting
character in a PTable, then collected all the values for each key.
After that I took the collected values and used a Comparator to sort
the data in each collection.

  PTable<String, String> classifiedData = count.parallelDo(new NamesClassification(),
      Writables.tableOf(Writables.strings(), Writables.strings()));
  PTable<String, Collection<String>> collectedValues = classifiedData.collectValues();
  PCollection<Collection<String>> names = collectedValues.values();
  PCollection<Collection<String>> sortedNames = names.parallelDo("names Sorting",
      new NamesSorting(), Writables.collections(Writables.strings()));
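The Comparator used inside NamesSorting is not shown in this message. As a minimal sketch in plain Java (no Crunch involved), one comparator that would give the kind of ordering in the table above compares names ignoring case first and breaks ties on capitalization, then reverses the result. The names NameOrder, CASE_AWARE and sortDescending are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NameOrder {
    // Ascending order: compare ignoring case first, then break ties with the
    // plain String order, in which upper-case letters sort before lower-case.
    static final Comparator<String> CASE_AWARE =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.naturalOrder());

    // Returns a new list sorted in descending CASE_AWARE order;
    // the input list is left untouched.
    static List<String> sortDescending(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(CASE_AWARE.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "Rahul", "rahul", "RAHUL", "shekar", "hans", "kasper", "Sameer");
        System.out.println(sortDescending(names));
    }
}
```

With the original data column as input, this prints [shekar, Sameer, rahul, Rahul, RAHUL, kasper, hans]. The same comparator could be applied to each collected group inside a DoFn, but that wiring is an assumption and not something this sketch shows.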


I was not completely convinced by this approach. I spent some time on
the problem and found another way of doing the same thing.
In the second solution, I created my own writable type that implements
WritableComparable. I also implemented the mapping functions for it,
so that it can be used with Crunch WritableTypes.

  class NamesComparable implements WritableComparable<NamesComparable> { ...... }

  MapFn<String, NamesComparable> string_to_names = .........
  MapFn<NamesComparable, String> names_to_string = .........
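The body of NamesComparable is elided above. The sketch below is a hypothetical stand-in, NameKey, showing what such a key type might contain. To stay runnable without Hadoop on the classpath it does not implement org.apache.hadoop.io.WritableComparable; it only declares write and readFields with the same signatures as Hadoop's Writable interface, plus compareTo. The field layout and the roundTrip helper are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for NamesComparable; not the class from the post.
public class NameKey implements Comparable<NameKey> {
    private String name = "";

    public NameKey() { }                        // no-arg constructor for deserialization
    public NameKey(String name) { this.name = name; }

    public String get() { return name; }

    // Same signature as Writable.write(DataOutput).
    public void write(DataOutput out) throws IOException { out.writeUTF(name); }

    // Same signature as Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException { name = in.readUTF(); }

    // Ascending order: letters first (ignoring case), capitalization as tie-break.
    @Override
    public int compareTo(NameKey other) {
        int byLetters = name.compareToIgnoreCase(other.name);
        return byLetters != 0 ? byLetters : name.compareTo(other.name);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NameKey && name.equals(((NameKey) o).name);
    }

    @Override
    public int hashCode() { return name.hashCode(); }

    // Serializes the key to bytes and reads it back into a fresh instance.
    static NameKey roundTrip(NameKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            NameKey copy = new NameKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NameKey a = roundTrip(new NameKey("Rahul"));
        NameKey b = new NameKey("rahul");
        System.out.println(a.get() + " vs " + b.get() + ": " + Integer.signum(a.compareTo(b)));
    }
}
```

Because compareTo here defines an ascending order, a descending result would come from the sort direction requested by the caller rather than from the key itself.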

I then used this type when converting the data read from the file, and
sorted the result.

     PCollection<String> readLines = pipeline.readTextFile(fileLoc);
     PCollection<String> lines = readLines.parallelDo(new DoFn<String, String>() {
       @Override
       public void process(String input, Emitter<String> emitter) {
         emitter.emit(input);
       }
     }, stringToNames());

     PCollection<String> sortedData = Sort.sort(lines, Order.DESCENDING);


I found both of these methods quite tricky; they feel like a roundabout
way of solving the problem. Is there a better way of accomplishing the
same thing? Have I missed something?
If not, then I believe there is scope for a Sorting API that supports
some customization.

regards
Rahul

