crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: Customized Sorting
Date Wed, 18 Jul 2012 12:51:13 GMT
Hey Rahul,

I don't quite follow-- why did the Sort API work in the second case, but
not the first? org.apache.hadoop.io.Text, the underlying Writable type for
Strings, is also WritableComparable. Did NamesComparable do something that
Text does not?

J

On Wed, Jul 18, 2012 at 3:28 AM, Rahul <rsharma@xebia.com> wrote:

> I am trying to  sort some data. The data had names and I was try to sort
> in the following manner.
>
> *ORIGINAL DATA* *  SORTED DATA*
> /Rahul                                               shekhar/
> /rahul                                                Sameer/
> /RAHUL              =====                     rahul/
> /shekar               =====                     Rahul/
> /hans                                                 RAHul/
> /kasper                                              kasper/
> /Sameer                                             hans/
> /
> /
> This was a bit customized Sorting where I wanted to first sort them in
> lexicographic manner and then maybe take capitalization also into
> consideration.
> Initially I was trying with the Sort API but was unsuccessful with that.
> But then I tried in a couple of ways as explained below :
>
> In the first solution, I outputted each of the names them against their
> starting character in a /Ptable/. Then collected all the values for a
> particular key.
> After that I selected all the values and then used a /Comparator /to sort
> data in each of the collection.
>
>  /PTable<String, String> classifiedData = count.parallelDo( new
> NamesClassification(),**Writables.tableOf(Writables.**
> strings(),Writables.strings())**);
>  PTable<String, Collection<String> collectedValues =
> classifiedData.collectValues()**;
>  PCollection<Collection<String> names = collectedValues.values();
>  PCollection<Collection<String>**> sortedNames = names.parallelDo("names
> Sorting",new NamesSorting(), Writables.collections(**
> Writables.strings()));/
>
>
> Not completely convinced with the path I took. I spend some time of
> solving it and found another way of doing same.
> In the second solution, I created my own writable type that implemented
> WritableComparable. Also implemented all the mapping functions for the
> same, so that it can be used with crunch WritableTypes.
>
> /class NamesComparable implements WritableComparable<**NamesComparable>{
> ......}
>
> MapFn<String,//**NamesComparable//> string_to_names =.........
> MapFn<//NamesComparable,**String//> names_to_string =........./
>
> /
> /
> Then  I used this while converting the read data into it and then sorting
> it.
>
>     PCollection<String> readLines = pipeline.readTextFile(fileLoc)**;
>     PCollection<String> lines = readLines.parallelDo(new DoFn<String,
> String>() {
>       @Override
>      public void process(String input, Emitter<String> emitter) {
> emitter.emit(input);}},
>      *stringToNames*());
>
>     PCollection<String> sortedData = Sort.sort(lines, Order.DESCENDING);
>
>
> I found of these methods as quite tricky that give a feeling of going
> around a bush. Is there a better way of accomplishing the same ? Have I
> missed some aspects ?
> If not, then  I believe there is scope of having an Sorting API that can
> have support of some customizations.
>
> regards
> Rahul
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message