From: Rahul
Date: Wed, 18 Jul 2012 15:58:47 +0530
Reply-To: crunch-dev@incubator.apache.org
Subject: Customized Sorting
Message-ID: <50068FDF.4010201@xebia.com>

I am trying to sort some data. The data consists of names, and I was trying to sort it in the following manner:

    *ORIGINAL DATA*            *SORTED DATA*
    Rahul                      shekhar
    rahul                      Sameer
    RAHUL          =====       rahul
    shekar         =====       Rahul
    hans                       RAHul
    kasper                     kasper
    Sameer                     hans

This is a somewhat customized sort: I want to order the names lexicographically first and then, possibly, take capitalization into account as well.

Initially I tried the Sort API but was unsuccessful with it. I then tried a couple of approaches, explained below.

In the first solution, I emitted each name against its starting character in a PTable and collected all the values for each key. After that I took the values and used a Comparator to sort the data in each collection.

    PTable<String, String> classifiedData = count.parallelDo(
        new NamesClassification(),
        Writables.tableOf(Writables.strings(), Writables.strings()));
    PTable<String, Collection<String>> collectedValues = classifiedData.collectValues();
    PCollection<Collection<String>> names = collectedValues.values();
    PCollection<Collection<String>> sortedNames = names.parallelDo(
        "names Sorting", new NamesSorting(),
        Writables.collections(Writables.strings()));

I am not completely convinced by the path I took. I spent some more time on the problem and found another way of doing the same thing.

In the second solution, I created my own writable type that implements WritableComparable. I also implemented the mapping functions for it, so that it can be used with the Crunch WritableTypes.

    class NamesComparable implements WritableComparable<NamesComparable> { ...... }
    MapFn<String, NamesComparable> string_to_names = .........
    MapFn<NamesComparable, String> names_to_string = .........

Then I used this type when converting the data that was read, and sorted the result:
    PCollection<String> readLines = pipeline.readTextFile(fileLoc);
    PCollection<String> lines = readLines.parallelDo(new DoFn<String, String>() {
        @Override
        public void process(String input, Emitter<String> emitter) {
            emitter.emit(input);
        }
    }, stringToNames());
    PCollection<String> sortedData = Sort.sort(lines, Order.DESCENDING);

I find both of these methods quite tricky; they feel like beating around the bush. Is there a better way of accomplishing the same thing? Have I missed something? If not, then I believe there is scope for a Sorting API that supports some customization.

regards
Rahul
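
For reference, the ordering shown in the table above (descending, ignoring case first and using capitalization only to break ties) can be written as a plain java.util.Comparator, independent of Crunch. This is only a sketch under assumptions: it needs Java 8 or later, the class name NameOrder and the method sortNames are made up for illustration, and it does not show how such a comparator would be plugged into the Crunch Sort API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Crunch: sorts names in memory only.
class NameOrder {

    // Compare case-insensitively first; on a tie fall back to the natural
    // (case-sensitive) String order; then reverse the whole thing so the
    // result is descending, as in the SORTED DATA column.
    static final Comparator<String> DESCENDING_NAME_ORDER =
        String.CASE_INSENSITIVE_ORDER
            .thenComparing(Comparator.<String>naturalOrder())
            .reversed();

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortNames(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(DESCENDING_NAME_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "Rahul", "rahul", "RAHUL", "shekar", "hans", "kasper", "Sameer");
        System.out.println(sortNames(input));
    }
}
```

The same two-step comparison could serve as the body of compareTo in a WritableComparable such as NamesComparable, which is the piece that the second solution relies on.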