Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id AED61200C3A for ; Fri, 17 Mar 2017 06:42:08 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id AD508160B8B; Fri, 17 Mar 2017 05:42:08 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id D1E72160B78 for ; Fri, 17 Mar 2017 06:42:07 +0100 (CET) Received: (qmail 89515 invoked by uid 500); 17 Mar 2017 05:42:07 -0000 Mailing-List: contact dev-help@apex.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@apex.apache.org Delivered-To: mailing list dev@apex.apache.org Received: (qmail 89500 invoked by uid 99); 17 Mar 2017 05:42:06 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 17 Mar 2017 05:42:06 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 3BC4518F164 for ; Fri, 17 Mar 2017 05:42:06 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.129 X-Spam-Level: ** X-Spam-Status: No, score=2.129 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id kAgLStcwNgfv for ; Fri, 17 Mar 2017 05:42:04 +0000 (UTC) Received: from mail-ua0-f180.google.com (mail-ua0-f180.google.com [209.85.217.180]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 887125F5C9 for ; Fri, 17 Mar 2017 05:42:03 +0000 (UTC) Received: by mail-ua0-f180.google.com with SMTP id q7so38548797uaf.2 for ; Thu, 16 Mar 2017 22:42:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=Qy2ainxf7aPJujbJejtZCWIL/KwMCya9+bOesxw7onc=; b=Qq+Ix8FwfwsEGpZN5EEN3cdjVROKw/xkEcYn+xID4xjhcUbTQ3yZNHnkHiovmBGbyZ lGvPIiOY0p2+RZViU1EdK7zYlqEdFNo6xfcv7vidpCGfaa9LGzYVqGLDO1G2CSTTu1B3 QvpH4mwqKsgCL4pMqKAmtRsTdSNCqS4GKf/9ptSZrhNB8VSiGIWzGLVROyBIaBkid/dP VawRrunqhkskNxFkKe5k83j/yj+Gar0Ny84O/dYd3CKVgSN4nTmjQG5RQ/OanxHyOHrj a3gN+F5hxwfJfQgRCyV3EASi8ErC4+TbkyF8QVSVjq+47wGGRHh7SDjSvf8SABuSRWbV SM5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=Qy2ainxf7aPJujbJejtZCWIL/KwMCya9+bOesxw7onc=; b=Xy2v3fjXUrvMXb2xBRavZ2nVHBqaCUwvp9KtB+4Tq7lhw5VzGUtkRhNmlVZJqlfJjC H32PGMc6zxDXWFYrcqxPq1KBmcEheYBg+Pmp1nOd5a9DaYgp/UsQ5a/zivBVIXifs7Cc Q7RsuYUFzek2spXsvbhaD6nL5hC0aaIYkiAgA3m7WRUFAB037cd1CYZ/2Zj3Irm3tgUn KShI2BK8D4bJ8Km+PHN0bhBFVeSSG345PNiJYHgGBylGusYFRgWvDFrYi1zr4iCi0/JS mrNa2oRIhmAQRncNqWmTHp4nKaJFUlIaWQEx5vXqVQAEQ12oPeZ9m2Jw9N6jSNAXoktY 6VSg== X-Gm-Message-State: AFeK/H3b080gCYZiQHqpuT0pJbQDV2AfDXEV11IhicZDAhRr3eXUD/OZE5OmobCXeh6qDQd4WbNT2qY+9L5cKA== X-Received: by 10.176.81.208 with SMTP id h16mr5854489uaa.75.1489728924948; Thu, 16 Mar 2017 22:35:24 -0700 (PDT) MIME-Version: 1.0 Received: by 10.103.121.131 with HTTP; Thu, 16 Mar 2017 22:35:24 -0700 (PDT) In-Reply-To: References: From: AJAY GUPTA Date: Fri, 17 Mar 2017 11:05:24 +0530 Message-ID: Subject: Re: Sort Accumulation To: dev@apex.apache.org Content-Type: multipart/alternative; boundary=94eb2c19109c6220e2054ae68bc4 archived-at: Fri, 17 Mar 2017 05:42:08 -0000 --94eb2c19109c6220e2054ae68bc4 Content-Type: text/plain; charset=UTF-8 Hi I am currently going ahead with the implementation of Sort Accumulation using List as mentioned in previous mail. Sorted insertion of elements will be ensured via binary search. Ajay On Wed, Mar 15, 2017 at 4:49 PM, AJAY GUPTA wrote: > Hi Bright, > > Its true that with these data structures, it could go out of memory. > However, as an initial version, I believe its best to make use of in > memory data structures for now. Disk based approach will require us to > implement a B+ tree based data structure on hdfs. > > Also, I was studying more about TreeMultiset and have come to conclude > that TreeMultiSet as the Accumulation type will not work due to the way > TreeMultiSet handles duplicates (maintains count based on key, not the > actual record itself). > Hence, its best to make use of a list and use binary search to insert the > record in sorted order. > > > Ajay > > > > On Tue, Mar 14, 2017 at 9:39 PM, Bright Chen > wrote: > >> Hi Ajay, >> My feeling is you assume handle all tuples in memory. >> How to handle the case if memory is not enough to hold all tuples? >> >> thanks >> -Bright >> >> On Mon, Mar 13, 2017 at 11:00 PM, AJAY GUPTA >> wrote: >> >> > Hi Apex Dev community, >> > >> > Kindly let me know if implementing this accumulation using TreeMultiSet >> is >> > fine. >> > >> > >> > Ajay >> > >> > On Thu, Mar 9, 2017 at 12:24 PM, AJAY GUPTA >> wrote: >> > >> > > Hi Bright, >> > > >> > > I couldnot completely understand the bucketing approach you mentioned. >> > How >> > > would we bucket the data considering we have no idea what the data >> will >> > be? >> > > >> > > How about using a TreeMultiSet? >> > > >> > > >> > > Thanks, >> > > Ajay >> > > >> > > On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen >> > > wrote: >> > > >> > >> Hi Ajay, >> > >> I think sort at getOutput() probably will get this method stuck due >> to >> > >> very >> > >> high volume of computation. >> > >> And as we still need to persistent the data, it will not very >> helpful to >> > >> increase the performance of processing tuple. Probably we can bucket >> the >> > >> data with range of value. Such as following: >> > >> - process tuple in one window: sort data of current window in memory >> > >> - end window: merge the sorted memory data into buckets. >> > >> >> > >> thanks >> > >> Bright >> > >> >> > >> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA >> > wrote: >> > >> >> > >> > Hi Thomas, >> > >> > >> > >> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using >> similar >> > >> > approach for Sort will lead to an O(n^2) complexity. >> > >> > Since we have to sort all elements, we can do it in a single sort >> call >> > >> in >> > >> > getOutput(). >> > >> > >> > >> > >> > >> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise >> wrote: >> > >> > >> > >> > > Look at the existing topN accumulation. It should be a >> > generalization, >> > >> > > where you don't have a limit. >> > >> > > >> > >> > > >> > >> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA > > >> > >> wrote: >> > >> > > >> > >> > > > Hi, >> > >> > > > >> > >> > > > I would like to propose the Sort Accumulation. The accumulation >> > >> will be >> > >> > > > responsible for sorting the input POJO stream. The accumulation >> > will >> > >> > > > require a comparator to compare and sort the input tuples. >> Another >> > >> > > boolean >> > >> > > > parameter "sortDesc" will be used to decide sorting order. >> > >> > > > >> > >> > > > Let me know your views. >> > >> > > > >> > >> > > > Thanks, >> > >> > > > Ajay >> > >> > > > >> > >> > > >> > >> > >> > >> >> > > >> > > >> > >> > > --94eb2c19109c6220e2054ae68bc4--