apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AJAY GUPTA <ajaygit...@gmail.com>
Subject Re: Sort Accumulation
Date Fri, 17 Mar 2017 05:35:24 GMT
Hi

I am currently going ahead with the implementation of Sort Accumulation
using List as mentioned in previous mail. Sorted insertion of elements will
be ensured via binary search.

Ajay


On Wed, Mar 15, 2017 at 4:49 PM, AJAY GUPTA <ajaygit158@gmail.com> wrote:

> Hi Bright,
>
> Its true that with these data structures, it could go out of memory.
> However, as an initial version, I believe its best to make use of in
> memory data structures for now. Disk based approach will require us to
> implement a B+ tree based data structure on hdfs.
>
> Also, I was studying more about TreeMultiset and have come to conclude
> that TreeMultiSet as the Accumulation type will not work due to the way
> TreeMultiSet handles duplicates (maintains count based on key, not the
> actual record itself).
> Hence, its best to make use of a list and use binary search to insert the
> record in sorted order.
>
>
> Ajay
>
>
>
> On Tue, Mar 14, 2017 at 9:39 PM, Bright Chen <bright@datatorrent.com>
> wrote:
>
>> Hi Ajay,
>> My feeling is you assume handle all tuples in memory.
>> How to handle the case if memory is not enough to hold all tuples?
>>
>> thanks
>> -Bright
>>
>> On Mon, Mar 13, 2017 at 11:00 PM, AJAY GUPTA <ajaygit158@gmail.com>
>> wrote:
>>
>> > Hi Apex Dev community,
>> >
>> > Kindly let me know if implementing this accumulation using TreeMultiSet
>> is
>> > fine.
>> >
>> >
>> > Ajay
>> >
>> > On Thu, Mar 9, 2017 at 12:24 PM, AJAY GUPTA <ajaygit158@gmail.com>
>> wrote:
>> >
>> > > Hi Bright,
>> > >
>> > > I couldnot completely understand the bucketing approach you mentioned.
>> > How
>> > > would we bucket the data considering we have no idea what the data
>> will
>> > be?
>> > >
>> > > How about using a TreeMultiSet?
>> > >
>> > >
>> > > Thanks,
>> > > Ajay
>> > >
>> > > On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen <bright@datatorrent.com>
>> > > wrote:
>> > >
>> > >> Hi Ajay,
>> > >> I think sort at getOutput() probably will get this method stuck due
>> to
>> > >> very
>> > >> high volume of computation.
>> > >> And as we still need to persistent the data, it will not very
>> helpful to
>> > >> increase the performance of processing tuple. Probably we can bucket
>> the
>> > >> data with range of value. Such as following:
>> > >> - process tuple in one window: sort data of current window in memory
>> > >> - end window: merge the sorted memory data into buckets.
>> > >>
>> > >> thanks
>> > >> Bright
>> > >>
>> > >> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <ajaygit158@gmail.com>
>> > wrote:
>> > >>
>> > >> > Hi Thomas,
>> > >> >
>> > >> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using
>> similar
>> > >> > approach for Sort will lead to an O(n^2) complexity.
>> > >> > Since we have to sort all elements, we can do it in a single sort
>> call
>> > >> in
>> > >> > getOutput().
>> > >> >
>> > >> >
>> > >> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <thw@apache.org>
>> wrote:
>> > >> >
>> > >> > > Look at the existing topN accumulation. It should be a
>> > generalization,
>> > >> > > where you don't have a limit.
>> > >> > >
>> > >> > >
>> > >> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <ajaygit158@gmail.com
>> >
>> > >> wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > I would like to propose the Sort Accumulation. The accumulation
>> > >> will be
>> > >> > > > responsible for sorting the input POJO stream. The accumulation
>> > will
>> > >> > > > require a comparator to compare and sort the input tuples.
>> Another
>> > >> > > boolean
>> > >> > > > parameter "sortDesc" will be used to decide sorting
order.
>> > >> > > >
>> > >> > > > Let me know your views.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > > Ajay
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message