Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@asterixdb.incubator.apache.org
Subject: Re: Two query questions on time bin function
To: Yiran Wang <wyr4137@gmail.com>
References: 
 <CAMUvO-NiBfTftTWMC7JdTO=CZ3HV2dbxv2TMnnpwD4BrHmWycw@mail.gmail.com>
 <CAN_YF5xL95VLn84cQBRNdYOW-k5_botNiuLjvsHmDVPx09T7wg@mail.gmail.com>
 <56CA35EC.3060307@gmail.com>
 <CAMUvO-MutzkN5R1Ba3BAGriYz=_9B8edOmPLFEi9=FGR7yf+uQ@mail.gmail.com>
 <56CA406B.3080704@gmail.com>
 <CAMUvO-OBg5cwow9fU9WVabjzUpK-aAoK8VkCLYcihfXV-ywntQ@mail.gmail.com>
Cc: dev@asterixdb.incubator.apache.org, Gloria Mark <gmark@uci.edu>,
 Vassilis Tsotras <tsotras@cs.ucr.edu>
From: Mike Carey <dtabass@gmail.com>
Message-ID: <56CBA89D.5040705@gmail.com>
Date: Mon, 22 Feb 2016 16:32:29 -0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:38.0)
 Gecko/20100101 Thunderbird/38.5.1
MIME-Version: 1.0
In-Reply-To: 
 <CAMUvO-OBg5cwow9fU9WVabjzUpK-aAoK8VkCLYcihfXV-ywntQ@mail.gmail.com>
Content-Type: multipart/alternative;
 boundary="------------090204090600060709060304"

--------------090204090600060709060304
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Approaching Big Data!  :-)  Interesting....!
Q: Is each post-binning row a bin, meaning there are only ~86 rows per bin?
(Am I interpreting that correctly?)
If so, that's good news; materializing any one given bin shouldn't be a 
problem for our runtime, so maybe we can indeed get this to work in the 
short term.

Sorry for the hassles w/this.....!
Cheers,
Mike


On 2/22/16 7:09 AM, Yiran Wang wrote:
> Mike,
>
> The original dataset has 31132597 rows of records. After binning it 
> into 1-min time bin dataset, it has 363466 rows of records.
>
> Thanks,
> Yiran
>
> On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <dtabass@gmail.com 
> <mailto:dtabass@gmail.com>> wrote:
>
>     ARGH!!!! This is what we would like you to *not* have to do. Sorry...
>     Our aim is to be the Big Data antidote....
>
>     @Yingyi:  Maybe you could take a quick peek at the query issue and
>     see if there is any low-hanging hope there?
>
>     @Yiran:  How big are your windows, typically?  (Number of data
>     points.)
>
>     Cheers,
>     Mike
>
>
>     On 2/21/16 2:42 PM, Yiran Wang wrote:
>>     Thank you Mike for your update and suggestions! And thank you Ian
>>     again for working with me.
>>
>>     A little update from my end:
>>
>>     I have been working on query (1) over the weekend for a
>>     work-around. I tried to simultaneously calculate the stdev on the
>>     new copy of the list of values with outliers removed, which in
>>     nature is the same as the query (2). So I ran into the same
>>     problem that the query did not compile.
>>
>>     What I did was to export the entire dataset with the outliers
>>     removed into Excel and calculate the stdev in Excel. However, the
>>     entire dataset is now 363,466 x 200+ in dimension. Though they do
>>     not exceed the row x col limit in Excel, the memory of my
>>     computer is not big enough to do anything useful without
>>     crashing. So I've been breaking the dataset into smaller parts
>>     and working on each separately.
>>
>>     Yiran
>>
>>
>>
>>     On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <dtabass@gmail.com
>>     <mailto:dtabass@gmail.com>> wrote:
>>
>>         Ian,
>>
>>         Thanks working with Yiran on this!  I think there is "good"
>>         and bad news w.r.t these queries:
>>
>>          - The bad news is that they go beyond what we are likely to
>>         optimize at all well at present,
>>             as they go beyond what typical DB aggregate functions
>>         like min/max/avg/count/sum do.
>>             (I would try forming the groups and then doing these
>>         things on the groups, but saying
>>             them in AQL will be tricky, and may lead to queries that
>>         hit edge cases in the optimizer.
>>             For some of these my thought was to try using a
>>         positional variable within a group...?)
>>
>>          - The "good" news (only for AsterixDB) is that this is
>>         exactly the sort of inspiration that we
>>             are looking for in terms of understanding how to better
>>         for query-based analytics in real
>>             use cases (and this is a very real one!).
>>
>>         To quote a short paper I reviewed just this AM on SQL queries
>>         kind of like these:  "Percentage
>>         queries are more complex than their conventional counterparts
>>         and introduce new challenges
>>         for optimization."  (The paper didn't have an applicable
>>         solution for us, sadly.)
>>
>>         A more general facility that I wish we could offer was to do
>>         grouping in AsterixDB but then
>>         have the ability to pass a group to (e.g.) R and then get
>>         results back for the group.  When
>>         groups are small-ish (like Yiran's windows) that would be
>>         pretty cool - then one could do
>>         all sorts of advanced things per group.
>>
>>         Cheers,
>>         Mike
>>
>>         On 2/21/16 12:35 AM, Ian Maxon wrote:
>>>         Yiran and I came up with possible answers for these...
>>>         For 1) , a function could be used that looks something like this:
>>>
>>>         declare function minmax($x){
>>>         let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
>>>         for $y in $x
>>>         where $y < (2*$stdv) + avg($x)
>>>         and $y > avg($x) - (2*$stdv)
>>>         return $y
>>>         }
>>>
>>>
>>>
>>>         And then applied to return a new copy of the list of values, removing ones
>>>         that are outside of 2 stdev.
>>>
>>>         For 2), we also did come up with a potential solution ,but the query fails
>>>         to compile (Filed ashttps://issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>>
>>>
>>>         Any thoughts on these queries would be welcome :) 1) especially seems
>>>         inefficient to do as a function.
>>>
>>>         - Ian
>>>
>>>         On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<wyr4137@gmail.com> <mailto:wyr4137@gmail.com>  wrote:
>>>
>>>>         Hi Asterix team,
>>>>
>>>>         I have two queries I'm struggling with. I'm hoping you could provide a
>>>>         direction for me. Thanks in advance!
>>>>
>>>>         Here is what the data structure looks like:
>>>>
>>>>         create type HRMType as closed {
>>>>
>>>>            row_id: int32,
>>>>
>>>>            sid: int32,
>>>>
>>>>            date: date,
>>>>
>>>>            day: int32,
>>>>
>>>>            time: time,
>>>>
>>>>            bpm: int32,
>>>>
>>>>            RR: float
>>>>
>>>>         };
>>>>
>>>>         create dataset HRM (HRMType)
>>>>
>>>>         primary key row_id;
>>>>
>>>>
>>>>         Previously I have used the time bin function to calculate the standard
>>>>         deviation of bpm for each time bin:
>>>>
>>>>         for $i in dataset HRM
>>>>
>>>>         group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
>>>>         interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
>>>>
>>>>         return {
>>>>
>>>>         "sid": $sid,
>>>>
>>>>         "gdate": $gdate,
>>>>
>>>>         "gday": $gday,
>>>>
>>>>         "timebin": $timebin,
>>>>
>>>>         "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
>>>>         return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>
>>>>         ​Now I have two things I am hoping to do but need help with:
>>>>
>>>>         1. For each 1-min time bin, remove the bpm values that are above the top
>>>>         5% or below the bottom 5%. I thought about using the min/max function for a
>>>>         few times to achieve this, but realized that it was not a good idea because
>>>>         in each time bin, the number of instances are not always the same. So for
>>>>         each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
>>>>         remove instances accordingly, which I don't know how to do.
>>>>
>>>>         2. After removing the outliers of bpm for each 1-min time bin, calculate a
>>>>         median absolute deviation (MAD) for each 1-min time bin (as another measure
>>>>         of variation besides the standard deviation). MAD =
>>>>         median(abs(x-median(x)). I'm not sure how to write a query to do the median
>>>>         function in Asterix.
>>>>
>>>>         Thank you so much in advance. Let me know if my questions are clear.
>>>>
>>>>         Yiran
>>>>
>>>>         --
>>>>         Best,
>>>>         Yiran
>>>>
>>>>         --
>>>>         You received this message because you are subscribed to the Google Groups
>>>>         "asterixdb-dev" group.
>>>>         To unsubscribe from this group and stop receiving emails from it, send an
>>>>         email toasterixdb-dev+unsubscribe@googlegroups.com
>>>>         <mailto:asterixdb-dev+unsubscribe@googlegroups.com>.
>>>>         For more options, visithttps://groups.google.com/d/optout.
>>>>
>>
>>
>>
>>
>>     -- 
>>     Best,
>>     Yiran
>
>
>
>
> -- 
> Best,
> Yiran


--------------090204090600060709060304--