asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Carey <dtab...@gmail.com>
Subject Re: Two query questions on time bin function
Date Sun, 21 Feb 2016 22:10:52 GMT
Ian,

Thanks working with Yiran on this!  I think there is "good" and bad news 
w.r.t these queries:

  - The bad news is that they go beyond what we are likely to optimize 
at all well at present,
     as they go beyond what typical DB aggregate functions like 
min/max/avg/count/sum do.
     (I would try forming the groups and then doing these things on the 
groups, but saying
     them in AQL will be tricky, and may lead to queries that hit edge 
cases in the optimizer.
     For some of these my thought was to try using a positional variable 
within a group...?)

  - The "good" news (only for AsterixDB) is that this is exactly the 
sort of inspiration that we
     are looking for in terms of understanding how to better for 
query-based analytics in real
     use cases (and this is a very real one!).

To quote a short paper I reviewed just this AM on SQL queries kind of 
like these:  "Percentage
queries are more complex than their conventional counterparts and 
introduce new challenges
for optimization."  (The paper didn't have an applicable solution for 
us, sadly.)

A more general facility that I wish we could offer was to do grouping in 
AsterixDB but then
have the ability to pass a group to (e.g.) R and then get results back 
for the group.  When
groups are small-ish (like Yiran's windows) that would be pretty cool - 
then one could do
all sorts of advanced things per group.

Cheers,
Mike

On 2/21/16 12:35 AM, Ian Maxon wrote:
> Yiran and I came up with possible answers for these...
> For 1) , a function could be used that looks something like this:
>
> declare function minmax($x){
> let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
> for $y in $x
> where $y < (2*$stdv) + avg($x)
> and $y > avg($x) - (2*$stdv)
> return $y
> }
>
>
>
> And then applied to return a new copy of the list of values, removing ones
> that are outside of 2 stdev.
>
> For 2), we also did come up with a potential solution ,but the query fails
> to compile (Filed as https://issues.apache.org/jira/browse/ASTERIXDB-1308 )
>
>
> Any thoughts on these queries would be welcome :) 1) especially seems
> inefficient to do as a function.
>
> - Ian
>
> On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang <wyr4137@gmail.com> wrote:
>
>> Hi Asterix team,
>>
>> I have two queries I'm struggling with. I'm hoping you could provide a
>> direction for me. Thanks in advance!
>>
>> Here is what the data structure looks like:
>>
>> create type HRMType as closed {
>>
>>    row_id: int32,
>>
>>    sid: int32,
>>
>>    date: date,
>>
>>    day: int32,
>>
>>    time: time,
>>
>>    bpm: int32,
>>
>>    RR: float
>>
>> };
>>
>> create dataset HRM (HRMType)
>>
>> primary key row_id;
>>
>>
>> Previously I have used the time bin function to calculate the standard
>> deviation of bpm for each time bin:
>>
>> for $i in dataset HRM
>>
>> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
>> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
>>
>> return {
>>
>> "sid": $sid,
>>
>> "gdate": $gdate,
>>
>> "gday": $gday,
>>
>> "timebin": $timebin,
>>
>> "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i
>> return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>
>> ​Now I have two things I am hoping to do but need help with:
>>
>> 1. For each 1-min time bin, remove the bpm values that are above the top
>> 5% or below the bottom 5%. I thought about using the min/max function for a
>> few times to achieve this, but realized that it was not a good idea because
>> in each time bin, the number of instances are not always the same. So for
>> each 1-min time bin, we do need to calculate the 5% and 95% threshold, and
>> remove instances accordingly, which I don't know how to do.
>>
>> 2. After removing the outliers of bpm for each 1-min time bin, calculate a
>> median absolute deviation (MAD) for each 1-min time bin (as another measure
>> of variation besides the standard deviation). MAD =
>> median(abs(x-median(x)). I'm not sure how to write a query to do the median
>> function in Asterix.
>>
>> Thank you so much in advance. Let me know if my questions are clear.
>>
>> Yiran
>>
>> --
>> Best,
>> Yiran
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "asterixdb-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to asterixdb-dev+unsubscribe@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message