asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Carey <dtab...@gmail.com>
Subject Re: Two query questions on time bin function
Date Tue, 23 Feb 2016 00:39:53 GMT
How about trying...

..... avg(for $irrc in $i.RR_clipped return $irrc) .....

That could make the compiler happier, potentially.
(Right now it isn't "feeling the love" in terms of the type of avg's 
argument - not sure why count worked.)

Cheers,
Mike


On 2/22/16 2:03 PM, Ian Maxon wrote:
> Regarding the latest query about computing some summary statistics from the
> min-maxed data, Yiran and I just finished meeting and we were able to come
> up with a work around.
>
> The query was as follows:
> use dataverse mt16
>
> for $i in dataset HRM_binned_clipped return {
> "row_id": $i.row_id,
> "sid": $i.sid,
> "gdate": $i.gdate,
> "gday": $i.gday,
> "timebin": $i.timebin,
> "stdv_RR_clipped": avg($i.RR_clipped)
> }
>
> However, it would fail like this:
>
> Type of argument in function-call: asterix:avg, Args:[function-call:
> asterix:field-access-by-name, Args:[%0->$$0, AString: {RR_clipped}]] should
> be a collection type instead of ANY [AlgebricksException]
>
>
> Because HRM_binned_clipped was open and there is some sort of bug in the
> avg() function for this. count() works fine.
> The work around, is to just copy everything into a closed dataset, where
> RR_clipped is of type [double]. The query then works.
>
>
> Again though this is kind of a work-around, to a work around. The original
> query:
>
>
> declare function minmax($x){
> let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5)
> for $y in $x
> where $y < (2*$stdv) + avg($x)
> and $y > avg($x) - (2*$stdv)
> return $y
> }
>
> for $i in dataset HRM
> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin :=
> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
> return {
> "sid": $sid,
> "gdate": $gdate,
> "gday": $gday,
> "timebin": $timebin,
> "stdv": (avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR *
> $ii.RR) - avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR)
> * avg(for $ii in minmax(for $jj in $i return $jj.RR) return $ii.RR))^(0.5)
> }
>
> Expresses the same thing without any intermediate datasets. This query
> fails in compilation (only with the avgs added). I need to get the stack
> from the version that Yiran is running on though, I can't reproduce it on
> master; it fails in a different way.
>
>
> -Ian
>
>
> On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <dtabass@gmail.com> wrote:
>
>> ARGH!!!!  This is what we would like you to *not* have to do.  Sorry...
>> Our aim is to be the Big Data antidote....
>>
>> @Yingyi:  Maybe you could take a quick peek at the query issue and
>> see if there is any low-hanging hope there?
>>
>> @Yiran:  How big are your windows, typically?  (Number of data points.)
>>
>> Cheers,
>> Mike
>>
>>
>> On 2/21/16 2:42 PM, Yiran Wang wrote:
>>
>>> Thank you Mike for your update and suggestions! And thank you Ian again
>>> for working with me.
>>>
>>> A little update from my end:
>>>
>>> I have been working on query (1) over the weekend for a work-around. I
>>> tried to simultaneously calculate the stdev on the new copy of the list of
>>> values with outliers removed, which in nature is the same as the query (2).
>>> So I ran into the same problem that the query did not compile.
>>>
>>> What I did was to export the entire dataset with the outliers removed
>>> into Excel and calculate the stdev in Excel. However, the entire dataset is
>>> now 363,466 x 200+ in dimension. Though they do not exceed the row x col
>>> limit in Excel, the memory of my computer is not big enough to do anything
>>> useful without crashing. So I've been breaking the dataset into smaller
>>> parts and working on each separately.
>>>
>>> Yiran
>>>
>>>
>>>
>>>
>>> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <dtabass@gmail.com <mailto:
>>> dtabass@gmail.com>> wrote:
>>>
>>>      Ian,
>>>
>>>      Thanks working with Yiran on this!  I think there is "good" and
>>>      bad news w.r.t these queries:
>>>
>>>       - The bad news is that they go beyond what we are likely to
>>>      optimize at all well at present,
>>>          as they go beyond what typical DB aggregate functions like
>>>      min/max/avg/count/sum do.
>>>          (I would try forming the groups and then doing these things on
>>>      the groups, but saying
>>>          them in AQL will be tricky, and may lead to queries that hit
>>>      edge cases in the optimizer.
>>>          For some of these my thought was to try using a positional
>>>      variable within a group...?)
>>>
>>>       - The "good" news (only for AsterixDB) is that this is exactly
>>>      the sort of inspiration that we
>>>          are looking for in terms of understanding how to better for
>>>      query-based analytics in real
>>>          use cases (and this is a very real one!).
>>>
>>>      To quote a short paper I reviewed just this AM on SQL queries kind
>>>      of like these:  "Percentage
>>>      queries are more complex than their conventional counterparts and
>>>      introduce new challenges
>>>      for optimization."  (The paper didn't have an applicable solution
>>>      for us, sadly.)
>>>
>>>      A more general facility that I wish we could offer was to do
>>>      grouping in AsterixDB but then
>>>      have the ability to pass a group to (e.g.) R and then get results
>>>      back for the group.  When
>>>      groups are small-ish (like Yiran's windows) that would be pretty
>>>      cool - then one could do
>>>      all sorts of advanced things per group.
>>>
>>>      Cheers,
>>>      Mike
>>>
>>>      On 2/21/16 12:35 AM, Ian Maxon wrote:
>>>
>>>>      Yiran and I came up with possible answers for these...
>>>>      For 1) , a function could be used that looks something like this:
>>>>
>>>>      declare function minmax($x){
>>>>      let $stdv := (avg(for $z in $x return $z*$z) - avg($x) *
>>>> avg($x))^(0.5)
>>>>      for $y in $x
>>>>      where $y < (2*$stdv) + avg($x)
>>>>      and $y > avg($x) - (2*$stdv)
>>>>      return $y
>>>>      }
>>>>
>>>>
>>>>
>>>>      And then applied to return a new copy of the list of values,
>>>> removing ones
>>>>      that are outside of 2 stdev.
>>>>
>>>>      For 2), we also did come up with a potential solution ,but the query
>>>> fails
>>>>      to compile (Filed ashttps://
>>>> issues.apache.org/jira/browse/ASTERIXDB-1308  )
>>>>
>>>>
>>>>      Any thoughts on these queries would be welcome :) 1) especially seems
>>>>      inefficient to do as a function.
>>>>
>>>>      - Ian
>>>>
>>>>      On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang<wyr4137@gmail.com>
>>>> <mailto:wyr4137@gmail.com>  wrote:
>>>>
>>>>      Hi Asterix team,
>>>>>      I have two queries I'm struggling with. I'm hoping you could
>>>>> provide a
>>>>>      direction for me. Thanks in advance!
>>>>>
>>>>>      Here is what the data structure looks like:
>>>>>
>>>>>      create type HRMType as closed {
>>>>>
>>>>>         row_id: int32,
>>>>>
>>>>>         sid: int32,
>>>>>
>>>>>         date: date,
>>>>>
>>>>>         day: int32,
>>>>>
>>>>>         time: time,
>>>>>
>>>>>         bpm: int32,
>>>>>
>>>>>         RR: float
>>>>>
>>>>>      };
>>>>>
>>>>>      create dataset HRM (HRMType)
>>>>>
>>>>>      primary key row_id;
>>>>>
>>>>>
>>>>>      Previously I have used the time bin function to calculate the
>>>>> standard
>>>>>      deviation of bpm for each time bin:
>>>>>
>>>>>      for $i in dataset HRM
>>>>>
>>>>>      group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day,
>>>>> $timebin :=
>>>>>      interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M"))
>>>>> with $i
>>>>>
>>>>>      return {
>>>>>
>>>>>      "sid": $sid,
>>>>>
>>>>>      "gdate": $gdate,
>>>>>
>>>>>      "gday": $gday,
>>>>>
>>>>>      "timebin": $timebin,
>>>>>
>>>>>      "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii
in
>>>>> $i
>>>>>      return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)};
>>>>>
>>>>>      ​Now I have two things I am hoping to do but need help with:
>>>>>
>>>>>      1. For each 1-min time bin, remove the bpm values that are above
>>>>> the top
>>>>>      5% or below the bottom 5%. I thought about using the min/max
>>>>> function for a
>>>>>      few times to achieve this, but realized that it was not a good idea
>>>>> because
>>>>>      in each time bin, the number of instances are not always the same.
>>>>> So for
>>>>>      each 1-min time bin, we do need to calculate the 5% and 95%
>>>>> threshold, and
>>>>>      remove instances accordingly, which I don't know how to do.
>>>>>
>>>>>      2. After removing the outliers of bpm for each 1-min time bin,
>>>>> calculate a
>>>>>      median absolute deviation (MAD) for each 1-min time bin (as another
>>>>> measure
>>>>>      of variation besides the standard deviation). MAD =
>>>>>      median(abs(x-median(x)). I'm not sure how to write a query to do
>>>>> the median
>>>>>      function in Asterix.
>>>>>
>>>>>      Thank you so much in advance. Let me know if my questions are clear.
>>>>>
>>>>>      Yiran
>>>>>
>>>>>      --
>>>>>      Best,
>>>>>      Yiran
>>>>>
>>>>>      --
>>>>>      You received this message because you are subscribed to the Google
>>>>> Groups
>>>>>      "asterixdb-dev" group.
>>>>>      To unsubscribe from this group and stop receiving emails from it,
>>>>> send an
>>>>>      email toasterixdb-dev+unsubscribe@googlegroups.com
>>>>>      <mailto:asterixdb-dev+unsubscribe@googlegroups.com>.
>>>>>      For more options, visithttps://groups.google.com/d/optout.
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Best,
>>> Yiran
>>>
>>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message