Return-Path: X-Original-To: apmail-asterixdb-dev-archive@minotaur.apache.org Delivered-To: apmail-asterixdb-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6C8F518627 for ; Tue, 23 Feb 2016 00:32:36 +0000 (UTC) Received: (qmail 58705 invoked by uid 500); 23 Feb 2016 00:32:36 -0000 Delivered-To: apmail-asterixdb-dev-archive@asterixdb.apache.org Received: (qmail 58612 invoked by uid 500); 23 Feb 2016 00:32:36 -0000 Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@asterixdb.incubator.apache.org Delivered-To: mailing list dev@asterixdb.incubator.apache.org Received: (qmail 58600 invoked by uid 99); 23 Feb 2016 00:32:35 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 23 Feb 2016 00:32:35 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 75D9B1A420A for ; Tue, 23 Feb 2016 00:32:35 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.198 X-Spam-Level: ** X-Spam-Status: No, score=2.198 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_REPLY=1, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id QVf-fvEdlRMZ for ; Tue, 23 Feb 2016 00:32:32 +0000 (UTC) Received: from mail-pf0-f175.google.com (mail-pf0-f175.google.com [209.85.192.175]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with ESMTPS id 174F55FAEC for ; Tue, 23 Feb 2016 00:32:32 +0000 (UTC) Received: by mail-pf0-f175.google.com with SMTP id x65so101401280pfb.1 for ; Mon, 22 Feb 2016 16:32:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:to:references:cc:from:message-id:date:user-agent :mime-version:in-reply-to:content-type; bh=ux6BQoVUmkZSgnJQYJ3Bkn8UYDV48uDmfYX3rr9B/CU=; b=Fm/RH9VLaH0e+k26HlOadkLWO93riW7tdPQyaADPSrv+33zw6il3ajF+LAXEjozzfB LdTzrK8U8XC4rIxuqT+B+NFrwtx2q7MbpSodt5ncJkURUZHa0Uotis6QniRowmSeIgla Pmg2SbuZOLU363vxYVeA3CDOU3Aaf75rZi3PgYZPxPJrEpivdnvNtToENHNRABgUCwqB uRllyusdtBGqbtozX++hhmp1513uftGy7aSCC0Z4BXvgX5PQ1YvW/FdWt3a+8Fu/mNKD 0laa422xOWM8U65LCAjcKH+6NYkZpfac39YqrcZgQGS4nMcKBkp7J7Flccjzw/gReh8w uURw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to:content-type; bh=ux6BQoVUmkZSgnJQYJ3Bkn8UYDV48uDmfYX3rr9B/CU=; b=RDF85mdChY80tJ8eKSRYKC217nAhsaKfr8/kNulE8SKN8OLAJBeoq65sTmGFuNbSha uGKmwJdTnBCiAOs2Q/UxYLo9NjfrljjWEPoRs3sTpbWjLfBA86zMjjKadXnw5P63Ab2l DzaUDNG/9KY4yPE9gItPZHMk2ThQKU0DkoiU76ZJf4mopGpOClNlOqhNcwm3c5PzKRo3 BisvVyq0Pkzm8qxALJW5ve0GXFK0kF3GPhwwUQkSopGm/obmllhiC5+53KBuW/sccdoQ kZzX+tU9TcWu9fhM9D6QmEL8aMw7ZGsx24p4hVVqFD6BY9DvJFgWTurcWYx3V9srChHj Fe9A== X-Gm-Message-State: AG10YORSqRCzv5m5E82XyMZIvxiAZTu2IAmVTjzMPggmk8rN5VZCnaEzbuCpRqIGiRr5fQ== X-Received: by 10.98.13.216 with SMTP id 85mr41657755pfn.143.1456187551159; Mon, 22 Feb 2016 16:32:31 -0800 (PST) Received: from dhcp-053184.ics.uci.edu (dhcp-053184.ics.uci.edu. [128.195.53.184]) by smtp.googlemail.com with ESMTPSA id bx1sm39643139pab.33.2016.02.22.16.32.29 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 22 Feb 2016 16:32:30 -0800 (PST) Subject: Re: Two query questions on time bin function To: Yiran Wang References: <56CA35EC.3060307@gmail.com> <56CA406B.3080704@gmail.com> Cc: dev@asterixdb.incubator.apache.org, Gloria Mark , Vassilis Tsotras From: Mike Carey Message-ID: <56CBA89D.5040705@gmail.com> Date: Mon, 22 Feb 2016 16:32:29 -0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:38.0) Gecko/20100101 Thunderbird/38.5.1 MIME-Version: 1.0 In-Reply-To: Content-Type: multipart/alternative; boundary="------------090204090600060709060304" --------------090204090600060709060304 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Approaching Big Data! :-) Interesting....! Q: Is each post-binning row a bin, meaning there are only ~86 rows per bin? (Am I interpreting that correctly?) If so, that's good news; materializing any one given bin shouldn't be a problem for our runtime, so maybe we can indeed get this to work in the short term. Sorry for the hassles w/this.....! Cheers, Mike On 2/22/16 7:09 AM, Yiran Wang wrote: > Mike, > > The original dataset has 31132597 rows of records. After binning it > into 1-min time bin dataset, it has 363466 rows of records. > > Thanks, > Yiran > > On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey > wrote: > > ARGH!!!! This is what we would like you to *not* have to do. Sorry... > Our aim is to be the Big Data antidote.... > > @Yingyi: Maybe you could take a quick peek at the query issue and > see if there is any low-hanging hope there? > > @Yiran: How big are your windows, typically? (Number of data > points.) > > Cheers, > Mike > > > On 2/21/16 2:42 PM, Yiran Wang wrote: >> Thank you Mike for your update and suggestions! And thank you Ian >> again for working with me. >> >> A little update from my end: >> >> I have been working on query (1) over the weekend for a >> work-around. I tried to simultaneously calculate the stdev on the >> new copy of the list of values with outliers removed, which in >> nature is the same as the query (2). So I ran into the same >> problem that the query did not compile. >> >> What I did was to export the entire dataset with the outliers >> removed into Excel and calculate the stdev in Excel. However, the >> entire dataset is now 363,466 x 200+ in dimension. Though they do >> not exceed the row x col limit in Excel, the memory of my >> computer is not big enough to do anything useful without >> crashing. So I've been breaking the dataset into smaller parts >> and working on each separately. >> >> Yiran >> >> >> >> On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey > > wrote: >> >> Ian, >> >> Thanks working with Yiran on this! I think there is "good" >> and bad news w.r.t these queries: >> >> - The bad news is that they go beyond what we are likely to >> optimize at all well at present, >> as they go beyond what typical DB aggregate functions >> like min/max/avg/count/sum do. >> (I would try forming the groups and then doing these >> things on the groups, but saying >> them in AQL will be tricky, and may lead to queries that >> hit edge cases in the optimizer. >> For some of these my thought was to try using a >> positional variable within a group...?) >> >> - The "good" news (only for AsterixDB) is that this is >> exactly the sort of inspiration that we >> are looking for in terms of understanding how to better >> for query-based analytics in real >> use cases (and this is a very real one!). >> >> To quote a short paper I reviewed just this AM on SQL queries >> kind of like these: "Percentage >> queries are more complex than their conventional counterparts >> and introduce new challenges >> for optimization." (The paper didn't have an applicable >> solution for us, sadly.) >> >> A more general facility that I wish we could offer was to do >> grouping in AsterixDB but then >> have the ability to pass a group to (e.g.) R and then get >> results back for the group. When >> groups are small-ish (like Yiran's windows) that would be >> pretty cool - then one could do >> all sorts of advanced things per group. >> >> Cheers, >> Mike >> >> On 2/21/16 12:35 AM, Ian Maxon wrote: >>> Yiran and I came up with possible answers for these... >>> For 1) , a function could be used that looks something like this: >>> >>> declare function minmax($x){ >>> let $stdv := (avg(for $z in $x return $z*$z) - avg($x) * avg($x))^(0.5) >>> for $y in $x >>> where $y < (2*$stdv) + avg($x) >>> and $y > avg($x) - (2*$stdv) >>> return $y >>> } >>> >>> >>> >>> And then applied to return a new copy of the list of values, removing ones >>> that are outside of 2 stdev. >>> >>> For 2), we also did come up with a potential solution ,but the query fails >>> to compile (Filed ashttps://issues.apache.org/jira/browse/ASTERIXDB-1308 ) >>> >>> >>> Any thoughts on these queries would be welcome :) 1) especially seems >>> inefficient to do as a function. >>> >>> - Ian >>> >>> On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang wrote: >>> >>>> Hi Asterix team, >>>> >>>> I have two queries I'm struggling with. I'm hoping you could provide a >>>> direction for me. Thanks in advance! >>>> >>>> Here is what the data structure looks like: >>>> >>>> create type HRMType as closed { >>>> >>>> row_id: int32, >>>> >>>> sid: int32, >>>> >>>> date: date, >>>> >>>> day: int32, >>>> >>>> time: time, >>>> >>>> bpm: int32, >>>> >>>> RR: float >>>> >>>> }; >>>> >>>> create dataset HRM (HRMType) >>>> >>>> primary key row_id; >>>> >>>> >>>> Previously I have used the time bin function to calculate the standard >>>> deviation of bpm for each time bin: >>>> >>>> for $i in dataset HRM >>>> >>>> group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day, $timebin := >>>> interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i >>>> >>>> return { >>>> >>>> "sid": $sid, >>>> >>>> "gdate": $gdate, >>>> >>>> "gday": $gday, >>>> >>>> "timebin": $timebin, >>>> >>>> "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR) - avg(for $ii in $i >>>> return $ii.RR) * avg(for $ii in $i return $ii.RR))^(0.5)}; >>>> >>>> ​Now I have two things I am hoping to do but need help with: >>>> >>>> 1. For each 1-min time bin, remove the bpm values that are above the top >>>> 5% or below the bottom 5%. I thought about using the min/max function for a >>>> few times to achieve this, but realized that it was not a good idea because >>>> in each time bin, the number of instances are not always the same. So for >>>> each 1-min time bin, we do need to calculate the 5% and 95% threshold, and >>>> remove instances accordingly, which I don't know how to do. >>>> >>>> 2. After removing the outliers of bpm for each 1-min time bin, calculate a >>>> median absolute deviation (MAD) for each 1-min time bin (as another measure >>>> of variation besides the standard deviation). MAD = >>>> median(abs(x-median(x)). I'm not sure how to write a query to do the median >>>> function in Asterix. >>>> >>>> Thank you so much in advance. Let me know if my questions are clear. >>>> >>>> Yiran >>>> >>>> -- >>>> Best, >>>> Yiran >>>> >>>> -- >>>> You received this message because you are subscribed to the Google Groups >>>> "asterixdb-dev" group. >>>> To unsubscribe from this group and stop receiving emails from it, send an >>>> email toasterixdb-dev+unsubscribe@googlegroups.com >>>> . >>>> For more options, visithttps://groups.google.com/d/optout. >>>> >> >> >> >> >> -- >> Best, >> Yiran > > > > > -- > Best, > Yiran --------------090204090600060709060304--