Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id AB90C184CD for ; Wed, 5 Aug 2015 11:05:12 +0000 (UTC) Received: (qmail 73718 invoked by uid 500); 5 Aug 2015 11:05:09 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 73679 invoked by uid 500); 5 Aug 2015 11:05:09 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 73669 invoked by uid 99); 5 Aug 2015 11:05:09 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 05 Aug 2015 11:05:09 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 1DC08C0045 for ; Wed, 5 Aug 2015 11:05:09 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3.653 X-Spam-Level: *** X-Spam-Status: No, score=3.653 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, SPF_NEUTRAL=0.652, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id VUYDA64xKabs for ; Wed, 5 Aug 2015 11:04:58 +0000 (UTC) Received: from wp051.webpack.hosteurope.de (wp051.webpack.hosteurope.de [80.237.132.58]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id E782A42B14 for ; Wed, 5 Aug 2015 11:04:57 +0000 (UTC) Received: from [195.227.81.222] (helo=[10.8.0.186]); authenticated by wp051.webpack.hosteurope.de running ExIM with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) id 1ZMwT5-0006KX-Hd; Wed, 05 Aug 2015 13:02:43 +0200 From: Robert Stupp Content-Type: multipart/alternative; boundary="Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6" Message-Id: <3A84C198-7837-488E-AE14-CA96752B2D96@snazy.de> Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2102\)) Subject: Re: Cassandra: UDF Date: Wed, 5 Aug 2015 13:02:42 +0200 References: To: user@cassandra.apache.org In-Reply-To: X-Mailer: Apple Mail (2.2102) X-bounce-key: webpack.hosteurope.de;snazy@snazy.de;1438772697;32f905b2; --Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Suresh, tip: you can use alternative (=E2=80=9Dpg-style=E2=80=9D) string = delimiters, which can span over multiple lines and makes the CQL = statement much nicer: CREATE OR REPLACE FUNCTION state_groupbyandsum ( state map, datetime text, amount text ) CALLED ON NULL INPUT RETURNS map LANGUAGE java=20 AS $$ String date =3D datetime.substring(0,10); Double count =3D (Double) state.get(date); ... return state; $$ ; UDAs are best suited for queries against a single partition - not = against a possibly really huge table. This is nothing special for UDAs as you should always code your queries = to hit a single partition. User defined aggregates are not meant to do the job of (or even replace) = an analytics framework like Apache Spark. Frankly, Top-K-queries over a big data set are best suited for Spark = using the Cassandra-Spark-Connector. In your case: imagine your query returns 1B rows - all that information = must be held in the map in the Java heap of the coordinator (the node = that runs the UDA). You can do Top-K query with UDAs over the whole table - and rely on the = fact that rows passed to the state function are grouped by their = partition key (assuming that =E2=80=98datetime=E2=80=99 is in your = partition key) AND kicking datetime values out of your state-map that do = not match the Top-K criteria. BUT: I do NOT recommend to do that upon user request - instead in a = batch job and pipe the result in another table for fast read access. Robert > On 05 Aug 2015, at 12:09, Suresh Mahawar = wrote: >=20 > Hi, >=20 > I need your help. I have a query which get top 5 records group by date = (not date + time) and sum of amount. >=20 > I wrote the following but it returns all the records not just top 5 = records >=20 > CREATE OR REPLACE FUNCTION state_groupbyandsum( state map, datetime text, amount text ) > CALLED ON NULL INPUT > RETURNS map > LANGUAGE java=20 > AS 'String date =3D datetime.substring(0,10); Double count =3D = (Double) state.get(date); if (count =3D=3D null) count =3D = Double.parseDouble(amount); else count =3D count + = Double.parseDouble(amount); state.put(date, count); return state;' ; >=20 >=20 > CREATE OR REPLACE AGGREGATE groupbyandsum(text, text)=20 > SFUNC state_groupbyandsum > STYPE map > INITCOND {}; >=20 > select groupbyandsum(datetime, amout) from warehouse; >=20 > Could you please help out to get just 5 records. >=20 >=20 > Thanks & Regards, > Suresh Mahawar > TechnoCube > Find Me on Linkedin = =E2=80=94 Robert Stupp @snazy --Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8
Suresh,

tip: you can use alternative = (=E2=80=9Dpg-style=E2=80=9D) string delimiters, which can span over = multiple lines and makes the CQL statement much nicer:
CREATE OR REPLACE FUNCTION = state_groupbyandsum (
state = map<text, double>, datetime text, amount text )
CALLED ON NULL INPUT
RETURNS = map<text, double>
LANGUAGE = java 
AS $$
= String date =3D datetime.substring(0,10);
= Double count =3D (Double) state.get(date);
...
return = state;
$$ ;

UDAs are best suited for = queries against a single partition - not against a possibly really huge = table.
This is nothing special for = UDAs as you should always code your queries to hit a single = partition.

User defined aggregates are not meant to do = the job of (or even replace) an analytics framework like Apache = Spark.
Frankly, Top-K-queries over a big data = set are best suited for Spark using the = Cassandra-Spark-Connector.

In your case: imagine your query returns 1B rows - all that = information must be held in the map in the Java heap of the coordinator = (the node that runs the UDA).

You can do Top-K query with UDAs over = the whole table - and rely on the fact that rows passed to the state = function are grouped by their partition key (assuming that = =E2=80=98datetime=E2=80=99 is in your partition key) AND kicking = datetime values out of your state-map that do not match the Top-K = criteria.
BUT: I do NOT recommend to do that upon = user request - instead in a batch job and pipe the result in another = table for fast read access.

Robert


On = 05 Aug 2015, at 12:09, Suresh Mahawar <suresh.mahawar@technocube.in> wrote:

Hi,

I= need your help. I have a query which get top 5 records group by date = (not date + time) and sum of amount.

I wrote the following but it returns = all the records not just top 5 records

CREATE OR REPLACE = FUNCTION state_groupbyandsum( state map<text, double>, datetime = text, amount text )
CALLED ON NULL = INPUT
RETURNS = map<text, double>
LANGUAGE = java 
AS 'String date =3D= datetime.substring(0,10); Double count =3D (Double) state.get(date); =  if (count =3D=3D null) count =3D Double.parseDouble(amount); else = count =3D count +  Double.parseDouble(amount); state.put(date, = count); return state;' ;


CREATE OR REPLACE AGGREGATE groupbyandsum(text, = text) 
SFUNC = state_groupbyandsum
STYPE = map<text, double>
INITCOND = {};

select = groupbyandsum(datetime, amout) from warehouse;

Could you please help = out to get just 5 records.


Thanks & Regards,
Suresh = Mahawar
TechnoCube
Find Me = on Linkedin

=E2=80=94
Robert = Stupp
@snazy

= --Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6--