Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
From: Robert Stupp <snazy@snazy.de>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6"
Message-Id: <3A84C198-7837-488E-AE14-CA96752B2D96@snazy.de>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2102\))
Subject: Re: Cassandra: UDF
Date: Wed, 5 Aug 2015 13:02:42 +0200
References: 
 <CABVuQoVLR2oN34echUJzHfuPjo0XW1Xhc80QZbqBEVzqEpJdsA@mail.gmail.com>
To: user@cassandra.apache.org
In-Reply-To: 
 <CABVuQoVLR2oN34echUJzHfuPjo0XW1Xhc80QZbqBEVzqEpJdsA@mail.gmail.com>


--Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Suresh,

Tip: you can use the alternative ("pg-style") string delimiters ($$),
which can span multiple lines and make the CQL statement much nicer:
CREATE OR REPLACE FUNCTION state_groupbyandsum (
	state map<text, double>, datetime text, amount text )
CALLED ON NULL INPUT
RETURNS map<text, double>
LANGUAGE java
AS $$
	String date = datetime.substring(0,10);
	Double count = (Double) state.get(date);
...
	return state;
$$ ;
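
For reference, the logic of that body can be tried out as plain Java outside of Cassandra. This is only a rough sketch based on the one-line body from the original question, not the exact function: the class and method names (StateFunctionSketch, groupByAndSum) are made up for illustration, and the null checks are an added assumption, motivated by the fact that CALLED ON NULL INPUT means the function is invoked even when an argument is null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone version of the state function body, for illustration only.
class StateFunctionSketch {

    // Adds `amount` to the running sum stored under the date part of `datetime`.
    static Map<String, Double> groupByAndSum(Map<String, Double> state, String datetime, String amount) {
        // Assumption: treat a null state as an empty map.
        if (state == null) {
            state = new HashMap<>();
        }
        // Assumption: skip rows with a missing or too-short datetime, or a missing amount.
        if (datetime == null || datetime.length() < 10 || amount == null) {
            return state;
        }
        String date = datetime.substring(0, 10);
        // Insert the amount for a new date, or add it to the existing sum.
        state.merge(date, Double.parseDouble(amount), Double::sum);
        return state;
    }

    public static void main(String[] args) {
        Map<String, Double> state = new HashMap<>();
        state = groupByAndSum(state, "2015-08-05 12:09:00", "10.5");
        state = groupByAndSum(state, "2015-08-05 13:02:42", "4.5");
        state = groupByAndSum(state, null, "1.0"); // ignored: datetime is null
        System.out.println(state); // prints {2015-08-05=15.0}
    }
}
```

Whether skipping such rows is the right behaviour depends on your data; failing loudly may be preferable.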

UDAs are best suited for queries against a single partition, not
against a possibly huge table.
This is nothing specific to UDAs, as you should always design your
queries to hit a single partition.

User-defined aggregates are not meant to do the job of (or even replace)
an analytics framework like Apache Spark.
Frankly, top-K queries over a big data set are best suited for Spark
using the Cassandra-Spark-Connector.

In your case: imagine your query returns 1B rows. All that information
must be held in the map in the Java heap of the coordinator (the node
that runs the UDA).

You can do a top-K query with UDAs over the whole table by relying on
the fact that rows passed to the state function are grouped by their
partition key (assuming that 'datetime' is in your partition key) and
kicking datetime values that do not match the top-K criteria out of
your state map.
BUT: I do NOT recommend doing that on each user request. Instead, run it
in a batch job and pipe the result into another table for fast read access.
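
To make the pruning idea a bit more concrete, here is a rough plain-Java sketch of what the state logic could look like. It is not a tested UDF. It assumes that all rows for one date arrive contiguously (a stronger assumption than "grouped by partition key", and one that only holds if your data model guarantees it), that K is 5, and that a final step (something like a FINALFUNC on the aggregate) trims the result. The names TopKSketch, state, finish and evictSmallest are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a top-K state function plus final trim, for illustration only.
class TopKSketch {

    static final int K = 5; // assumed K, matching the "top 5" in the question

    // State step: add the amount to its date, then evict the smallest finished date
    // once the map holds more than K finished dates plus the one still in progress.
    static Map<String, Double> state(Map<String, Double> state, String datetime, String amount) {
        String date = datetime.substring(0, 10);
        state.merge(date, Double.parseDouble(amount), Double::sum);
        if (state.size() > K + 1) {
            evictSmallest(state, date);
        }
        return state;
    }

    // Final step: the date in progress is now finished too, so trim down to K entries.
    static Map<String, Double> finish(Map<String, Double> state) {
        while (state.size() > K) {
            evictSmallest(state, null);
        }
        return state;
    }

    // Removes the entry with the smallest sum, never touching `keep` (the date in progress).
    private static void evictSmallest(Map<String, Double> state, String keep) {
        String smallest = null;
        for (Map.Entry<String, Double> e : state.entrySet()) {
            if (e.getKey().equals(keep)) {
                continue;
            }
            if (smallest == null || e.getValue() < state.get(smallest)) {
                smallest = e.getKey();
            }
        }
        if (smallest != null) {
            state.remove(smallest);
        }
    }
}
```

With this shape the state map never holds more than K + 1 entries, no matter how many rows are scanned. If rows for the same date can show up again later in the scan, the evicted sums are lost and the result is wrong, which is one more reason to do this in a batch job or in Spark.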

Robert


> On 05 Aug 2015, at 12:09, Suresh Mahawar <suresh.mahawar@technocube.in> wrote:
>
> Hi,
>
> I need your help. I have a query which get top 5 records group by date (not date + time) and sum of amount.
>
> I wrote the following but it returns all the records not just top 5 records
>
> CREATE OR REPLACE FUNCTION state_groupbyandsum( state map<text, double>, datetime text, amount text )
> CALLED ON NULL INPUT
> RETURNS map<text, double>
> LANGUAGE java
> AS 'String date = datetime.substring(0,10); Double count = (Double) state.get(date);  if (count == null) count = Double.parseDouble(amount); else count = count +  Double.parseDouble(amount); state.put(date, count); return state;' ;
>
>
> CREATE OR REPLACE AGGREGATE groupbyandsum(text, text)
> SFUNC state_groupbyandsum
> STYPE map<text, double>
> INITCOND {};
>
> select groupbyandsum(datetime, amout) from warehouse;
>
> Could you please help out to get just 5 records.
>
>
> Thanks & Regards,
> Suresh Mahawar
> TechnoCube
> Find Me on Linkedin <https://www.linkedin.com/pub/suresh-mahawar/2a/b9/a80>
--
Robert Stupp
@snazy


--Apple-Mail=_318C2BD8-486A-4C94-9504-48B6CF3546D6--