MIME-Version: 1.0
In-Reply-To: <i2re06563881004300622m81b1ee96wc5a819bc2865dea7@mail.gmail.com>
References: <h2r709d5bba1004290914vd93de26ai7f90b3312bd3bc3f@mail.gmail.com>
	<v2le06563881004291432k84331b03v747e2b934935a8c4@mail.gmail.com>
	<v2i709d5bba1004291604o5c89ef40kc3c263cb37d0e99c@mail.gmail.com>
	<i2re06563881004300622m81b1ee96wc5a819bc2865dea7@mail.gmail.com>
From: =?UTF-8?Q?Utku_Can_Top=C3=A7u?= <utku@topcu.gen.tr>
Date: Fri, 30 Apr 2010 16:08:32 +0200
Message-ID: <k2h709d5bba1004300708i87561ac0ua2cf5495d0d2beee@mail.gmail.com>
Subject: Re: ColumnFamilyInputFormat KeyRange scans on a CF
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016e6d7ea6a83442e048574c8ed

--0016e6d7ea6a83442e048574c8ed
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Do you mean running get_range_slices from a single client? Yes, that would
be reasonable for a relatively small key range. But when it comes to
analyzing a really big range in a really big data collection (like the one
we are currently populating), distributing the reads across the cluster
seems to be the only reasonable solution.
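
To illustrate distributing the reads: one way to split a large time range is
to cut it into per-day key sub-ranges, each of which could be handed to a
separate worker that issues its own range scan. This is only a sketch under
assumptions: the function name day_subranges is hypothetical, the key layout
follows the 2009.01.01.00.00.00.RANDOM pattern used in this thread, and it
presumes keys are stored in sorted order (an order-preserving partitioner),
since otherwise a start/end key pair does not describe a contiguous time
window.

```python
from datetime import date, timedelta


def day_subranges(start, end):
    """Split the half-open date interval [start, end) into per-day key ranges.

    Returns a list of (start_key, end_key) string pairs. Keys are assumed to
    look like 'YYYY.MM.DD.hh.mm.ss.ID', so the prefix 'YYYY.MM.DD.' sorts
    before every key of that day and the next day's prefix sorts after them.
    """
    ranges = []
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        ranges.append((day.strftime("%Y.%m.%d."),
                       next_day.strftime("%Y.%m.%d.")))
        day = next_day
    return ranges


# Each pair could be passed to one worker as the bounds of its range scan.
for start_key, end_key in day_subranges(date(2009, 1, 30), date(2009, 2, 2)):
    print(start_key, "->", end_key)
```

The point of day-sized pieces is just that the work units are independent;
any other granularity (hour, week) would follow the same pattern.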

In the current situation, the best option might be to distribute the range
among ColumnFamilies (say, one CF per day) and to empty each CF once its
data has been analyzed, so that it can be reused for another day.
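
A minimal sketch of that per-day layout, assuming a fixed pool of
ColumnFamilies declared ahead of time. The pool size, the "Events_N" naming
and the helper cf_for_key are all hypothetical; the only thing taken from
this thread is the key format (2009.01.01.00.00.00.RANDOM).

```python
from datetime import date

POOL_SIZE = 7  # hypothetical: number of pre-declared ColumnFamilies


def cf_for_key(key, pool_size=POOL_SIZE):
    """Pick the ColumnFamily for a row key of the form 'YYYY.MM.DD.hh.mm.ss.ID'.

    The day is mapped onto a slot by taking its ordinal modulo the pool size,
    so the same CF comes up again every pool_size days and has to be emptied
    before that day arrives.
    """
    year, month, day = (int(part) for part in key.split(".")[:3])
    slot = date(year, month, day).toordinal() % pool_size
    return "Events_%d" % slot  # hypothetical CF naming scheme


print(cf_for_key("2009.01.01.00.00.00.RANDOM"))
```

With this scheme an analysis job for one day only has to scan a single CF,
at the cost of having to finish (and clear) each CF within pool_size days.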

Can you suggest a workaround for this?

On Fri, Apr 30, 2010 at 3:22 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Sounds like doing this w/o m/r with get_range_slices is a reasonable way to
> go.
>
> On Thu, Apr 29, 2010 at 6:04 PM, Utku Can Topçu <utku@topcu.gen.tr> wrote:
> > I'm currently writing collected data continuously to Cassandra, having keys
> > starting with a timestamp and a unique identifier (like
> > 2009.01.01.00.00.00.RANDOM) for being able to query in time ranges.
> >
> > I'm thinking of running periodic MapReduce jobs which will go through a
> > designated time period. I might want to analyze the data only between
> > 2009.01 and 2009.02.
> > I had done this previously with HBase; however, I thought Cassandra would be
> > a better choice for continuously storing data in a safe manner.
> >
> > I guess this briefly explains my designated use case.
> >
> > Best Regards,
> > Utku
> >
> > On Thu, Apr 29, 2010 at 11:32 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> >>
> >> It's technically possible but 0.6 does not support this, no.
> >>
> >> What is the use case you are thinking of?
> >>
> >> On Thu, Apr 29, 2010 at 11:14 AM, Utku Can Topçu <utku@topcu.gen.tr> wrote:
> >> > Hi,
> >> >
> >> > I've been trying to use Cassandra as a supplementary input
> >> > source for Hadoop MapReduce jobs.
> >> >
> >> > The default usage of the ColumnFamilyInputFormat does a full columnfamily
> >> > scan to provide the map input for the MapReduce framework.
> >> >
> >> > However, I believe it should be possible to give a key range so that only
> >> > the specified range is scanned.
> >> >
> >> > Is it by any means possible?
> >> >
> >> > Best Regards,
> >> >
> >> > Utku
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com

--0016e6d7ea6a83442e048574c8ed--