Subject: Re: Encountering timeout exception when running get_key_range
From: Ramzi Rabah
To: cassandra-user@incubator.apache.org
Date: Wed, 21 Oct 2009 13:38:44 -0700

Hi Jonathan,

In what scenario will there be only one version of a row? And in that
scenario, does this mean that these tombstone records will never ever be
deleted?

Also, on a higher level, what I am trying to do is provide some garbage
collection of entries in the Cassandra hash that have expired (kind of
like time-to-live functionality for records in Cassandra). Do you have
any insights on how that can be accomplished? (One possible approach is
sketched at the end of this thread.)

On Wed, Oct 21, 2009 at 12:30 PM, Ramzi Rabah wrote:
> I opened https://issues.apache.org/jira/browse/CASSANDRA-507
>
> Ray
>
> On Wed, Oct 21, 2009 at 12:07 PM, Jonathan Ellis wrote:
>> The compaction code removes tombstones, and it runs whenever you have
>> enough sstable fragments.
>>
>> I think I know what is happening -- as an optimization, if there is
>> only one version of a row it will just copy it to the new sstable.
>> This means it won't clean out tombstones.
>>
>> Can you file a bug at https://issues.apache.org/jira/browse/CASSANDRA ?
>>
>> -Jonathan
>>
>> On Wed, Oct 21, 2009 at 2:01 PM, Ramzi Rabah wrote:
>>> Hi Jonathan, I am still running into the timeout issue even after
>>> reducing GCGraceSeconds to 1 hour (we have tons of deletes
>>> happening in our app). Which part of Cassandra is responsible for
>>> deleting the tombstone records, and how often does it run?
>>>
>>> On Tue, Oct 20, 2009 at 12:02 PM, Ramzi Rabah wrote:
>>>> Thank you so much Jonathan.
>>>>
>>>> Data is test data so I'll just wipe it out and restart after updating
>>>> GCGraceSeconds.
>>>> Thanks for your help.
>>>>
>>>> Ray
>>>>
>>>> On Tue, Oct 20, 2009 at 11:39 AM, Jonathan Ellis wrote:
>>>>> The problem is you have a few MB of actual data and a few hundred MB
>>>>> of tombstones (data marked deleted). So what happens is get_key_range
>>>>> spends a long, long time iterating through the tombstoned rows,
>>>>> looking for keys that actually still exist.
>>>>>
>>>>> We're going to redesign this for CASSANDRA-344, but for the 0.4
>>>>> series, you should restart with GCGraceSeconds much lower (e.g. 3600),
>>>>> delete your old data files, and reload your data fresh. (Instead of
>>>>> reloading, you can use "nodeprobe compact" on each node to force a
>>>>> major compaction, but it will take much longer since you have so many
>>>>> tombstones.)
>>>>>
>>>>> -Jonathan
>>>>>
>>>>> On Mon, Oct 19, 2009 at 10:45 PM, Ramzi Rabah wrote:
>>>>>> Hi Jonathan:
>>>>>>
>>>>>> Here is the storage-conf.xml for one of the servers:
>>>>>> http://email.slicezero.com/storage-conf.xml
>>>>>>
>>>>>> and here is the zipped data:
>>>>>> http://email.slicezero.com/datastoreDeletion.tgz
>>>>>>
>>>>>> Thanks
>>>>>> Ray
>>>>>>
>>>>>> On Mon, Oct 19, 2009 at 8:30 PM, Jonathan Ellis wrote:
>>>>>>> Yes, please. You'll probably have to use something like
>>>>>>> http://www.getdropbox.com/ if you don't have a public web server to
>>>>>>> stash it temporarily.
>>>>>>>
>>>>>>> On Mon, Oct 19, 2009 at 10:28 PM, Ramzi Rabah wrote:
>>>>>>>> Hi Jonathan, the data is about 60 MB. Would you like me to send it to you?
>>>>>>>>
>>>>>>>> On Mon, Oct 19, 2009 at 8:20 PM, Jonathan Ellis wrote:
>>>>>>>>> Is the data on 6, 9, or 10 small enough that you could tar.gz it up
>>>>>>>>> for me to use to reproduce over here?
>>>>>>>>>
>>>>>>>>> On Mon, Oct 19, 2009 at 10:17 PM, Ramzi Rabah wrote:
>>>>>>>>>> So my cluster has 4 nodes: node6, node8, node9 and node10. I turned
>>>>>>>>>> them all off.
>>>>>>>>>> 1- I started node6 by itself and still got the problem.
>>>>>>>>>> 2- I started node8 by itself and it ran fine (returned no keys).
>>>>>>>>>> 3- I started node9 by itself and still got the problem.
>>>>>>>>>> 4- I started node10 by itself and still got the problem.
>>>>>>>>>>
>>>>>>>>>> Ray
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 19, 2009 at 7:44 PM, Jonathan Ellis wrote:
>>>>>>>>>>> That's really strange... Can you reproduce on a single-node cluster?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 19, 2009 at 9:34 PM, Ramzi Rabah wrote:
>>>>>>>>>>>> The rows are very small. There are a handful of columns per row
>>>>>>>>>>>> (approximately 4-5 columns per row).
>>>>>>>>>>>> Each column has a name which is a String (20-30 characters long), and
>>>>>>>>>>>> the value is an empty array of bytes (new byte[0]).
>>>>>>>>>>>> I just use the names of the columns, and don't need to store any
>>>>>>>>>>>> values in this Column Family.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Ray
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 19, 2009 at 7:24 PM, Jonathan Ellis wrote:
>>>>>>>>>>>>> Can you tell me anything about the nature of your rows? Many/few
>>>>>>>>>>>>> columns? Large/small column values?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 9:17 PM, Ramzi Rabah wrote:
>>>>>>>>>>>>>> Hi Jonathan,
>>>>>>>>>>>>>> I actually spoke too early. Now even if I restart the servers it still
>>>>>>>>>>>>>> gives a timeout exception.
>>>>>>>>>>>>>> As far as the sstable files go, I'm not sure which ones are the sstables,
>>>>>>>>>>>>>> but here is the list of files in the data directory that are prepended
>>>>>>>>>>>>>> with the column family name:
>>>>>>>>>>>>>> DatastoreDeletionSchedule-1-Data.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-1-Filter.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-1-Index.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-5-Data.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-5-Filter.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-5-Index.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-7-Data.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-7-Filter.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-7-Index.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-8-Data.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-8-Filter.db
>>>>>>>>>>>>>> DatastoreDeletionSchedule-8-Index.db
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not currently doing any system stat collection.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 6:41 PM, Jonathan Ellis wrote:
>>>>>>>>>>>>>>> How many sstable files are in the data directories for the
>>>>>>>>>>>>>>> columnfamily you are querying?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How many are there after you restart and it is happy?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are you doing system stat collection with munin or ganglia or some such?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 8:25 PM, Ramzi Rabah wrote:
>>>>>>>>>>>>>>>> Hi Jonathan, I updated to 0.4.1 and I still get the same exception
>>>>>>>>>>>>>>>> when I call get_key_range.
>>>>>>>>>>>>>>>> I checked all the server logs, and there is only one exception being
>>>>>>>>>>>>>>>> thrown, by whichever server I am connecting to.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Ray
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 4:52 PM, Jonathan Ellis wrote:
>>>>>>>>>>>>>>>>> No, it's smart enough to avoid scanning.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 6:49 PM, Ramzi Rabah wrote:
>>>>>>>>>>>>>>>>>> Hi Jonathan, thanks for the reply. I will update the code to 0.4.1
>>>>>>>>>>>>>>>>>> and will check all the logs on all the machines.
>>>>>>>>>>>>>>>>>> Just a simple question: when you do a get_key_range and you specify ""
>>>>>>>>>>>>>>>>>> and "" for start and end, and the limit is 25, if there are too many
>>>>>>>>>>>>>>>>>> entries, does it do a scan to find out the start or is it smart enough
>>>>>>>>>>>>>>>>>> to know what the start key is?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 4:42 PM, Jonathan Ellis wrote:
>>>>>>>>>>>>>>>>>>> You should check the other nodes for potential exceptions keeping them
>>>>>>>>>>>>>>>>>>> from replying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Without seeing that it's hard to say if this is caused by an old bug,
>>>>>>>>>>>>>>>>>>> but you should definitely upgrade to 0.4.1 either way :)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 19, 2009 at 5:51 PM, Ramzi Rabah wrote:
>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am running into problems with get_key_range.
>>>>>>>>>>>>>>>>>>>> I have OrderPreservingPartitioner defined in storage-conf.xml and I am
>>>>>>>>>>>>>>>>>>>> using a columnfamily that looks like
>>>>>>>>>>>>>>>>>>>> <ColumnFamily Name="DatastoreDeletionSchedule" />
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My command is client.get_key_range("Keyspace1", "DatastoreDeletionSchedule",
>>>>>>>>>>>>>>>>>>>>                    "", "", 25, ConsistencyLevel.ONE);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It usually works fine, but after a day or so of server writes into
>>>>>>>>>>>>>>>>>>>> this column family, I started getting
>>>>>>>>>>>>>>>>>>>> ERROR [pool-1-thread-36] 2009-10-19 17:24:28,223 Cassandra.java (line
>>>>>>>>>>>>>>>>>>>> 770) Internal error processing get_key_range
>>>>>>>>>>>>>>>>>>>> java.lang.RuntimeException: java.util.concurrent.TimeoutException:
>>>>>>>>>>>>>>>>>>>> Operation timed out.
>>>>>>>>>>>>>>>>>>>>        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:560)
>>>>>>>>>>>>>>>>>>>>        at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>>>>>>>>>>>>>>>>>>>>        at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:766)
>>>>>>>>>>>>>>>>>>>>        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:609)
>>>>>>>>>>>>>>>>>>>>        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>>>>>>>>>>>>>>>>>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
>>>>>>>>>>>>>>>>>>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>>>>>>>>>>>>>>>>>>>>        at java.lang.Thread.run(Thread.java:619)
>>>>>>>>>>>>>>>>>>>> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>>>>>>>>>>>>>>>>>>>>        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>>>>>>>>>>>>>>>>>>>>        at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:556)
>>>>>>>>>>>>>>>>>>>>        ... 7 more
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I still get the timeout exceptions even though the servers have been
>>>>>>>>>>>>>>>>>>>> idle for 2 days. When I restart the Cassandra servers, it seems to
>>>>>>>>>>>>>>>>>>>> work fine again. Any ideas what could be wrong?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> By the way, I am using version apache-cassandra-incubating-0.4.0-rc2.
>>>>>>>>>>>>>>>>>>>> Not sure if this is fixed in the 0.4.1 version.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Ray
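
A rough sketch of the client-side time-to-live cleanup Ray asks about at
the top of this thread, assuming the 0.4-era Thrift API shown above
(get_key_range taking keyspace, column family, start/end keys, a limit,
and a consistency level, and returning the matching keys). The
expiry-prefixed key layout and the RowRemover hook standing in for the
Thrift remove call are illustrative assumptions, not something from the
thread:

    import java.util.List;

    import org.apache.cassandra.service.Cassandra;
    import org.apache.cassandra.service.ConsistencyLevel;

    public class TtlSweeper {

        // Hypothetical hook standing in for the Thrift remove call,
        // whose exact 0.4 signature does not appear in this thread.
        public interface RowRemover {
            void remove(String key) throws Exception;
        }

        // Page through the deletion-schedule column family and remove
        // rows whose expiry has passed. Assumes keys are prefixed with
        // a 13-digit millisecond timestamp, so under
        // OrderPreservingPartitioner they come back sorted by expiry
        // and the sweep can stop early.
        public static void sweep(Cassandra.Client client, RowRemover remover)
                throws Exception {
            String start = "";
            long now = System.currentTimeMillis();
            while (true) {
                List<String> keys = client.get_key_range(
                        "Keyspace1", "DatastoreDeletionSchedule",
                        start, "", 25, ConsistencyLevel.ONE);
                if (keys.isEmpty()) {
                    return;
                }
                for (String key : keys) {
                    long expiry = Long.parseLong(key.substring(0, 13)); // assumed key layout
                    if (expiry > now) {
                        return; // keys sorted by expiry; the rest are still live
                    }
                    remover.remove(key); // tombstones the row; see GCGraceSeconds below
                }
                // The start key is assumed inclusive, so resume just
                // past the last key already seen.
                start = keys.get(keys.size() - 1) + "\u0000";
            }
        }
    }

Note the irony the thread itself surfaces: each remove creates a
tombstone, which is exactly what makes get_key_range slow until
compaction purges them.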
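Jonathan's GCGraceSeconds suggestion, as it might look in a 0.4-style
storage-conf.xml. The element name is taken from the setting's name and
its placement here is an assumption, so check your own storage-conf.xml
for where it actually lives:

    <Storage>
      ...
      <!-- Seconds to keep tombstones before compaction is allowed to
           purge them; 3600 (one hour) per Jonathan's advice for this
           delete-heavy workload. -->
      <GCGraceSeconds>3600</GCGraceSeconds>
      ...
    </Storage>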
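For anyone reproducing this, a minimal standalone version of the failing
call, assuming the Thrift-generated classes live in
org.apache.cassandra.service (as the stack trace above suggests) and a
plain socket transport on the usual Thrift port:

    import java.util.List;

    import org.apache.cassandra.service.Cassandra;
    import org.apache.cassandra.service.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class KeyRangeProbe {
        public static void main(String[] args) throws Exception {
            // 9160 is Cassandra's usual Thrift port; adjust to match
            // your storage-conf.xml.
            TTransport transport = new TSocket("localhost", 9160);
            transport.open();
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));

            // The same call that times out in the thread: first 25
            // keys across the whole range.
            List<String> keys = client.get_key_range(
                    "Keyspace1", "DatastoreDeletionSchedule",
                    "", "", 25, ConsistencyLevel.ONE);
            for (String key : keys) {
                System.out.println(key);
            }

            transport.close();
        }
    }

Per Jonathan's note that the server is smart enough to avoid scanning
for the start key, subsequent pages would pass the last key returned as
the new start rather than restarting from "".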