Subject: Re: Compact not completing
From: Adam Kocoloski
Date: Sun, 2 Jan 2011 17:06:36 -0500
To: user@couchdb.apache.org
Message-Id: <1AF66366-0D6F-41D5-8C8C-4D88BCAB70F9@apache.org>
In-Reply-To: <20110102024357.gkhnfjeu3o4cskwk@webmail.loop.com.br>

Ah, Mike, I didn't get the instructions right in step 1. Sorry about that. What you really want are the last 1000 Ids in the seq_tree prior to the compactor crash. So maybe something like

GET /iris/_changes?descending=true&limit=1000&since=96282148

Regards, Adam

On Jan 2, 2011, at 12:43 AM, mike@loop.com.br wrote:

> Adam,
>
> Thanks for an excellent explanation. It was easy to find the culprit:
>
> curl -s '172.17.17.3:5984/iris/_changes?since=96281148&limit=1000&include_docs=true' | grep -v time
> {"results":[
> {"seq":96281622,"id":"1292252400F7005","changes":[{"rev":"2-d94be4c93931a35524b3f34b9de41a11"}],"deleted":true,"doc":{"_id":"1292252400F7005","_rev":"2-d94be4c93931a35524b3f34b9de41a11","_deleted":true}},
> ],
> "last_seq":96282306}
>
> The problem I have is that the document exists with a different rev and is not
> deleted:
>
> curl -s '172.17.17.3:5984/iris/1292252400F7005'
> {"_id":"1292252400F7005","_rev":"1-74a74942107db308d42864e50c1517aa", ....
>
> I deleted the document and inserted it again, but the changes feed remains
> the same as above - I presume the compact will still fail as before.
>
> Anything else I can do? (I guess I could hack copy_docs so that not_found
> is not 'fatal'.)
>
> I am compacting regardless, maybe it'll pass.....
>
> Regards,
>
> Mike
>
> Citando Adam Kocoloski:
>
>> Ok, so this is the same error both times.
>> As far as I can tell it indicates that the seq_tree and the id_tree indexes are out of sync; the seq_tree contains some record that isn't present in the id_tree. That's never supposed to happen, so the compactor crashes instead of trying to deal with the 'not_found' result when it does a lookup on the missing entry in the id_tree.
>>
>> I suspect that the _purge code is to blame, since deletions don't actually remove entries from these indexes. One thing you might try:
>>
>> 1) Query _changes starting from 96281148 (1000 less than the last status update) and grab the next 1000 rows
>>
>> 2) Figure out which of those entries are missing from the id_tree, e.g. look up the document and see if the response is {"not_found":"missing"}. You could also try using include_docs=true on the _changes feed to accomplish the same.
>>
>> 3) Once you've identified the problematic IDs, try creating them again. You might end up introducing duplicates in the _changes feed, but if you do there's a procedure to fix that.
>>
>> That's the simplest solution I can think of. Purging them again won't work because the first thing _purge does is look up the Ids in the id_tree.
>>
>> Regards,
>>
>> Adam
>>
>> On Jan 1, 2011, at 9:47 AM, mike@loop.com.br wrote:
>>
>>> I did the same with the tagged 1.0.1. Attached is
>>> the error produced. My responses are below:
>>>
>>> Citando Robert Newson:
>>>
>>>> Some more info would help here.
>>>>
>>>> 1) How far did compaction get?
>>> It gets to seq 96282148 of 109105202, i.e. 88%
>>>
>>>> 2) Do you have enough spare disk space?
>>> Yes I have lots of free space :-)
>>>
>>>> 3) What commit of 1.0.x were you running before you moved to 08d71849?
>>> I was using Dec 13 852fa047. Before that, something at least a month old.
>>>
>>>> B.
>>>>
>>>> On Fri, Dec 31, 2010 at 3:55 PM, Robert Newson wrote:
>>>>> Can you try this with a tagged release like 1.0.1?
>>>>>
>>>>> On Fri, Dec 31, 2010 at 3:38 PM, wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Hoping for some guidance. I have a rather large (295Gb) database that was
>>>>>> created running 1.0.x, and I am pretty certain that there is no corruption - it has
>>>>>> always been on a clean ZFS volume.
>>>>>>
>>>>>> I upgraded to 1.0.x (08d71849464a8e1cc869b385591fa00b3ad0f843 git) in the
>>>>>> hope that it may resolve the issue.
>>>>>>
>>>>>> I have previously '_purge'd many documents from this database, so
>>>>>> that may be relevant.
>>>>>>
>>>>>> I am annexing the error from couchdb.log
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mike
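[Archive note] Adam's three-step repair procedure, quoted above, can be sketched roughly as follows. The database name and sequence number come from the thread; the function name and the sample rows are illustrative stand-ins, not a real CouchDB client - in practice the rows come from `GET /iris/_changes?since=96281148&limit=1000` and the membership test is a per-document `GET /iris/<id>` checked for a `{"not_found":"missing"}` response.

```python
def find_missing_ids(changes_rows, id_tree):
    """Step 2: return IDs present in the seq_tree (i.e. the _changes
    feed) whose lookup in the id_tree fails."""
    return [row["id"] for row in changes_rows if row["id"] not in id_tree]

# Mock data standing in for the real feeds: two seq_tree entries,
# of which only one is resolvable by ID.
changes_rows = [
    {"seq": 96281622, "id": "1292252400F7005", "deleted": True},
    {"seq": 96281623, "id": "1292252400F7006", "deleted": False},
]
id_tree = {"1292252400F7006"}

missing = find_missing_ids(changes_rows, id_tree)
# Step 3 would then re-create each missing ID (e.g. with a PUT),
# so that the compactor's id_tree lookup no longer hits not_found.
print(missing)  # -> ['1292252400F7005']
```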