incubator-couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Timeout Error when trying to access views + Indexing problems
Date Tue, 06 Oct 2009 10:21:12 GMT
It's 6:15 am, so I'll keep this short.

The monit scripts I saw proposed in the other thread worried me a bit,
though I don't know much about monit. You generally don't want to point
monit at the couchdb PID, but at the PID of heart. CouchDB (and Erlang
generally) is designed to restart as part of its normal recovery
procedure, which in my experience tends to freak monit out. But maybe I
was Doing It Wrong.
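
For the archives, this is the shape of monit config I mean (paths and
pidfile location are guesses for a typical source install, so treat it
as a sketch):

  check process couchdb with pidfile /usr/local/var/run/couchdb/couchdb.pid
    start program = "/usr/local/etc/init.d/couchdb start"
    stop program = "/usr/local/etc/init.d/couchdb stop"

The important part is that whatever that pidfile tracks survives Erlang's
own restarts; if it tracks the beam process instead of heart, monit sees
every automatic restart as a crash and fights it.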

[Tue, 06 Oct 2009 06:24:09 GMT] [error] [<0.20.0>] {error_report,<0.7.0>,
              {<0.20.0>,std_error,
               "File operation error: eacces. Target:
./couch_doc.beam. Function: get_file. Process: code_server."}}

^^^ We've seen that error before.

http://mail-archives.apache.org/mod_mbox/couchdb-user/200908.mbox/%3cBF2D758D-A311-4592-8BE1-CA8A35A976F1@me.com%3e

Sadly I don't see a resolution. Sounds similar though.
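
If memory serves, eacces from code_server just means the user running the
Erlang VM can't read the .beam it's trying to load. Worth a quick sanity
check along these lines (install path is a guess):

  sudo -u couchdb ls -l /usr/local/lib/couchdb/erlang/lib/couch-*/ebin/couch_doc.beam

If that errors out, a chown -R on the install tree is the usual fix.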

Also, this still confuses the shit out of me. Anyone remember if that ever
got fixed? Niket?

Paul J Davis

On Tue, Oct 6, 2009 at 2:46 AM, Glenn Rempe <glenn@rempe.us> wrote:
> Hi Adam (and group!).  I am still struggling and need help or advice; it is
> much appreciated.
> Adam, I tried your suggestion to open the index file in interactive mode.  I
> was able to start CouchDB in interactive mode, and the 'couch_file:open'
> command returned ok instantly, but 'couch_file:read_header(Fd).' produced no
> output for about 15 minutes, at which point I aborted it.  Should it ever
> take that long?  What output should I expect to see?
>
> Please note that, to make sure there was nothing wrong with the index files,
> I actually destroyed them and tried a rebuild from scratch.  This seemed to
> be going ok, and the indexing process ran all night.  It got to roughly 20%
> of the 22 million records in the DB indexed when it mysteriously died yet
> again.  I am noticing that this is the point at which the process always
> fails.
>
> When the indexing fails it provides NO information in the couchdb log (with
> log level set to 'debug').  I backtracked in the couchdb log file to find
> where it stopped working, looking for a stack trace.  All I found was the
> point at which it just stopped reporting progress and the couchjs process
> was dead.  Here is that point in the log file:
>
> http://pastie.org/642736
>
> And this is the last line reporting indexing status (line 25 in the pastie):
>
> [Mon, 05 Oct 2009 18:52:42 GMT] [debug] [<0.80.0>] New task status for
> searchlight_production _design/SearchDocument: Processed 4377984 of 22650375
> changes (19%)
>
> After that, just silence in the log.
>
> Here is one hint suggesting that it may not be just the couchjs process that
> is crashing, but the entire couchdb process.  In /var/log/syslog I see the
> following three minutes after the last entry showing index status in the
> couch log:
>
> http://pastie.org/642952
>
> This is being logged by monit, which I am using to keep an eye on couchdb.
>  I may turn off monit and see if the entire couchdb process crashes when
> the indexing stops.
>
> UPDATE:
> Even with monit turned off, it looks like the couchdb pids have changed
> during or after the indexing process failure.  This is the 'ps -ef' output
> after the crash, to compare with the output in the previous link which was
> taken when the indexing was started:
>
> http://pastie.org/643335
> END UPDATE
>
> If I try to access that view in a Rails console session after this failure,
> I get a 500 error:
>
>>> counts = SearchDocument.by_engine(:reduce => true, :group => true, :stale => 'ok')['rows']
> RestClient::RequestFailed: HTTP status code 500
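>
> For reference, I believe this maps to the following raw request (assuming
> the default port, and that couchrest turns by_engine into the view of the
> same name in that design doc):
>
> curl 'http://127.0.0.1:5984/searchlight_production/_design/SearchDocument/_view/by_engine?reduce=true&group=true&stale=ok'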
>
> And I see the following stack trace in the couch log:
>
> [Mon, 05 Oct 2009 21:53:00 GMT] [info] [<0.301.0>] Stacktrace:
> [{gen_server,call,2},
>             {couch_view,get_group_server,2},
>             {couch_view,get_group,3},
>             {couch_view,get_map_view,4},
>             {couch_httpd_view,design_doc_view,5},
>             {couch_httpd_db,do_db_req,2},
>             {couch_httpd,handle_request,5},
>             {mochiweb_http,headers,5}]
>
> Here is the full output:
>
> http://pastie.org/642940
>
> Is it possible that either:
>
> - there is some kind of corruption in the main DB file, and it crashes when
> this point (or a specific record in the DB?) is reached?  If so, how can I
> best identify this?
> - there is some file system related issue?  I tend to discount this option
> just because it always seems to fail at roughly the same point in the
> process.
>
> What are the recommended next steps for finding the root cause of the issue?
>  Should I insert a log() statement in the view code and try to run the whole
> index with that in place?
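>
> If so, I am thinking of something like this (hypothetical; my real map is
> more involved, and 'engine' is just one of the fields it emits):
>
> function(doc) {
>   log('map saw doc ' + doc._id);
>   if (doc.engine) {
>     emit(doc.engine, 1);
>   }
> }
>
> My understanding is that log() output ends up in the couchdb log, so the
> last logged _id before the silence should point at whatever doc kills it.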
>
> UPDATE: I tried to run the index again with monit turned off, since I saw
> some messages in the syslog about the couchdb process.  I was hoping that
> perhaps monit was inadvertently killing and restarting couchdb, which was
> causing the index failure.  Alas, this was not the case.  It just died on me
> again after 3.6 million records.  And there are some different error
> messages in the log immediately after it stopped reporting on indexing
> progress.  Please take a look at this pastie to see the logs:
>
> http://pastie.org/643330
>
> (There's a block of error messages starting at line 25)
>
> Tonight I am going to try splitting the 7 views I currently have in a single
> design doc into 4 design docs, according to the part of the app they pertain
> to.  I am hoping this will narrow things down if some design docs finish
> indexing and others don't.
>
> Thanks again for the help (and reading this long message!).
>
> Glenn
>
>
> On Sun, Oct 4, 2009 at 10:19 AM, Adam Kocoloski <kocolosk@apache.org> wrote:
>
>> Hi Glenn, I saw something like this once, but unfortunately I wasn't able
>> to resolve it.  Can you try the following?
>>
>> 1) Start couchdb with the -i option to get an interactive shell
>>
>> 2) At the prompt, try to open the view file and read the header
>>
>> {ok, Fd} = couch_file:open("/vol/couchdb/var/lib/couchdb/.searchlight_production_design/5f190ffb2ed589746e8796d2423869ac.view").
>> couch_file:read_header(Fd).
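>>
>> Both calls should come back almost immediately on a healthy file.  If
>> read_header just hangs, you can wrap it so your shell isn't stuck forever.
>> A rough sketch:
>>
>> Self = self(),
>> spawn(fun() -> Self ! {hdr, couch_file:read_header(Fd)} end),
>> receive {hdr, H} -> H after 60000 -> timed_out end.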
>>
>> In my case, that call timed out, and the stacktrace during normal operation
>> was exactly the one you showed earlier in this thread.  Best,
>>
>> Adam
>>
>>
>> On Oct 4, 2009, at 2:07 AM, Glenn Rempe wrote:
>>
>>> Thanks for the reply, Paul.  Some comments below.
>>>
>>> Also, just for full disclosure, the CouchDB I am working on was moved out
>>> of another couchdb, and it was originally created using CDB 0.9.1.  I show
>>> a dir listing below that indicates exactly what was moved.
>>>
>>> On Sat, Oct 3, 2009 at 6:46 PM, Paul Joseph Davis
>>> <paul.joseph.davis@gmail.com> wrote:
>>>
>>>> Glenn,
>>>>
>>>> This sounds like your map function is timing out which causes the error.
>>>> You could try upping the os process timeout setting in the config.
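>>>> E.g. in your local.ini (it lives in the [couchdb] section, I believe):
>>>>
>>>> [couchdb]
>>>> os_process_timeout = 25000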
>>>>
>>>>
>>> When I go into futon and select one of my views in my design doc, it
>>> *always* pops up a javascript alert with the error text ~5 seconds after
>>> selecting the view.  It doesn't seem to matter what else I do.  It also
>>> didn't change when I raised the os_process_timeout value in futon from
>>> 5000 to 25000 as you suggested.  Can you explain exactly what this
>>> particular param does?  I assume the value is in milliseconds?
>>>
>>>
>>>> To see what's going on you can increase the log level to debug or use the
>>>> log function in your maps.  There's also the status page in futon, which
>>>> I think you said you were looking at.
>>>>
>>>>
>>> Yes, I was previously looking at the status page.  But since I've upgraded
>>> to trunk I never see any indexing activity on the status page, no matter
>>> what I do.
>>>
>>>
>>>> If indexing crashes it should just pick up where it left off when you
>>>> retrigger.  Use the status page to verify.  If it doesn't, let us know.
>>>>
>>>>
>>> Can you clarify: is this also the case when no index has ever successfully
>>> run?  I was wondering if I first need to get through at least one complete
>>> index run (maybe with a smaller number of records) before incremental
>>> indexing works as expected.
>>>
>>> Is there any way to determine what percentage of the total records have
>>> been added to the index?
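>>>
>>> (If the futon status page is just polling some HTTP resource I could hit
>>> that directly; I'd guess something like
>>>
>>> curl http://127.0.0.1:5984/_active_tasks
>>>
>>> but I am not sure that is the right endpoint.)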
>>>
>>> For your info, here are the contents of the DB dir.  You can see the main
>>> DB is 42GB now (~17 million records).
>>>
>>> root@ip-10-250-55-239:/home/rails/underscore-sync-mysql-to-couchdb/shared/log#
>>> ls -la /vol/couchdb/var/lib/couchdb
>>> total 41674956
>>> drwxr-xr-x 3 couchdb root             78 2009-10-04 00:46 .
>>> drwxr-xr-x 3 couchdb root             20 2009-10-03 05:02 ..
>>> -rw-r--r-- 1 couchdb couchdb 42675073133 2009-10-04 02:13
>>> searchlight_production.couch
>>> drwxr-xr-x 2 couchdb couchdb         120 2009-10-03 06:02
>>> .searchlight_production_design
>>>
>>> root@ip-10-250-55-239:/home/rails/underscore-sync-mysql-to-couchdb/shared/log#
>>> ls -la /vol/couchdb/var/lib/couchdb/.searchlight_production_design/
>>> total 33700196
>>> drwxr-xr-x 2 couchdb couchdb         120 2009-10-03 06:02 .
>>> drwxr-xr-x 3 couchdb root             78 2009-10-04 00:46 ..
>>> -rw-r--r-- 1 couchdb couchdb  9819347287 2009-10-03 08:04
>>> 5f190ffb2ed589746e8796d2423869ac.view
>>> -rw-r--r-- 1 couchdb couchdb    91402872 2009-10-03 06:03
>>> b127a58306fb8e7858cd1a92f8398511.view
>>> -rw-r--r-- 1 couchdb couchdb 24598236884 2009-10-02 13:00
>>> SearchDocument.view
>>>
>>>
>>>
>>>> If you can't find anything in the debug logs then ping the list and we'll
>>>> get into trying to duplicate.
>>>>
>>>>
>>> I have turned on the 'debug' level in the logs, and that provided the info
>>> I previously posted.  I'll try to use the log function in the map and see
>>> if that shows anything.
>>>
>>> Thanks for helping.  If it comes to it, I may be able to make a snapshot
>>> of this EBS volume and start a host that you could log in to and get your
>>> hands directly on it, if that would be helpful.
>>>
>>> Glenn
>>>
>>
>>
>
>
> --
> Glenn Rempe
>
> email                 : glenn@rempe.us
> voice                 : (415) 894-5366 or (415)-89G-LENN
> twitter                : @grempe
> contact info        : http://www.rempe.us/contact.html
> pgp                    : http://www.rempe.us/gnupg.txt
>
