From: Glenn Rempe
Date: Mon, 5 Oct 2009 23:46:12 -0700
Subject: Re: Timeout Error when trying to access views + Indexing problems
To: user@couchdb.apache.org

Hi Adam (and group!).

So I am still struggling and need help or advice. It is much appreciated.

Adam, I tried your suggestion to open the index file in interactive mode. I was able to start CouchDB in interactive mode, and the 'couch_file:open' command returned ok instantly, but 'couch_file:read_header(Fd).' produced no output for about 15 minutes and I eventually aborted it. Should it ever take that long? What output should I expect to see?

Please note that, to make sure there was nothing wrong with the index files, I actually destroyed them and tried a rebuild from scratch. This seemed to be going fine, and the indexing process ran all night. It got to roughly 20% of the 22 million records in the DB indexed when it mysteriously died yet again. I am noticing that it is always at about this point that the indexing fails.

When the indexing fails it provides NO information in the couchdb log (with the log level set to 'debug'). I backtracked in the couchdb log file to find where it stopped working, looking for a stack trace. All I found was the point at which it simply stopped reporting progress, and the couchjs process was dead.
Here is that point in the log file: http://pastie.org/642736

And this is the last line reporting indexing status (line 25 in the pastie):

[Mon, 05 Oct 2009 18:52:42 GMT] [debug] [<0.80.0>] New task status for searchlight_production _design/SearchDocument: Processed 4377984 of 22650375 changes (19%)

After that, just silence in the log.

Here is one hint which suggests that perhaps it is not just the couchjs process that is crashing, but the entire couchdb process. In /var/log/syslog I see the following three minutes after the last entry showing index status in the couch log: http://pastie.org/642952

This is being logged by monit, which I am using to keep an eye on couchdb. I may turn off monit and see if the entire couchdb process crashes when the indexing stops.

UPDATE: Even with monit turned off, it looks like the couchdb pids have changed during or after the indexing failure. This is the 'ps -ef' output after the crash, to compare with the output in the previous link, which was taken when the indexing was started: http://pastie.org/643335 END UPDATE

If I try to access that view in a rails console session after this failure I get a 500 error:

  >> counts = SearchDocument.by_engine(:reduce => true, :group => true, :stale => 'ok')['rows']
  RestClient::RequestFailed: HTTP status code 500

And I see the following stack trace in the couch log:

[Mon, 05 Oct 2009 21:53:00 GMT] [info] [<0.301.0>] Stacktrace: [{gen_server,call,2},
    {couch_view,get_group_server,2},
    {couch_view,get_group,3},
    {couch_view,get_map_view,4},
    {couch_httpd_view,design_doc_view,5},
    {couch_httpd_db,do_db_req,2},
    {couch_httpd,handle_request,5},
    {mochiweb_http,headers,5}]

Here is the full output: http://pastie.org/642940

Is it possible that either:

- there is some kind of corruption in the main DB file, and when this point (or a specific record in the DB?) is reached it crashes? If so, how can I best identify this?
- there is some file system related issue? I tend to discount this option just because it always seems to fail at roughly the same point in the process.

What are the recommended next steps for finding the root cause of the issue? Should I insert a log() statement in the view code and try to run the whole index with that in place?

UPDATE: I tried to run the index again with monit turned off, since I saw some messages in the syslog about the couchdb process. I was hoping that perhaps monit was inadvertently killing and restarting couchdb, which was causing the index failure. Alas, this was not the case. It just died on me again after 3.6 million records, and there are some different error messages in the log immediately after it stopped reporting indexing progress. Please take a look at this pastie to see the logs: http://pastie.org/643330 (there's a block of error messages starting at line 25).

I am going to try tonight splitting the 7 views I currently have in a single design doc into 4 design docs according to the part of the app they pertain to. I am hoping that this will narrow it down if some design docs finish indexing and others don't.
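To make the split concrete, here is roughly the shape of one of the smaller design docs I have in mind. This is only a sketch: the design doc name, the view name, and the 'engine' field are stand-ins for my real ones, and the map/reduce bodies are simplified.

  {
    "_id": "_design/SearchDocumentEngines",
    "language": "javascript",
    "views": {
      "by_engine": {
        "map": "function(doc) { if (doc.engine) { emit(doc.engine, 1); } }",
        "reduce": "function(keys, values, rereduce) { return sum(values); }"
      }
    }
  }

Since each design doc gets its own view group (and its own index file under .searchlight_production_design), splitting this way should at least tell me whether the failure follows one particular view or hits every group at around the same point in the changes feed.

And for the log() idea I mentioned above, this is the sort of thing I was planning to add to the map function (again just a sketch, with a made-up field check):

  function(doc) {
    // log() output from the view server goes to couch.log, so the last
    // logged _id should point at the doc being processed when indexing dies.
    log("map: processing " + doc._id);
    if (doc.engine) {   // 'engine' is a stand-in for the real field
      emit(doc.engine, 1);
    }
  }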
Thanks again for the help (and reading this long message!).

Glenn

On Sun, Oct 4, 2009 at 10:19 AM, Adam Kocoloski wrote:

> Hi Glenn, I saw something like this once, but unfortunately I wasn't able
> to resolve it. Can you try the following?
>
> 1) Start couchdb with the -i option to get an interactive shell
>
> 2) At the prompt, try to open the view file and read the header
>
> {ok, Fd} =
> couch_file:open("/vol/couchdb/var/lib/couchdb/.searchlight_production_design/5f190ffb2ed589746e8796d2423869ac.view").
> couch_file:read_header(Fd).
>
> In my case, that call timed out, and the stacktrace during normal operation
> was exactly the one you showed earlier in this thread.
>
> Best,
>
> Adam
>
> On Oct 4, 2009, at 2:07 AM, Glenn Rempe wrote:
>
>> Thanks for the reply Paul. Some comments below.
>>
>> Also, just for full disclosure, the CouchDB I am working on was moved out
>> of another couchdb and it was originally created using CDB 0.9.1. I show a
>> dir listing below that indicates exactly what was moved.
>>
>> On Sat, Oct 3, 2009 at 6:46 PM, Paul Joseph Davis <
>> paul.joseph.davis@gmail.com> wrote:
>>
>>> Glenn,
>>>
>>> This sounds like your map function is timing out which causes the error.
>>> You could try upping the os process timeout setting in the config.
>>>
>> When I go into futon and select one of my views in my design doc it *always*
>> consistently pops up a javascript alert with the error text at ~5 seconds
>> after selecting the view. It doesn't seem to matter what else I do. It
>> also didn't vary when I changed the os_process_timeout value in futon as
>> you suggested from 5000 to 25000. Can you explain exactly what this
>> particular param is doing? I assume the value is milliseconds?
>>
>>> To see what's going on you can increase to debug logging or use the log
>>> function in your maps. There's also the status page in futon which I
>>> think you said you were looking at.
>>>
>> Yes, I was previously looking at the status page. But now since I've
>> upgraded to trunk I never see any indexing activity happening in the
>> status page no matter what I do.
>>
>>> If indexing crashes it should just pick up where it left off when you
>>> retrigger. Use the status page to verify. If it's not then let us know.
>>>
>> Can you clarify, is this also the case when no index has ever successfully
>> run? I was wondering if I first need to get through at least one index
>> session (maybe with a smaller amount of records) prior to incremental
>> indexing working as expected.
>>
>> Is there any way to determine what percentage of the total records have
>> been added to the index?
>>
>> For your info, here are the contents of the DB dir. You can see the main
>> DB is 42GB now (~17 million records).
>>
>> root@ip-10-250-55-239:/home/rails/underscore-sync-mysql-to-couchdb/shared/log#
>> ls -la /vol/couchdb/var/lib/couchdb
>> total 41674956
>> drwxr-xr-x 3 couchdb root             78 2009-10-04 00:46 .
>> drwxr-xr-x 3 couchdb root             20 2009-10-03 05:02 ..
>> -rw-r--r-- 1 couchdb couchdb 42675073133 2009-10-04 02:13 searchlight_production.couch
>> drwxr-xr-x 2 couchdb couchdb         120 2009-10-03 06:02 .searchlight_production_design
>>
>> root@ip-10-250-55-239:/home/rails/underscore-sync-mysql-to-couchdb/shared/log#
>> ls -la /vol/couchdb/var/lib/couchdb/.searchlight_production_design/
>> total 33700196
>> drwxr-xr-x 2 couchdb couchdb         120 2009-10-03 06:02 .
>> drwxr-xr-x 3 couchdb root             78 2009-10-04 00:46 ..
>> -rw-r--r-- 1 couchdb couchdb  9819347287 2009-10-03 08:04 5f190ffb2ed589746e8796d2423869ac.view
>> -rw-r--r-- 1 couchdb couchdb    91402872 2009-10-03 06:03 b127a58306fb8e7858cd1a92f8398511.view
>> -rw-r--r-- 1 couchdb couchdb 24598236884 2009-10-02 13:00 SearchDocument.view
>>
>>> If you can't find anything in the debug logs then ping the list and we'll
>>> get into trying to duplicate.
>>>
>> I have turned on the 'debug' level in the logs and that provided me with
>> the info I previously provided. I'll try to use the log function in the map
>> and see if that shows anything.
>>
>> Thanks for helping. If it comes to it, I may be able to make a snapshot of
>> this EBS volume and start a host that you could log in to and get your
>> hands directly on it if that would be helpful.
>>
>> Glenn

--
Glenn Rempe

email   : glenn@rempe.us
voice   : (415) 894-5366 or (415)-89G-LENN
twitter : @grempe
contact info : http://www.rempe.us/contact.html
pgp     : http://www.rempe.us/gnupg.txt