incubator-couchdb-user mailing list archives

From Robert Newson <rnew...@apache.org>
Subject Re: BigCouch - Replication failing with Cannot Allocate memory
Date Mon, 16 Apr 2012 14:35:25 GMT
Bigcouch is built as an erlang release, so it includes all the bits of
erlang needed to run. As part of the packaging, I also packaged
spidermonkey, which should have been pulled in automatically.
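
(If you want to double-check what the package bundled, a hedged sketch
assuming the yum/rpm install:

  # list the installed package, then look for the bundled runtime bits
  rpm -q bigcouch
  rpm -ql bigcouch | grep -E 'erts-|couchjs'

erts-* being the bundled Erlang runtime and couchjs the
spidermonkey-linked view server binary.)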

B.

On 16 April 2012 15:32, Mike Kimber <mkimber@kana.com> wrote:
> I used the instructions on http://bigcouch.cloudant.com/use for RHEL/CentOS, so I used yum to install, which installed bigcouch-0.4.0-1.
>
> I did not install Erlang and spidermonkey, as the above seemed to do it for me (I hope, or I'm going to look very stupid and it would be a miracle it's running at all!)
>
> Mike
>
> -----Original Message-----
> From: Robert Newson [mailto:rnewson@apache.org]
> Sent: 14 April 2012 14:35
> To: user@couchdb.apache.org
> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>
> Mike,
>
> Thanks for the logs, they do look clean, as you said.
>
> It was remiss of me not to ask for version numbers. Can you tell me
> which bigcouch, erlang, and spidermonkey versions you have here?
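>
> (A hedged way to check at least the bigcouch side: hit the clustered
> port's welcome message, e.g.
>
>   curl http://localhost:5984/
>
> which, if memory serves, reports the couchdb version and a "bigcouch"
> field alongside it.)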
>
> B.
>
> On 13 April 2012 21:18, Mike Kimber <mkimber@kana.com> wrote:
>> A clean log file (i.e. stop bigcouch, delete log file, restart bigcouch, run replication, wait for failure, stop bigcouch) from the node that failed this time around can be found at:
>>
>> http://pastebin.com/embed_js.php?i=s52rYwwy
>>
>> Mike
>>
>> -----Original Message-----
>> From: Robert Newson [mailto:rnewson@apache.org]
>> Sent: 13 April 2012 19:28
>> To: user@couchdb.apache.org
>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>
>> Mike,
>>
>> Do you have couch.logs from around that time?
>>
>> B.
>>
>> On 13 April 2012 17:54, Mike Kimber <mkimber@kana.com> wrote:
>>> Sorry, forgot to say that I have already upped it to N=3 and still get
>>> the same issue.
>>>
>>> I ran it again with 6GB of RAM on each of the servers, ran vmstat, and
>>> got the following:
>>>
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>  3  0      0 2067468  31816 302204    0    0     0     5 1820  360 63 32  5  0  0
>>>  2  0      0 2457728  31816 302212    0    0     0     2 2188  322 70 25  4  0  0
>>>  2  0      0 1936092  31816 302212    0    0     0     0 3020  200 73 24  3  0  0
>>>  2  0      0  687428  31816 302212    0    0     0     1 1958  368 56 42  2  0  0
>>>  2  0      0 2128192  31824 302212    0    0     0     2 2779  243 64 29  7  0  0
>>>  1  0      0 1829848  31824 302216    0    0     0     0 1734  280 68 29  3  0  0
>>>  1  0      0 1200300  31832 302216    0    0     0     8 1841  231 43 13 44  0  0
>>>  2  0      0 1638752  31840 302208    0    0     0     5 2625  350 71 20  8  0  0
>>>  3  0      0 1670856  31848 302216    0    0     0     3 2150  492 40 21 39  0  0
>>>  2  0      0 1020848  31848 302216    0    0     0     0 2307  644 67 22 11  0  0
>>>  1  0      0  271640  31848 302216    0    0     0     6 1995  280 54 42  4  0  0
>>>  1  0      0  455408  31848 302216    0    0     0     1 1879  238 64 33  3  0  0
>>>  2  0      0 1240616  25584 193044    0    0     0     2 2408  232 59 34  8  0  0
>>>  2  0      0  611280  25592 193036    0    0     0     3 2286  246 72 25  2  0  0
>>>  2  0      0  679548  25592 193044    0    0     0     2 3038  175 78 21  2  0  0
>>>  2  0      0  786360  25600 193044    0    0     0     3 1679  269 74 23  3  0  0
>>>  2  0      0  568632  25600 193044    0    0     0     0 2796  274 74 24  2  0  0
>>> eheap_alloc: Cannot allocate 1824525600 bytes of memory (of type "heap").
>>>  0  0      0 5749480  25600 193044    0    0     0     0 1389  160 33 15 52  0  0
>>>  0  0      0 5749956  25608 193044    0    0     0    10 1007   82  0  0 100  0  0
>>>  0  0      0 5749988  25616 193036    0    0     0     3 1016   85  0  0 100  0  0
>>>  0  0      0 5750020  25616 193044    0    0     0     0  998   79  0  0 100  0  0
>>>  0  0      0 5750168  25620 193040    0    0     0     1 1007   87  0  0 100  0  0
>>>  0  0      0 5750308  25620 193044    0    0     0     0 1008   82  0  0 100  0  0
>>>
>>> I really need to work out what each process is doing with respect to
>>> memory at the time of failure. I had top running, but not on the node
>>> that failed this time around; sod's law :-)
>>>
>>> Mike
>>>
>>> -----Original Message-----
>>> From: Robert Newson [mailto:rnewson@apache.org]
>>> Sent: 13 April 2012 17:31
>>> To: user@couchdb.apache.org
>>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>
>>> I should note that bigcouch is tested much more often with N=3.
>>> Perhaps there's something about N=1 that exacerbates the issue. For a
>>> test, could you try with N=3?
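>>>
>>> (n can be set per database at creation time; a hedged sketch, with
>>> host and database name as placeholders:
>>>
>>>   curl -X PUT 'http://localhost:5984/testdb?n=3&q=9'
>>>
>>> then replicate into testdb as before.)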
>>>
>>> B.
>>>
>>> On 13 April 2012 16:24, Mike Kimber <mkimber@kana.com> wrote:
>>>> "1. Try to replicate the database in another CouchDB."
>>>>
>>>> I have done this to a couchdb 1.2 database successfully. FYI the source DB is a couchdb 1.1.1.
>>>>
>>>> I haven't done the other tests, but have tested replicating from the couchdb 1.2 database to the bigcouch install and got the same issue.
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: CGS [mailto:cgsmcmlxxv@gmail.com]
>>>> Sent: 13 April 2012 15:01
>>>> To: user@couchdb.apache.org
>>>> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>
>>>> If you say so, Robert, I won't argue with you on that. I meant no offense,
>>>> so, please, accept my apologies if I crossed the line. It's all yours from
>>>> now on.
>>>>
>>>> Mike, please ignore my suggestion. Sorry for interfering.
>>>>
>>>> Good luck!
>>>>
>>>> CGS
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 13, 2012 at 3:19 PM, Robert Newson <rnewson@apache.org> wrote:
>>>>
>>>>> I think you should point out that "My idea behind these tests is that
>>>>> it may be that your database may be corrupted (or seen as corrupted by
>>>>> BigCouch at the second test) and what you get is just garbage at a
>>>>> certain document." is based on no evidence. Nor, if it were true, would
>>>>> it necessarily explain the observed behavior.
>>>>>
>>>>> It would be useful if we could all stick to asserting only things we
>>>>> know to be true or have reasonable grounds to believe are true.
>>>>> Unfounded speculation, though offered sincerely, is not helpful on a
>>>>> mailing list intended to provide assistance.
>>>>>
>>>>> Thanks,
>>>>> B.
>>>>>
>>>>> On 13 April 2012 13:55, CGS <cgsmcmlxxv@gmail.com> wrote:
>>>>> > Hi Mike,
>>>>> >
>>>>> > I haven't used BigCouch yet, which is why I haven't said anything until
>>>>> > now. Still, having given some thought to what may be happening there, I
>>>>> > propose a few tests if you have time:
>>>>> > 1. Try to replicate the database in another CouchDB.
>>>>> > 2. If 1 passes, try to replicate to only one node at a time (see the
>>>>> > sketch after this list).
>>>>> > 3. If 2 passes, increase the pool of nodes by 1 and repeat the
>>>>> > replication (presumably it will fail once all 3 nodes are involved).
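>>>>> >
>>>>> > (For test 2, a hedged sketch of pointing the replication at a single
>>>>> > node; hostnames and database name are placeholders:
>>>>> >
>>>>> >   curl -X POST http://node1:5984/_replicate \
>>>>> >        -H 'Content-Type: application/json' \
>>>>> >        -d '{"source":"http://sourcecouch:5984/mydb","target":"mydb"}'
>>>>> >
>>>>> > run against each node in turn.)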
>>>>> >
>>>>> > My idea behind these tests is that your database may be corrupted (or
>>>>> > seen as corrupted by BigCouch in the second test) and what you get is
>>>>> > just garbage at a certain document; that's why I proposed the first
>>>>> > test. The second test is to see if any of the nodes has a problem in
>>>>> > its configuration (or if there is any incompatibility between your
>>>>> > CouchDB and BigCouch in manipulating your docs). Finally, the third
>>>>> > test is to see if server/node resources limit the number of
>>>>> > replications (and at how many it starts to fail).
>>>>> >
>>>>> > Can you also check the size of the shards at tests 2 and 3?
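>>>>> >
>>>>> > (A hedged sketch for that check, assuming the database_dir from your
>>>>> > local.ini and BigCouch's usual shards/ layout:
>>>>> >
>>>>> >   du -sh /other/bigcouch/database/shards/*
>>>>> >
>>>>> > run on each node.)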
>>>>> >
>>>>> > If you consider these tests irrelevant, please ignore my suggestion.
>>>>> >
>>>>> > CGS
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Fri, Apr 13, 2012 at 1:27 PM, Mike Kimber <mkimber@kana.com> wrote:
>>>>> >
>>>>> >> I upped the memory to 6GB on each of the nodes and got exactly the
>>>>> >> same issue in the same time frame, i.e. the increased RAM did not
>>>>> >> seem to buy me any additional time.
>>>>> >>
>>>>> >> Mike
>>>>> >>
>>>>> >> -----Original Message-----
>>>>> >> From: Robert Newson [mailto:rnewson@apache.org]
>>>>> >> Sent: 12 April 2012 19:34
>>>>> >> To: user@couchdb.apache.org
>>>>> >> Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>> >>
>>>>> >> 2GB total ram does sound tight. I can only compare to high volume
>>>>> >> production clusters which have much more ram than this. Given that
>>>>> >> beam.smp wanted 1.4 gb and you have 2gb total, do you know where the
>>>>> >> rest went? To couchjs processes, by chance? If so, you can reduce the
>>>>> >> maximum size of that pool in config; I think the default is 50.
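>>>>> >>
>>>>> >> If memory serves, the knob is os_process_limit; a hedged sketch of
>>>>> >> the local.ini override, with 25 as an arbitrary example value:
>>>>> >>
>>>>> >>   [query_server_config]
>>>>> >>   os_process_limit = 25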
>>>>> >>
>>>>> >> On 12 April 2012 18:32, Mike Kimber <mkimber@kana.com> wrote:
>>>>> >> > Ok, I have 3 nodes all load balanced with HAproxy:
>>>>> >> >
>>>>> >> > Centos 5.8 (Virtualised)
>>>>> >> > 2 Cores
>>>>> >> > 2GB RAM
>>>>> >> >
>>>>> >> > I'm trying to replicate about 75K documents which total 6GB when
>>>>> >> > compacted (on Couchdb 1.2, which has compression turned on). I'm
>>>>> >> > told they are fairly large documents.
>>>>> >> >
>>>>> >> > When it goes pear shaped, vmstat shows a lot of memory being used:
>>>>> >> >
>>>>> >> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>>>>> >> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>>> >> >  1  2 570576   8808    140   7208 2998 2249  3154  2249 1234  569  1  6  2 91  0
>>>>> >> >  0  2 569656   9156    156   7504 2330 1899  2405  1904 1246  595  1  5  9 85  0
>>>>> >> >  1  1 575412   9516    236  14928 1549 2261  3242  2261 1237  593  1  7  1 91  0
>>>>> >> >  0  2 607092  13220    168   8156 3772 9012  3871  9017 1284  714  1 10  4 85  0
>>>>> >> >  1  0 444336 857004    220  10212 5781    0  6202     0 1574 1010 13  7 33 47  0
>>>>> >> >  1  0 442176 870684    428  11052 2049    0  2208   140 2561 1541 17  8 49 26  0
>>>>> >> >  0  0 442176 813140    460  11968  170    0   348     0 2672 1565 25  9 61  4  0
>>>>> >> >  0  1 442176 744972    484  12224 5440    0  5493     7 2432  900  8  4 49 40  0
>>>>> >> >  0  1 442176 714048    484  12296 4547    0  4547     0 1799  827  4  2 50 44  0
>>>>> >> >  0  1 442176 686304    496  12688 5128    0  5222     0 1696  999  9  2 50 40  0
>>>>> >> >  0  3 444000   8712    444  12876  299  368   331   380 1294  188 22 20 36 23  0
>>>>> >> >  0  3 469340  10040    116   7336   29 5087    74  5090 1232  268  3 22  0 75  0
>>>>> >> >  1  2 584356  10220    124   6744 11367 28722 11370 28722 1643 1300  5 19 17 59  0
>>>>> >> >  0  1 624908  10640    132   7036 6518 12879  6590 12884 1296  717  3 10 29 58  0
>>>>> >> >  0  2 652556  10948    252  14776 3799 9494  5459  9494 1294  646  2  9 32 57  0
>>>>> >> >  0  2 677784  10648    244  14528 3819 8196  3819  8201 1274  588  2  7 30 61  0
>>>>> >> >  0  2 688460   9512    212   8224 3013 4522  3125  4522 1379  519  2  7  6 84  0
>>>>> >> >  0  3 699164   9888    208   8468 2192 4014  2228  4014 1302  495  1  6 11 83  0
>>>>> >> >  2  0 713104   9004    144   9192 2606 4490  2848  4490 1350  487  1  8 16 75  0
>>>>> >> >
>>>>> >> > It only ever takes out one node at a time, and the other nodes seem
>>>>> >> > to be doing very little while the one node is running out of memory.
>>>>> >> >
>>>>> >> > If I kick it off again it processes some more, then spikes the
>>>>> >> > memory and fails.
>>>>> >> >
>>>>> >> > Thanks
>>>>> >> >
>>>>> >> > Mike
>>>>> >> >
>>>>> >> > PS: hope you enjoyed your couchdb get together!
>>>>> >> >
>>>>> >> > -----Original Message-----
>>>>> >> > From: Robert Newson [mailto:rnewson@apache.org]
>>>>> >> > Sent: 12 April 2012 17:28
>>>>> >> > To: user@couchdb.apache.org
>>>>> >> > Subject: Re: BigCouch - Replication failing with Cannot Allocate memory
>>>>> >> >
>>>>> >> > What kind of load were you putting on the machine?
>>>>> >> >
>>>>> >> > On 12 April 2012 17:24, Robert Newson <rnewson@apache.org> wrote:
>>>>> >> >> Could you show your vm.args file?
>>>>> >> >>
>>>>> >> >> On 12 April 2012 17:23, Robert Newson <rnewson@apache.org> wrote:
>>>>> >> >>> Unfortunately your request for help coincided with the two day
>>>>> >> >>> CouchDB Summit. #cloudant and the Issues tab on cloudant/bigcouch
>>>>> >> >>> are other ways to get bigcouch support, but we happily answer
>>>>> >> >>> queries here too, when not at the Model UN of CouchDB. :D
>>>>> >> >>>
>>>>> >> >>> B.
>>>>> >> >>>
>>>>> >> >>> On 12 April 2012 17:10, Mike Kimber <mkimber@kana.com> wrote:
>>>>> >> >>>> Looks like this isn't the right place based on the responses so
>>>>> >> >>>> far. Shame, I hoped this was going to help solve our index/view
>>>>> >> >>>> rebuild times etc.
>>>>> >> >>>>
>>>>> >> >>>> Mike
>>>>> >> >>>>
>>>>> >> >>>> -----Original Message-----
>>>>> >> >>>> From: Mike Kimber [mailto:mkimber@kana.com]
>>>>> >> >>>> Sent: 10 April 2012 09:20
>>>>> >> >>>> To: user@couchdb.apache.org
>>>>> >> >>>> Subject: BigCouch - Replication failing with Cannot Allocate memory
>>>>> >> >>>>
>>>>> >> >>>> I'm not sure if this is the correct place to raise an issue I am
>>>>> >> >>>> having with replicating a standalone couchdb 1.1.1 to a 3 node
>>>>> >> >>>> BigCouch cluster. If this is not the correct place, please point
>>>>> >> >>>> me in the right direction; if it is, does anyone have any ideas
>>>>> >> >>>> why I keep getting the following error message when I kick off a
>>>>> >> >>>> replication:
>>>>> >> >>>>
>>>>> >> >>>> eheap_alloc: Cannot allocate 1459620480 bytes of memory (of type "heap").
>>>>> >> >>>>
>>>>> >> >>>> My set-up is:
>>>>> >> >>>>
>>>>> >> >>>> Standalone couchdb 1.1.1 running on Centos 5.7
>>>>> >> >>>>
>>>>> >> >>>> 3 Node BigCouch cluster running on Centos 5.8 with the following
>>>>> >> >>>> local.ini overrides, pulling from the standalone couchdb (78K
>>>>> >> >>>> documents)
>>>>> >> >>>>
>>>>> >> >>>> [httpd]
>>>>> >> >>>> bind_address = XXX.XX.X.XX
>>>>> >> >>>>
>>>>> >> >>>> [cluster]
>>>>> >> >>>> ; number of shards for a new database
>>>>> >> >>>> q = 9
>>>>> >> >>>> ; number of copies of each shard
>>>>> >> >>>> n = 1
>>>>> >> >>>>
>>>>> >> >>>> [couchdb]
>>>>> >> >>>> database_dir = /other/bigcouch/database
>>>>> >> >>>> view_index_dir = /other/bigcouch/view
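>>>>> >> >>>>
>>>>> >> >>>> For reference, a replication like this one is kicked off with a
>>>>> >> >>>> POST to _replicate along these lines (hostnames and database
>>>>> >> >>>> name are placeholders, not my real ones):
>>>>> >> >>>>
>>>>> >> >>>>   curl -X POST http://bigcouch-node:5984/_replicate \
>>>>> >> >>>>        -H 'Content-Type: application/json' \
>>>>> >> >>>>        -d '{"source":"http://couch111-host:5984/mydb","target":"mydb","create_target":true}'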
>>>>> >> >>>>
>>>>> >> >>>> The error is always generated on the third node in the cluster,
>>>>> >> >>>> and the server basically maxes out on memory beforehand. The
>>>>> >> >>>> other nodes seem to be doing very little, but are getting data,
>>>>> >> >>>> i.e. the shard sizes are growing. I've put the copies per shard
>>>>> >> >>>> down to 1 as currently I'm not interested in resilience.
>>>>> >> >>>>
>>>>> >> >>>> Any help would be greatly appreciated.
>>>>> >> >>>>
>>>>> >> >>>> Mike
>>>>> >> >>>>
>>>>> >>
>>>>>
