Monitor the repair using nodetool compactionstats to see the Merkle trees
being created, and nodetool netstats to see the data streaming.

Also look in the logs for messages from AntiEntropyService.java; they will
tell you how long the node waited for each replica to get back to it.
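
For example (the log path below is the packaged default, adjust it for
your install):

nodetool compactionstats    # "Validation" tasks here are the merkle tree builds
nodetool netstats           # shows the streaming sessions between replicas
grep AntiEntropyService /var/log/cassandra/system.log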

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 4/04/2013, at 5:42 PM, Ondřej Černoš <cernoso@gmail.com> wrote:

Hi,

most of this has been resolved - the FAILED_TO_UNCOMPRESS error was really
a bug in Cassandra (see
https://issues.apache.org/jira/browse/CASSANDRA-5391), and the difference
in load reporting is a change between 1.2.1 (which reports 100% for the
3 replicas / 3 nodes / 2 DCs setup I have) and 1.2.3, which reports the
fraction. Is this correct?

Anyway, nodetool repair still takes ages to finish, considering only a few
megabytes of unchanging data are involved in my test:

[root@host:/etc/puppet] nodetool repair ks
[2013-04-04 13:26:46,618] Starting repair command #1, repairing 1536 ranges for keyspace ks
[2013-04-04 13:47:17,007] Repair session 88ebc700-9d1a-11e2-a0a1-05b94e1385c7 for range (-2270395505556181001,-2268004533044804266] finished
...
[2013-04-04 13:47:17,063] Repair session 65d31180-9d1d-11e2-a0a1-05b94e1385c7 for range (1069254279177813908,1070290707448386360] finished
[2013-04-04 13:47:17,063] Repair command #1 finished

This is the status before the repair (taken, by the way, after the
datacenter had been bootstrapped from the remote one):

[root@host:/etc/puppet] nodetool status
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx  5.74 MB  256     17.1%  06ff8328-32a3-4196-a31f-1e0f608d0638  1d
UN  xxx.xxx.xxx.xxx  5.73 MB  256     15.3%  7a96bf16-e268-433a-9912-a0cf1668184e  1d
UN  xxx.xxx.xxx.xxx  5.72 MB  256     17.5%  67a68a2a-12a8-459d-9d18-221426646e84  1d
Datacenter: na-dev
==================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx  5.74 MB  256     16.4%  eb86aaae-ef0d-40aa-9b74-2b9704c77c0a  cmp02
UN  xxx.xxx.xxx.xxx  5.74 MB  256     17.0%  cd24af74-7f6a-4eaa-814f-62474b4e4df1  cmp01
UN  xxx.xxx.xxx.xxx  5.74 MB  256     16.7%  1a55cfd4-bb30-4250-b868-a9ae13d81ae1  cmp05

Why does it take 20 minutes to finish? Fortunately, the large number of
compactions I reported in the previous email was not triggered this time.

And is there documentation where I could find the exact semantics of
repair when vnodes are used (and what -pr means in such a setup) and when
it is run in a multi-datacenter setup? I still don't quite get it.
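
(To be concrete, by -pr I mean the usual pattern of running the
primary-range repair on every node in turn, something like:

nodetool repair -pr ks    # run on each node, one node at a time

but I am not sure how this maps to 256 vnodes per node and two DCs.)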

regards,
Ondřej Černoš


On Thu, Mar 28, 2013 at 3:30 AM, aaron morton <aaron@thelastpickle.com> wrote:
During one of my tests - see this thread in this mailing list:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/java-io-IOException-FAILED-TO-UNCOMPRESS-5-exception-when-running-nodetool-rebuild-td7586494.html

That thread has been updated; check the bug Ondrej created.

How will this perform in production with much bigger data volumes, if
repair takes 25 minutes on 7 MB and 11k compactions were triggered by the
repair run?

Seems a little odd.
See what happens the next time you run repair.
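
A rough way to compare, for example, is to capture the SSTable counts
before and after the next run (the exact cfstats labels vary a little
between versions):

nodetool cfstats | grep -E 'Keyspace:|Column Family:|SSTable count:' > before.txt
nodetool repair
nodetool cfstats | grep -E 'Keyspace:|Column Family:|SSTable count:' > after.txt
diff before.txt after.txt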

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/03/2013, at 2:36 AM, Ondřej Černoš <cernoso@gmail.com> wrote:

Hi all,

I have 2 DCs with 3 nodes each, RF 3, and I use LOCAL_QUORUM for both
reads and writes.
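
A minimal sketch of what I mean (the keyspace name ks is just illustrative
here; LOCAL_QUORUM itself is set per request on the client side):

cqlsh> CREATE KEYSPACE ks WITH replication =
   ...   {'class': 'NetworkTopologyStrategy', 'us-east': 3, 'na-prod': 3};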

I am currently testing various operational qualities of the setup.

During one of my tests - see this thread in this mailing list:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/java-io-IOException-FAILED-TO-UNCOMPRESS-5-exception-when-running-nodetool-rebuild-td7586494.html
- I ran into this situation:

- all nodes have all data and agree on it:

[user@host1-dc1:~] nodetool status

Datacenter: na-prod
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns (effective)  Host ID                               Rack
UN  XXX.XXX.XXX.XXX  7.74 MB  256     100.0%            0b1f1d79-52af-4d1b-a86d-bf4b65a05c49  cmp17
UN  XXX.XXX.XXX.XXX  7.74 MB  256     100.0%            039f206e-da22-44b5-83bd-2513f96ddeac  cmp10
UN  XXX.XXX.XXX.XXX  7.72 MB  256     100.0%            007097e9-17e6-43f7-8dfc-37b082a784c4  cmp11
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns (effective)  Host ID                               Rack
UN  XXX.XXX.XXX.XXX  7.73 MB  256     100.0%            a336efae-8d9c-4562-8e2a-b766b479ecb4  1d
UN  XXX.XXX.XXX.XXX  7.73 MB  256     100.0%            ab1bbf0a-8ddc-4a12-a925-b119bd2de98e  1d
UN  XXX.XXX.XXX.XXX  7.73 MB  256     100.0%            f53fd294-16cc-497e-9613-347f07ac3850  1d

- only one node disagrees:

[user@host1-dc2:~] nodetool status
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns   Host ID                               Rack
UN  XXX.XXX.XXX.XXX  7.73 MB  256     17.6%  a336efae-8d9c-4562-8e2a-b766b479ecb4  1d
UN  XXX.XXX.XXX.XXX  7.75 MB  256     16.4%  ab1bbf0a-8ddc-4a12-a925-b119bd2de98e  1d
UN  XXX.XXX.XXX.XXX  7.73 MB  256     15.7%  f53fd294-16cc-497e-9613-347f07ac3850  1d
Datacenter: na-prod
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load     Tokens  Owns   Host ID                               Rack
UN  XXX.XXX.XXX.XXX  7.74 MB  256     16.9%  0b1f1d79-52af-4d1b-a86d-bf4b65a05c49  cmp17
UN  XXX.XXX.XXX.XXX  7.72 MB  256     17.1%  007097e9-17e6-43f7-8dfc-37b082a784c4  cmp11
UN  XXX.XXX.XXX.XXX  7.73 MB  256     16.3%  039f206e-da22-44b5-83bd-2513f96ddeac  cmp10

I tried to rebuild the node from scratch and to repair it, with no
results; it still shows the same Owns stats.

The cluster is built from Cassandra 1.2.3 and uses vnodes.


On a related note: the data size, as you can see, is really small. The
cluster was created by setting up the us-east datacenter, populating it
with the dataset, then building the na-prod datacenter and running
nodetool rebuild us-east. When I tried to run nodetool repair, it took 25
minutes to finish on this small dataset. Is this OK?
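
Concretely, on each node of the new na-prod datacenter this was:

nodetool rebuild us-east    # pull the data from the existing DC
nodetool repair             # the run that took the 25 minutes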

One other thing I noticed is the number of compactions on the system
keyspace:

/.../system/schema_columnfamilies/system-schema_columnfamilies-ib-11694-TOC.txt
/.../system/schema_columnfamilies/system-schema_columnfamilies-ib-11693-Statistics.db

This is just after running the repair. Is this OK, considering the
dataset is 7 MB and no operations were running against the database
during the repair, neither reads nor writes, nothing?

How will this perform in production with much bigger data volumes, if
repair takes 25 minutes on 7 MB and 11k compactions were triggered by the
repair run?

regards,

Ondrej Cernos