cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Keith Wright <kwri...@nanigans.com>
Subject Re: Hot, large row
Date Thu, 24 Jul 2014 20:50:12 GMT
When a node is showing the high CMS issue, io is actually low likely due to the fact that none
is going on during CMS GC.  On a node not showing the issue, iostat shows disk usage around
50% (these are SSD) and load hovers around 10 for a dual octo core machine this is fine.

In addition, nodetool compactionstats does not show that we are falling behind in compactions.

So not sure what is going on here.  We are running CentOS 6.5 with java 1.7.0_51.  It does
seem like things are getting worse and I’m considering dropping and rebuilding all the tables
(as I have the data in Hadoop).  This seems to be a repeated problem for us with Cassandra
and now that Aerospike has an open source version, we are very much considering switching.

Thanks again for the help and any insight you might have!


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

          23.43   12.40   11.40    6.20    0.00   46.57


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await
 svctm  %util

sda               0.00     0.00  720.50    0.00     5.51     0.00    15.66     1.05    1.54
  0.57  41.35

sdc               0.00     0.00 5930.50 2100.50    55.42     8.21    16.23     2.30    0.29
  0.06  51.25

sdb               0.00     0.00 6951.50 2052.50    65.82     8.02    16.80     4.31    0.48
  0.07  59.60


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           9.48   14.72    5.60    3.67    0.00   66.52


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await
 svctm  %util

sda               0.00     0.00  327.00   19.50     2.55     0.08    15.53     0.04    0.13
  0.12   4.25

sdc               0.00     0.00 3974.50 1403.50    36.37     5.48    15.94     0.99    0.18
  0.08  45.10

sdb               0.00     0.00 4357.50 1535.00    40.83     6.00    16.28     1.10    0.19
  0.08  47.45

From: DuyHai Doan <doanduyhai@gmail.com<mailto:doanduyhai@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 4:44 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Hot, large row

For global_user_event_skus_v2

1. number of SSTables per read is quite huge. Considering you're using LCS, it means that
LCS cannot keep up with write rate and is left behind. AFAIK LCS is using SizeTieredCompaction
at L0 to cope with extreme write burst. Your high number of SSTables per read is quite fishy
here.

2. Write latency is widespead up to 73.457 millisecs, meaning that your node is getting behind
for write in some cases. Most of write are still below 1 millisec but we don't care. What
we care here is the large trail of write latency climbing up to 73 millisecs

3. Same remarks for read latency which is worst because the distribution is even "flatter",
worst cases going up to 100 ms.

If I were you, I'll check for disk I/O first and maybe CPU usage


On Thu, Jul 24, 2014 at 10:32 PM, Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
wrote:
Cfhistograms for the tables I believe are most likely the issue are below on the node that
most recently presented the issue.  Any ideas?  Note that these tables are LCS and have droppable
tombstone ratios of 27% for global_user_event_skus_v2 and 2.7% for the other.  Table definitions
also below, you can see that they do use compound keys.

Thanks


CREATE TABLE global_user_event_skus_v2 (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  sku text,

  creationtime timestamp,

  PRIMARY KEY (user_id, app_id, type, name, sku)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=2592000 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};



users/global_user_event_skus_v2 histograms


SSTables per Read

 1 sstables: 1398122

 2 sstables: 231100

 3 sstables: 97334

 4 sstables: 56496

 5 sstables: 43046

 6 sstables: 35471

 7 sstables: 26888

 8 sstables: 21102

10 sstables: 29522

12 sstables: 18314

14 sstables: 11368

17 sstables: 10283

20 sstables: 5766

24 sstables: 3554

29 sstables: 1681

35 sstables: 521

42 sstables: 102

50 sstables: 1


Write Latency (microseconds)

     2 us: 532

     3 us: 10728

     4 us: 49608

     5 us: 173718

     6 us: 404131

     7 us: 697468

     8 us: 981048

    10 us: 2643680

    12 us: 3183727

    14 us: 3150881

    17 us: 3746333

    20 us: 2298481

    24 us: 1647604

    29 us: 1070733

    35 us: 693300

    42 us: 442120

    50 us: 288618

    60 us: 205061

    72 us: 141194

    86 us: 98781

   103 us: 73629

   124 us: 54900

   149 us: 40518

   179 us: 30330

   215 us: 23953

   258 us: 18875

   310 us: 14147

   372 us: 10509

   446 us: 7984

   535 us: 6257

   642 us: 4903

   770 us: 4086

   924 us: 3252

  1109 us: 2676

  1331 us: 2129

  1597 us: 1753

  1916 us: 1488

  2299 us: 1237

  2759 us: 1078

  3311 us: 866

  3973 us: 704

  4768 us: 551

  5722 us: 553

  6866 us: 774

  8239 us: 443

  9887 us: 303

 11864 us: 222

 14237 us: 181

 17084 us: 146

 20501 us: 147

 24601 us: 499

 29521 us: 842

 35425 us: 548

 42510 us: 303

 51012 us: 560

 61214 us: 842

 73457 us: 600

 88148 us: 283

105778 us: 173

126934 us: 81

152321 us: 18

182785 us: 2


Read Latency (microseconds)

     35 us: 2

     42 us: 72

     50 us: 737

     60 us: 3555

     72 us: 6926

     86 us: 8373

    103 us: 15287

    124 us: 30054

    149 us: 46164

    179 us: 67828

    215 us: 86741

    258 us: 92178

    310 us: 82205

    372 us: 69641

    446 us: 64710

    535 us: 58295

    642 us: 55247

    770 us: 57918

    924 us: 56928

   1109 us: 75762

   1331 us: 122306

   1597 us: 141564

   1916 us: 117301

   2299 us: 102293

   2759 us: 98789

   3311 us: 86923

   3973 us: 75304

   4768 us: 64012

   5722 us: 54973

   6866 us: 47221

   8239 us: 41483

   9887 us: 35015

  11864 us: 30249

  14237 us: 23505

  17084 us: 18364

  20501 us: 12992

  24601 us: 8837

  29521 us: 6497

  35425 us: 4840

  42510 us: 3536

  51012 us: 2471

  61214 us: 2784

  73457 us: 3664

  88148 us: 3219

 105778 us: 1980

 126934 us: 1050

 152321 us: 519

 182785 us: 184

 219342 us: 82

 263210 us: 39

 315852 us: 22

 379022 us: 16

 454826 us: 0

 545791 us: 0

 654949 us: 1

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 3

1955666 us: 3

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 4


Partition Size (bytes)

     35 bytes: 31

     42 bytes: 0

     50 bytes: 0

     60 bytes: 0

     72 bytes: 0

     86 bytes: 0

    103 bytes: 761108

    124 bytes: 23596164

    149 bytes: 22963275

    179 bytes: 3517971

    215 bytes: 6395335

    258 bytes: 6197982

    310 bytes: 4662589

    372 bytes: 5261934

    446 bytes: 3032032

    535 bytes: 3775310

    642 bytes: 2879935

    770 bytes: 3002743

    924 bytes: 2279900

   1109 bytes: 1976357

   1331 bytes: 1643875

   1597 bytes: 1422199

   1916 bytes: 1142746

   2299 bytes: 901584

   2759 bytes: 760173

   3311 bytes: 596874

   3973 bytes: 480502

   4768 bytes: 380185

   5722 bytes: 303315

   6866 bytes: 239013

   8239 bytes: 187337

   9887 bytes: 147218

  11864 bytes: 115684

  14237 bytes: 89718

  17084 bytes: 69160

  20501 bytes: 53070

  24601 bytes: 40496

  29521 bytes: 30344

  35425 bytes: 22656

  42510 bytes: 16419

  51012 bytes: 11621

  61214 bytes: 8020

  73457 bytes: 5561

  88148 bytes: 3708

 105778 bytes: 2434

 126934 bytes: 1561

 152321 bytes: 1017

 182785 bytes: 559

 219342 bytes: 349

 263210 bytes: 181

 315852 bytes: 121

 379022 bytes: 66

 454826 bytes: 22

 545791 bytes: 14

 654949 bytes: 10

 785939 bytes: 2

 943127 bytes: 0

1131752 bytes: 4

1358102 bytes: 2

1629722 bytes: 1

1955666 bytes: 2

2346799 bytes: 0

2816159 bytes: 0

3379391 bytes: 0

4055269 bytes: 0

4866323 bytes: 1


Cell Count per Partition

    1 cells: 50599458

    2 cells: 13975363

    3 cells: 7927898

    4 cells: 4981142

    5 cells: 3081166

    6 cells: 3034895

    7 cells: 1896720

    8 cells: 1508849

   10 cells: 2418845

   12 cells: 1631289

   14 cells: 1179269

   17 cells: 1301647

   20 cells: 919744

   24 cells: 864313

   29 cells: 735203

   35 cells: 597749

   42 cells: 469611

   50 cells: 367422

   60 cells: 312697

   72 cells: 251782

   86 cells: 197717

  103 cells: 161244

  124 cells: 132074

  149 cells: 103711

  179 cells: 81760

  215 cells: 63360

  258 cells: 48982

  310 cells: 37932

  372 cells: 28328

  446 cells: 21347

  535 cells: 15373

  642 cells: 10959

  770 cells: 7643

  924 cells: 5346

 1109 cells: 3613

 1331 cells: 2291

 1597 cells: 1558

 1916 cells: 885

 2299 cells: 577

 2759 cells: 327

 3311 cells: 169

 3973 cells: 116

 4768 cells: 57

 5722 cells: 23

 6866 cells: 15

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 1


CREATE TABLE global_user_events (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  extra_param map<text, text>,

  paid boolean,

  value float,

  PRIMARY KEY (user_id, app_id, type, name)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=0 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};


users/global_user_events histograms


SSTables per Read

 1 sstables: 1024446

 2 sstables: 632948

 3 sstables: 230768

 4 sstables: 112428

 5 sstables: 82651

 6 sstables: 61439

 7 sstables: 47518

 8 sstables: 38490

10 sstables: 58054

12 sstables: 41128

14 sstables: 29642

17 sstables: 31056

20 sstables: 20474

24 sstables: 17799

29 sstables: 13744

35 sstables: 9485

42 sstables: 5933

50 sstables: 3607

60 sstables: 1618

72 sstables: 263

86 sstables: 10


Write Latency (microseconds)

     10 us: 10

     12 us: 75

     14 us: 319

     17 us: 3550

     20 us: 20617

     24 us: 105254

     29 us: 311290

     35 us: 494404

     42 us: 516301

     50 us: 534061

     60 us: 421438

     72 us: 314377

     86 us: 375240

    103 us: 333402

    124 us: 816891

    149 us: 667832

    179 us: 430659

    215 us: 336063

    258 us: 380606

    310 us: 235592

    372 us: 68679

    446 us: 14988

    535 us: 6169

    642 us: 3763

    770 us: 2591

    924 us: 2270

   1109 us: 1985

   1331 us: 1744

   1597 us: 1564

   1916 us: 1489

   2299 us: 1326

   2759 us: 1114

   3311 us: 1024

   3973 us: 984

   4768 us: 833

   5722 us: 962

   6866 us: 1076

   8239 us: 891

   9887 us: 611

  11864 us: 489

  14237 us: 324

  17084 us: 306

  20501 us: 222

  24601 us: 426

  29521 us: 791

  35425 us: 820

  42510 us: 534

  51012 us: 530

  61214 us: 946

  73457 us: 1122

  88148 us: 714

 105778 us: 313

 126934 us: 127

 152321 us: 38

 182785 us: 3

 219342 us: 2

 263210 us: 0

 315852 us: 0

 379022 us: 0

 454826 us: 2

 545791 us: 0

 654949 us: 3

 785939 us: 0

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 0

1955666 us: 0

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 1


Read Latency (microseconds)

     42 us: 1

     50 us: 2

     60 us: 2

     72 us: 4

     86 us: 15

    103 us: 23

    124 us: 29

    149 us: 48

    179 us: 67

    215 us: 80

    258 us: 316

    310 us: 1812

    372 us: 4999

    446 us: 8876

    535 us: 14586

    642 us: 24417

    770 us: 31943

    924 us: 37453

   1109 us: 77449

   1331 us: 139135

   1597 us: 177538

   1916 us: 169568

   2299 us: 177617

   2759 us: 202552

   3311 us: 198101

   3973 us: 179807

   4768 us: 159149

   5722 us: 142282

   6866 us: 126124

   8239 us: 111529

   9887 us: 97658

  11864 us: 84913

  14237 us: 70905

  17084 us: 54917

  20501 us: 40623

  24601 us: 29545

  29521 us: 21448

  35425 us: 15255

  42510 us: 11222

  51012 us: 8150

  61214 us: 7473

  73457 us: 10060

  88148 us: 10530

 105778 us: 7573

 126934 us: 4331

 152321 us: 2097

 182785 us: 836

 219342 us: 255

 263210 us: 84

 315852 us: 28

 379022 us: 8

 454826 us: 2

 545791 us: 0

 654949 us: 2

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 2

1955666 us: 6

2346799 us: 18

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 34


Partition Size (bytes)

    35 bytes: 25

    42 bytes: 0

    50 bytes: 0

    60 bytes: 0

    72 bytes: 21761

    86 bytes: 694466658

   103 bytes: 219007195

   124 bytes: 7034596

   149 bytes: 110269079

   179 bytes: 101276811

   215 bytes: 38518027

   258 bytes: 103242356

   310 bytes: 51434914

   372 bytes: 38048245

   446 bytes: 42558831

   535 bytes: 32869176

   642 bytes: 17188026

   770 bytes: 14750559

   924 bytes: 15963256

  1109 bytes: 16125324

  1331 bytes: 11462272

  1597 bytes: 6810452

  1916 bytes: 5685739

  2299 bytes: 10232324

  2759 bytes: 5138522

  3311 bytes: 5847998

  3973 bytes: 4065752

  4768 bytes: 3468263

  5722 bytes: 2545431

  6866 bytes: 1701913

  8239 bytes: 993808

  9887 bytes: 495282

 11864 bytes: 208644

 14237 bytes: 74307

 17084 bytes: 22286

 20501 bytes: 5474

 24601 bytes: 1225

 29521 bytes: 210

 35425 bytes: 97

 42510 bytes: 48

 51012 bytes: 21

 61214 bytes: 15

 73457 bytes: 10

 88148 bytes: 3

105778 bytes: 4

126934 bytes: 1

152321 bytes: 2

182785 bytes: 1


Cell Count per Partition

   1 cells: 914323128

   2 cells: 196020828

   3 cells: 165502164

   4 cells: 71735648

   5 cells: 54450466

   6 cells: 28995756

   7 cells: 9625976

   8 cells: 10257788

  10 cells: 24004757

  12 cells: 13087147

  14 cells: 13588347

  17 cells: 11976353

  20 cells: 6245374

  24 cells: 5627821

  29 cells: 9480533

  35 cells: 6008276

  42 cells: 5221357

  50 cells: 4355454

  60 cells: 3955488

  72 cells: 2872578

  86 cells: 1914447

 103 cells: 1216276

 124 cells: 652035

 149 cells: 279565

 179 cells: 99836

 215 cells: 29132

 258 cells: 6682

 310 cells: 1276

 372 cells: 258

 446 cells: 96

 535 cells: 48

 642 cells: 25

 770 cells: 9

 924 cells: 9

1109 cells: 4

1331 cells: 3

1597 cells: 1

1916 cells: 2

From: Jack Krupansky <jack@basetechnology.com<mailto:jack@basetechnology.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 4:07 PM

To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Hot, large row

Could it be some “fat columns” (cells with large blob or text values) rather than the
cell-count per se? IOW, a “big row” rather than a “wide row”?

And, could it be a large partition rather than a large row (many rows in a single partition)?
Are clustering columns being used in the primary key?

-- Jack Krupansky

From: DuyHai Doan<mailto:doanduyhai@gmail.com>
Sent: Thursday, July 24, 2014 3:53 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: Re: Hot, large row

Your extract of cfhistograms show that there are no particular "wide rows". The widest has
61214 cells which is big but not that huge to be really a concern.

Turning on trace probabilty only tells give you some "hints" about what kind of queries are
done, it does not give the exact partition key nor other statement values, especially when
you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces
the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet.
Find the real issue and fix it instead of changing parameter to hide it.

One solution would be on client side, to activate some logging to show the CQL3 statements
the application is issuing that may overload the server.  I know that's better said than done
but I don't have any other idea for the moment

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation that I add a new dynamic query
logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
wrote:
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability
as you suggested but the output doesn’t appear to tell me what row was actually read unless
I missed something.  I just see executing prepared statement.   Any ideas how I can find the
row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces
the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023<tel:3103023>

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

1109 cells: 3229

1331 cells: 2062

1597 cells: 1338

1916 cells: 773

2299 cells: 495

2759 cells: 268

3311 cells: 150

3973 cells: 100

4768 cells: 42

5722 cells: 24

6866 cells: 12

8239 cells: 9

9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2


From: DuyHai Doan <doanduyhai@gmail.com<mailto:doanduyhai@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this column with very high
value of "Offset". In a well designed application you should have a gaussian distribution
where 80% of your row have a similar number of columns.

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?
 " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the
node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate
it (value = 0). Then look into system_traces tables "events" & "sessions". It may help
or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning
in CMS heap pressure.  It appears that reads are backing up and my guess is that our application
is reading a large row repeatedly.  Our write structure can lead itself to wide rows very
infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re
missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?
 I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks




Mime
View raw message