kudu-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jdcry...@apache.org
Subject [1/2] kudu git commit: KUDU-1775 (part 3): enforce max cell size and max PK size
Date Thu, 15 Dec 2016 17:34:27 GMT
Repository: kudu
Updated Branches:
  refs/heads/branch-1.2.x 89dd76268 -> e3308b59e


KUDU-1775 (part 3): enforce max cell size and max PK size

This adds limits on the size of any individual cell as well as a limit
on the maximum size of an encoded key.

* Large cells are known to cause problems such as KUDU-1524 (crash
  during flush).
* Large PKs have been seen to cause problems where the key column's
  cfile footer increases beyond the hard-coded maximum size of 64KB.

If a user attempts to insert a non-conforming row, or update a cell to
be larger than the maximum, they will receive an InvalidArgument error
for that row.

This patch takes the approach of doing the validation at the tablet
layer rather than somewhere higher up the stack. The reasoning is that
we don't want to reject an entire batch of operations due to one bad
row, and we are not well equipped to set per-row errors anywhere except
at the tablet layer.

Because this is an incompatible change (it's possible that users have
been successfully using larger cells) the limits are added as
configurable flags, but tagged as unsafe. A new test verifies that, if
the user has insertions in the WAL that were previously allowed but now
are disallowed, the bootstrap fails without any crashes. It also tests
that by raising the flag to a higher value, bootstraps will succeed.
Making this test pass involved a small fix in bootstrap to properly
handle the case where an Apply failed on restart the previously
succeeded.

Change-Id: Ib6b94cffd9c3efbe80a4b31e9272b376a4e41d15
Reviewed-on: http://gerrit.cloudera.org:8080/5475
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <adar@cloudera.com>
(cherry picked from commit 0259586af3611862645ffcbccca9e4f928e1095b)
Reviewed-on: http://gerrit.cloudera.org:8080/5513
Reviewed-by: Jean-Daniel Cryans <jdcryans@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/65ddfc50
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/65ddfc50
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/65ddfc50

Branch: refs/heads/branch-1.2.x
Commit: 65ddfc50892819f4f2b93d6e255ac9a99ddb1909
Parents: 89dd762
Author: Todd Lipcon <todd@apache.org>
Authored: Mon Dec 12 20:58:15 2016 +0700
Committer: Jean-Daniel Cryans <jdcryans@apache.org>
Committed: Thu Dec 15 17:16:41 2016 +0000

----------------------------------------------------------------------
 docs/schema_design.adoc                         |  22 ++--
 src/kudu/client/client-test.cc                  |  44 ++++++++
 .../integration-tests/raft_consensus-itest.cc   |  32 ++++--
 src/kudu/integration-tests/test_workload.cc     |   2 +-
 src/kudu/integration-tests/ts_recovery-itest.cc |  54 ++++++++++
 src/kudu/tablet/tablet.cc                       | 104 ++++++++++++++++---
 src/kudu/tablet/tablet.h                        |   8 ++
 src/kudu/tablet/tablet_bootstrap.cc             |  26 +++--
 src/kudu/tserver/tablet_server-test.cc          |  25 ++++-
 9 files changed, 276 insertions(+), 41 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/docs/schema_design.adoc
----------------------------------------------------------------------
diff --git a/docs/schema_design.adoc b/docs/schema_design.adoc
index 8b82af5..7c991f3 100644
--- a/docs/schema_design.adoc
+++ b/docs/schema_design.adoc
@@ -73,8 +73,8 @@ column types include:
 * unixtime_micros (64-bit microseconds since the Unix epoch)
 * single-precision (32-bit) IEEE-754 floating-point number
 * double-precision (64-bit) IEEE-754 floating-point number
-* UTF-8 encoded string
-* binary
+* UTF-8 encoded string (up to 64KB)
+* binary (up to 64KB)
 
 Kudu takes advantage of strongly-typed columns and a columnar on-disk storage
 format to provide efficient encoding and serialization. To make the most of
@@ -447,14 +447,17 @@ designing your schema, consider these limitations together, not in isolation.
If
 test these limitations and your findings are different from these, please share your
 test cases and results.
 
-Number of Columns:: Kudu has not been thoroughly tested with more than 200 columns
-and we recommend schemas with fewer than 50 columns per table.
+Number of Columns:: By default, Kudu will not permit the creation of tables with
+more than 300 columns. We recommend schema designs that use fewer columns for best
+performance.
 
-Size of Rows:: Kudu has not been thoroughly tested with rows larger than 10 kb. Most
-testing has been on rows at 1 kb.
+Size of Cells:: No individual cell may be larger than 64KB. The cells making up a
+a composite key are limited to a total of 16KB after the internal composite-key encoding
+done by Kudu. Inserting rows not conforming to these limitations will result in errors
+being returned to the client.
 
-Size of Cells:: There is no hard limit imposed by Kudu, however large cells may
-push the entire row over the recommended size.
+Size of Rows:: Although individual cells may be up to 64KB, and Kudu supports up to
+300 columns, it is recommended that no single row be larger than a few hundred KB.
 
 Immutable Primary Keys:: Kudu does not allow you to update the primary key
 columns of a row.
@@ -463,7 +466,8 @@ Non-alterable Primary Key:: Kudu does not allow you to alter the primary
key
 columns after table creation.
 
 Non-alterable Partitioning:: Kudu does not allow you to change how a table is
-partitioned after creation.
+partitioned after creation, with the exception of adding or dropping range
+partitions.
 
 Non-alterable Column Types:: Kudu does not allow the type of a column to be
 altered.

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/client/client-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/client/client-test.cc b/src/kudu/client/client-test.cc
index e467164..a9e6f80 100644
--- a/src/kudu/client/client-test.cc
+++ b/src/kudu/client/client-test.cc
@@ -3906,6 +3906,50 @@ TEST_F(ClientTest, TestCreateTableWithTooLongColumnName) {
                      "longer than maximum permitted length 256");
 }
 
+// Test trying to insert a row with an encoded key that is too large.
+TEST_F(ClientTest, TestInsertTooLongEncodedPrimaryKey) {
+  const string kLongValue(10000, 'x');
+  const char* kTableName = "too-long-pk";
+
+  // Create and open a table with a three-column composite PK.
+  unique_ptr<KuduTableCreator> table_creator(client_->NewTableCreator());
+  KuduSchema schema;
+  KuduSchemaBuilder schema_builder;
+  schema_builder.AddColumn("k1")->Type(KuduColumnSchema::STRING)->NotNull();
+  schema_builder.AddColumn("k2")->Type(KuduColumnSchema::STRING)->NotNull();
+  schema_builder.AddColumn("k3")->Type(KuduColumnSchema::STRING)->NotNull();
+  schema_builder.SetPrimaryKey({"k1", "k2", "k3"});
+  ASSERT_OK(schema_builder.Build(&schema));
+  ASSERT_OK(table_creator->table_name(kTableName)
+            .schema(&schema)
+            .num_replicas(1)
+            .set_range_partition_columns({ "k1" })
+            .Create());
+  shared_ptr<KuduTable> table;
+  ASSERT_OK(client_->OpenTable(kTableName, &table));
+
+  // Create a session and insert a row with a too-large composite key.
+  shared_ptr<KuduSession> session(client_->NewSession());
+  ASSERT_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));
+  gscoped_ptr<KuduInsert> insert(table->NewInsert());
+  for (int i = 0; i < 3; i++) {
+    ASSERT_OK(insert->mutable_row()->SetStringCopy(i, kLongValue));
+  }
+  ASSERT_OK(session->Apply(insert.release()));
+  Status s = session->Flush();
+  ASSERT_FALSE(s.ok()) << s.ToString();
+
+  // Check the error.
+  vector<KuduError*> errors;
+  ElementDeleter drop(&errors);
+  bool overflowed;
+  session->GetPendingErrors(&errors, &overflowed);
+  ASSERT_EQ(1, errors.size());
+  EXPECT_EQ("Invalid argument: encoded primary key too large "
+            "(30004 bytes, maximum is 65536 bytes)",
+            errors[0]->status().ToString());
+}
+
 
 // Check the behavior of the latest observed timestamp when performing
 // write and read operations.

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/integration-tests/raft_consensus-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/raft_consensus-itest.cc b/src/kudu/integration-tests/raft_consensus-itest.cc
index cf83164..bccedbb 100644
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -648,7 +648,9 @@ TEST_F(RaftConsensusITest, TestCatchupAfterOpsEvicted) {
     // Run the maintenance manager frequently so that we don't have to wait
     // long for GC.
     "--maintenance_manager_polling_interval_ms=100",
-    "--log_target_replay_size_mb=1"
+    "--log_target_replay_size_mb=1",
+    // We write 128KB cells in this test, so bump the limit.
+    "--max_cell_size_bytes=1000000"
   };
   BuildAndStart(extra_flags);
   TServerDetails* replica = (*tablet_replicas_.begin()).second;
@@ -787,8 +789,13 @@ void RaftConsensusITest::CauseFollowerToFallBehindLogGC(string* leader_uuid,
 //
 // This is a regression test for KUDU-775 and KUDU-562.
 TEST_F(RaftConsensusITest, TestFollowerFallsBehindLeaderGC) {
-  // Disable follower eviction to maintain the original intent of this test.
-  vector<string> extra_flags = { "--evict_failed_followers=false" };
+  vector<string> extra_flags = {
+    // Disable follower eviction to maintain the original intent of this test.
+    "--evict_failed_followers=false",
+    // We write 128KB cells in this test, so bump the limit.
+    "--max_cell_size_bytes=1000000"
+  };
+
   AddFlagsForLogRolls(&extra_flags); // For CauseFollowerToFallBehindLogGC().
   BuildAndStart(extra_flags);
 
@@ -2692,9 +2699,15 @@ TEST_F(RaftConsensusITest, TestHammerOneRow) {
 // Test that followers that fall behind the leader's log GC threshold are
 // evicted from the config.
 TEST_F(RaftConsensusITest, TestEvictAbandonedFollowers) {
-  vector<string> ts_flags;
+  vector<string> ts_flags = {
+    // We write 128KB cells in this test, so bump the limit.
+    "--max_cell_size_bytes=1000000"
+  };
   AddFlagsForLogRolls(&ts_flags); // For CauseFollowerToFallBehindLogGC().
-  vector<string> master_flags = { "--master_add_server_when_underreplicated=false"
};
+  vector<string> master_flags = {
+    "--master_add_server_when_underreplicated=false",
+  };
+
   NO_FATALS(BuildAndStart(ts_flags, master_flags));
 
   MonoDelta timeout = MonoDelta::FromSeconds(30);
@@ -2716,9 +2729,12 @@ TEST_F(RaftConsensusITest, TestEvictAbandonedFollowers) {
 // Test that, after followers are evicted from the config, the master re-adds a new
 // replica for that follower and it eventually catches back up.
 TEST_F(RaftConsensusITest, TestMasterReplacesEvictedFollowers) {
-  vector<string> extra_flags;
-  AddFlagsForLogRolls(&extra_flags); // For CauseFollowerToFallBehindLogGC().
-  BuildAndStart(extra_flags);
+  vector<string> ts_flags = {
+    // We write 128KB cells in this test, so bump the limit.
+    "--max_cell_size_bytes=1000000"
+  };
+  AddFlagsForLogRolls(&ts_flags); // For CauseFollowerToFallBehindLogGC().
+  BuildAndStart(ts_flags);
 
   MonoDelta timeout = MonoDelta::FromSeconds(30);
 

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/integration-tests/test_workload.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/test_workload.cc b/src/kudu/integration-tests/test_workload.cc
index b674b17..5716718 100644
--- a/src/kudu/integration-tests/test_workload.cc
+++ b/src/kudu/integration-tests/test_workload.cc
@@ -165,7 +165,7 @@ void TestWorkload::WriteThread() {
           continue;
         }
 
-        CHECK_OK(s);
+        LOG(FATAL) << e->status().ToString();
       }
       inserted -= errors.size();
     }

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/integration-tests/ts_recovery-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/ts_recovery-itest.cc b/src/kudu/integration-tests/ts_recovery-itest.cc
index 8ab57d1..151aa6b 100644
--- a/src/kudu/integration-tests/ts_recovery-itest.cc
+++ b/src/kudu/integration-tests/ts_recovery-itest.cc
@@ -16,6 +16,7 @@
 // under the License.
 
 #include "kudu/client/client.h"
+#include "kudu/integration-tests/cluster_itest_util.h"
 #include "kudu/integration-tests/cluster_verifier.h"
 #include "kudu/integration-tests/external_mini_cluster.h"
 #include "kudu/integration-tests/test_workload.h"
@@ -227,6 +228,59 @@ TEST_F(TsRecoveryITest, TestCrashBeforeWriteLogSegmentHeader) {
                                        MonoDelta::FromSeconds(60)));
 }
 
+// Test the following scenario:
+// - an insert is written to the WAL with a cell that is valid with the configuration
+//   at the time of write (eg because it's from a version of Kudu prior to cell size
+//   limits)
+// - the server is restarted with cell size limiting enabled (or lowered)
+// - the bootstrap should fail (but not crash) because it cannot apply the cell size
+// - if we manually increase the cell size limit again, it should replay correctly.
+TEST_F(TsRecoveryITest, TestChangeMaxCellSize) {
+  NO_FATALS(StartCluster({}));
+  TestWorkload work(cluster_.get());
+  work.set_num_replicas(1);
+  work.set_payload_bytes(10000);
+  work.Setup();
+  work.Start();
+  while (work.rows_inserted() < 100) {
+    SleepFor(MonoDelta::FromMilliseconds(100));
+  }
+  work.StopAndJoin();
+
+  // Restart the server with a lower value of max_cell_size_bytes.
+  auto* ts = cluster_->tablet_server(0);
+  ts->Shutdown();
+  ts->mutable_flags()->push_back("--max_cell_size_bytes=1000");
+  ASSERT_OK(ts->Restart());
+
+  // The bootstrap should fail and leave the tablet in FAILED state.
+  std::unordered_map<std::string, itest::TServerDetails*> ts_map;
+  ValueDeleter del(&ts_map);
+
+  ASSERT_OK(itest::CreateTabletServerMap(cluster_->master_proxy().get(),
+                                         cluster_->messenger(),
+                                         &ts_map));
+  AssertEventually([&]() {
+      vector<tserver::ListTabletsResponsePB::StatusAndSchemaPB> tablets;
+      ASSERT_OK(ListTablets(ts_map[ts->uuid()], MonoDelta::FromSeconds(10), &tablets));
+      ASSERT_EQ(1, tablets.size());
+      ASSERT_EQ(tablet::FAILED, tablets[0].tablet_status().state());
+      ASSERT_STR_CONTAINS(tablets[0].tablet_status().last_status(),
+                          "value too large for column");
+    });
+
+  // Restart the server again with the default max_cell_size.
+  // Bootstrap should succeed and all rows should be present.
+  ts->Shutdown();
+  ts->mutable_flags()->pop_back();
+  ASSERT_OK(ts->Restart());
+  ClusterVerifier v(cluster_.get());
+  NO_FATALS(v.CheckRowCountWithRetries(work.table_name(),
+                                       ClusterVerifier::EXACTLY,
+                                       work.rows_inserted(),
+                                       MonoDelta::FromSeconds(60)));
+}
+
 
 // A set of threads which pick rows which are known to exist in the table
 // and issue random updates against them.

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/tablet/tablet.cc
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/tablet.cc b/src/kudu/tablet/tablet.cc
index f45d075..dfb42a5 100644
--- a/src/kudu/tablet/tablet.cc
+++ b/src/kudu/tablet/tablet.cc
@@ -114,6 +114,23 @@ DEFINE_int32(tablet_history_max_age_sec, 15 * 60,
              "To disable history removal, set to -1.");
 TAG_FLAG(tablet_history_max_age_sec, advanced);
 
+DEFINE_int32(max_cell_size_bytes, 64 * 1024,
+             "The maximum size of any individual cell in a table. Attempting to store "
+             "string or binary columns with a size greater than this will result "
+             "in errors.");
+TAG_FLAG(max_cell_size_bytes, unsafe);
+
+// Large encoded keys cause problems because we store the min/max encoded key in the
+// CFile footer for the composite key column. The footer has a max length of 64K, so
+// the default here comfortably fits two of them with room for other metadata.
+DEFINE_int32(max_encoded_key_size_bytes, 16 * 1024,
+             "The maximum size of a row's encoded composite primary key. This length is "
+             "approximately the sum of the sizes of the component columns, though it can
"
+             "be larger in cases where the components contain embedded NULL bytes. "
+             "Attempting to insert a row with a larger encoded composite key will "
+             "result in an error.");
+TAG_FLAG(max_encoded_key_size_bytes, unsafe);
+
 METRIC_DEFINE_entity(tablet);
 METRIC_DEFINE_gauge_size(tablet, memrowset_size, "MemRowSet Memory Usage",
                          kudu::MetricUnit::kBytes,
@@ -393,9 +410,78 @@ void Tablet::StartTransaction(WriteTransactionState* tx_state) {
   tx_state->SetMvccTx(std::move(mvcc_tx));
 }
 
+Status Tablet::ValidateInsertOrUpsertUnlocked(const RowOp& op) const {
+  // Check that no individual cell is larger than the specified max.
+  ConstContiguousRow row(schema(), op.decoded_op.row_data);
+  for (int i = 0; i < schema()->num_columns(); i++) {
+    if (!BitmapTest(op.decoded_op.isset_bitmap, i)) continue;
+    const auto& col = schema()->column(i);
+    if (col.type_info()->physical_type() != BINARY) continue;
+    const auto& cell = row.cell(i);
+    if (cell.is_nullable() && cell.is_null()) continue;
+    Slice s;
+    memcpy(&s, cell.ptr(), sizeof(s));
+    if (PREDICT_FALSE(s.size() > FLAGS_max_cell_size_bytes)) {
+      return Status::InvalidArgument(Substitute(
+          "value too large for column '$0' ($1 bytes, maximum is $2 bytes)",
+          col.name(), s.size(), FLAGS_max_cell_size_bytes));
+    }
+  }
+  // Check that the encoded key is not longer than the maximum.
+  auto enc_key_size = op.key_probe->encoded_key_slice().size();
+  if (PREDICT_FALSE(enc_key_size > FLAGS_max_encoded_key_size_bytes)) {
+    return Status::InvalidArgument(Substitute(
+        "encoded primary key too large ($0 bytes, maximum is $1 bytes)",
+        enc_key_size, FLAGS_max_cell_size_bytes));
+  }
+  return Status::OK();
+}
+
+Status Tablet::ValidateMutateUnlocked(const RowOp& op) const {
+  RowChangeListDecoder rcl_decoder(op.decoded_op.changelist);
+  RETURN_NOT_OK(rcl_decoder.Init());
+  if (rcl_decoder.is_reinsert()) {
+    // REINSERT mutations are the byproduct of an INSERT on top of a ghost
+    // row, not something the user is allowed to specify on their own.
+    return Status::InvalidArgument("User may not specify REINSERT mutations");
+  }
+
+  if (rcl_decoder.is_delete()) {
+    // Don't validate the composite key length on delete. This is important to allow users
+    // to delete a row if a row with a too-large key was inserted on a previous version
+    // that had no limits.
+    return Status::OK();
+  }
+
+  // For updates, just check the new cell values themselves, and not the row key,
+  // following the same logic.
+  while (rcl_decoder.HasNext()) {
+    RowChangeListDecoder::DecodedUpdate cell_update;
+    RETURN_NOT_OK(rcl_decoder.DecodeNext(&cell_update));
+    if (cell_update.null) continue;
+    Slice s = cell_update.raw_value;
+    if (PREDICT_FALSE(s.size() > FLAGS_max_cell_size_bytes)) {
+      const auto& col = schema()->column_by_id(cell_update.col_id);
+      return Status::InvalidArgument(Substitute(
+          "value too large for column '$0' ($1 bytes, maximum is $2 bytes)",
+          col.name(), s.size(), FLAGS_max_cell_size_bytes));
+
+    }
+  }
+  return Status::OK();
+}
+
 Status Tablet::InsertOrUpsertUnlocked(WriteTransactionState *tx_state,
                                       RowOp* op,
                                       ProbeStats* stats) {
+
+  Status s = ValidateInsertOrUpsertUnlocked(*op);
+  if (PREDICT_FALSE(!s.ok())) {
+    // TODO(todd): add a metric tracking the number of invalid ops.
+    op->SetFailed(s);
+    return s;
+  }
+
   const bool is_upsert = op->decoded_op.type == RowOperationsPB::UPSERT;
   const TabletComponents* comps = DCHECK_NOTNULL(tx_state->tablet_components());
 
@@ -425,7 +511,7 @@ Status Tablet::InsertOrUpsertUnlocked(WriteTransactionState *tx_state,
 
   // Now try to op into memrowset. The memrowset itself will return
   // AlreadyPresent if it has already been oped there.
-  Status s = comps->memrowset->Insert(ts, row, tx_state->op_id());
+  s = comps->memrowset->Insert(ts, row, tx_state->op_id());
   if (s.ok()) {
     op->SetInsertSucceeded(comps->memrowset->mrs_id());
   } else {
@@ -532,23 +618,15 @@ vector<RowSet*> Tablet::FindRowSetsToCheck(RowOp* op,
 Status Tablet::MutateRowUnlocked(WriteTransactionState *tx_state,
                                  RowOp* mutate,
                                  ProbeStats* stats) {
-  gscoped_ptr<OperationResultPB> result(new OperationResultPB());
-
-  const TabletComponents* comps = DCHECK_NOTNULL(tx_state->tablet_components());
-
-  // Validate the update.
-  RowChangeListDecoder rcl_decoder(mutate->decoded_op.changelist);
-  Status s = rcl_decoder.Init();
-  if (rcl_decoder.is_reinsert()) {
-    // REINSERT mutations are the byproduct of an INSERT on top of a ghost
-    // row, not something the user is allowed to specify on their own.
-    s = Status::InvalidArgument("User may not specify REINSERT mutations");
-  }
+  // Validate the mutation.
+  Status s = ValidateMutateUnlocked(*mutate);
   if (!s.ok()) {
     mutate->SetFailed(s);
     return s;
   }
 
+  gscoped_ptr<OperationResultPB> result(new OperationResultPB());
+  const TabletComponents* comps = DCHECK_NOTNULL(tx_state->tablet_components());
   Timestamp ts = tx_state->timestamp();
 
   // First try to update in memrowset.

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/tablet/tablet.h
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/tablet.h b/src/kudu/tablet/tablet.h
index 994f98e..ea2bb42 100644
--- a/src/kudu/tablet/tablet.h
+++ b/src/kudu/tablet/tablet.h
@@ -367,6 +367,14 @@ class Tablet {
 
   Status FlushUnlocked();
 
+  // Validate the given insert/upsert operation. In particular, checks that the size
+  // of any cells is not too large given the configured maximum on the server, and
+  // that the encoded key is not too large.
+  Status ValidateInsertOrUpsertUnlocked(const RowOp& op) const;
+
+  // Validate the given update/delete operation. In particular, validates that no
+  // cell is being updated to an invalid (too large) value.
+  Status ValidateMutateUnlocked(const RowOp& op) const;
 
   // Perform an INSERT or UPSERT operation, assuming that the transaction is already in
   // prepared state. This state ensures that:

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/tablet/tablet_bootstrap.cc
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/tablet_bootstrap.cc b/src/kudu/tablet/tablet_bootstrap.cc
index c8cf788..80e163d 100644
--- a/src/kudu/tablet/tablet_bootstrap.cc
+++ b/src/kudu/tablet/tablet_bootstrap.cc
@@ -1248,6 +1248,7 @@ Status TabletBootstrap::PlayWriteRequest(ReplicateMsg* replicate_msg,
     result_tracker_->RecordCompletionAndRespond(replicate_msg->request_id(), response.get());
   }
 
+  Status play_status;
   bool all_already_flushed = std::all_of(already_flushed.begin(),
                                          already_flushed.end(),
                                          [](bool f) { return f; });
@@ -1259,20 +1260,33 @@ Status TabletBootstrap::PlayWriteRequest(ReplicateMsg* replicate_msg,
     }
   } else {
     if (write->has_row_operations()) {
-      // TODO: get rid of redundant params below - they can be gotten from the Request
-      RETURN_NOT_OK(PlayRowOperations(&tx_state,
+      // TODO(todd): get rid of redundant params below - they can be gotten from the Request
+      // Rather than RETURN_NOT_OK() here, we need to just save the status and do the
+      // RETURN_NOT_OK() down below the Commit() call below. Even though it seems wrong
+      // to commit the transaction when in fact it failed to apply, we would throw a CHECK
+      // failure if we attempted to 'Abort()' after entering the applying stage. Allowing
it to
+      // Commit isn't problematic because we don't expose the results anyway, and the bad
+      // Status returned below will cause us to fail the entire tablet bootstrap anyway.
+      play_status = PlayRowOperations(&tx_state,
                                       write->schema(),
                                       write->row_operations(),
                                       commit_msg.result(),
-                                      already_flushed));
+                                      already_flushed);
+    }
+
+    if (play_status.ok()) {
+      // Replace the original commit message's result with the new one from
+      // the replayed operation.
+      tx_state.ReleaseTxResultPB(commit->mutable_result());
     }
-    // Replace the original commit message's result with the new one from
-    // the replayed operation.
-    tx_state.ReleaseTxResultPB(commit->mutable_result());
   }
 
   tx_state.CommitOrAbort(Transaction::COMMITTED);
 
+  // If we failed to apply the operations, fail bootstrap before we write anything incorrect
+  // to the recovery log.
+  RETURN_NOT_OK(play_status);
+
   RETURN_NOT_OK(log_->Append(&commit_entry));
 
   return Status::OK();

http://git-wip-us.apache.org/repos/asf/kudu/blob/65ddfc50/src/kudu/tserver/tablet_server-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/tserver/tablet_server-test.cc b/src/kudu/tserver/tablet_server-test.cc
index f65334e..a82ed4b 100644
--- a/src/kudu/tserver/tablet_server-test.cc
+++ b/src/kudu/tserver/tablet_server-test.cc
@@ -324,8 +324,10 @@ TEST_F(TabletServerTest, TestInsert) {
   }
 
   // Send a batch with multiple rows, one of which is a duplicate of
-  // the above insert. This should generate one error into per_row_errors.
+  // the above insert, and one of which has a too-large value.
+  // This should generate two errors into per_row_errors.
   {
+    const string kTooLargeValue(100 * 1024, 'x');
     controller.Reset();
     RowOperationsPB* data = req.mutable_row_operations();
     data->Clear();
@@ -333,15 +335,23 @@ TEST_F(TabletServerTest, TestInsert) {
     AddTestRowToPB(RowOperationsPB::INSERT, schema_, 1, 1, "ceci n'est pas une dupe", data);
     AddTestRowToPB(RowOperationsPB::INSERT, schema_, 2, 1, "also not a dupe key", data);
     AddTestRowToPB(RowOperationsPB::INSERT, schema_, 1234, 1, "I am a duplicate key", data);
+    AddTestRowToPB(RowOperationsPB::INSERT, schema_, 3, 1, kTooLargeValue, data);
     SCOPED_TRACE(req.DebugString());
     ASSERT_OK(proxy_->Write(req, &resp, &controller));
     SCOPED_TRACE(resp.DebugString());
     ASSERT_FALSE(resp.has_error()) << resp.ShortDebugString();
-    ASSERT_EQ(1, resp.per_row_errors().size());
+    ASSERT_EQ(3, rows_inserted->value());  // This counter only counts successful inserts.
+    ASSERT_EQ(2, resp.per_row_errors().size());
+
+    // Check the duplicate key error.
     ASSERT_EQ(2, resp.per_row_errors().Get(0).row_index());
     Status s = StatusFromPB(resp.per_row_errors().Get(0).error());
     ASSERT_STR_CONTAINS(s.ToString(), "Already present");
-    ASSERT_EQ(3, rows_inserted->value());  // This counter only counts successful inserts.
+
+    // Check the value-too-large error.
+    ASSERT_EQ(3, resp.per_row_errors().Get(1).row_index());
+    s = StatusFromPB(resp.per_row_errors().Get(1).error());
+    ASSERT_STR_CONTAINS(s.ToString(), "Invalid argument");
   }
 
   // get the clock's current timestamp
@@ -598,6 +608,7 @@ TEST_F(TabletServerTest, TestInsertAndMutate) {
 
   // Do a mixed operation (some insert, update, and delete, some of which fail)
   {
+    const string kTooLargeValue(100 * 1024, 'x');
     WriteRequestPB req;
     WriteResponsePB resp;
     req.set_tablet_id(kTabletId);
@@ -612,16 +623,22 @@ TEST_F(TabletServerTest, TestInsertAndMutate) {
     AddTestKeyToPB(RowOperationsPB::DELETE, schema_, 5, ops);
     // op 3: Insert a new row 6 (succeeds)
     AddTestRowToPB(RowOperationsPB::INSERT, schema_, 6, 6, "new row 6", ops);
+    // op 4: update a row with a too-large value (fail)
+    AddTestRowToPB(RowOperationsPB::UPDATE, schema_, 4, 6, kTooLargeValue, ops);
 
     SCOPED_TRACE(req.DebugString());
     ASSERT_OK(proxy_->Write(req, &resp, &controller));
     SCOPED_TRACE(resp.DebugString());
     ASSERT_FALSE(resp.has_error())<< resp.ShortDebugString();
-    ASSERT_EQ(2, resp.per_row_errors().size());
+    ASSERT_EQ(3, resp.per_row_errors().size());
     EXPECT_EQ("row_index: 0 error { code: NOT_FOUND message: \"key not found\" }",
               resp.per_row_errors(0).ShortDebugString());
     EXPECT_EQ("row_index: 2 error { code: NOT_FOUND message: \"key not found\" }",
               resp.per_row_errors(1).ShortDebugString());
+    EXPECT_EQ("row_index: 4 error { code: INVALID_ARGUMENT message: "
+              "\"value too large for column \\'string_val\\' (102400 bytes, "
+              "maximum is 65536 bytes)\" }",
+              resp.per_row_errors(2).ShortDebugString());
     controller.Reset();
   }
 


Mime
View raw message