hbase-dev mailing list archives

From Aleksandr Shulman <al...@cloudera.com>
Subject Let's discuss Snapshots Feature Testing
Date Mon, 14 Jan 2013 18:32:02 GMT
Hi everyone,

I'd like to start a thread about Cloudera's testing efforts on the upcoming
snapshots feature. This is a new feature and it's important that we explain
our testing efforts and get the community's opinion on what we'd all like
to see tested. My hope is that from this discussion, we can get more ideas
about what needs to be tested and gain confidence in the testing we have in place.

Before I begin, I'd like to introduce myself. I'm Aleks Shulman. I'm a
software engineer at Cloudera, working primarily on HBase. Within HBase, I
am focusing on the quality side of things. What this means to me is a
conversation unto itself, but in brief, I will be writing tests and test
frameworks. I will also be an advocate for the user experience, with
particular focus on API compatibility and ease-of-use.

So let's discuss snapshots:
There are two main areas that should be tested, and they correspond nicely
to what can be done as unit tests and what is better left to a Jenkins job
or some other automation: unit testing and non-unit testing. We've been
working on this for a bit, so there is already some progress in these areas:

Unit testing - In progress or completed:

1. HBase Snapshots Repeatability and Idempotency Test:
This test class verifies proper behavior with regard to performing
restore/clone operations on tables that themselves were created as a clone
or restored from a snapshot. This is an interesting set of cases because of
the way snapshots work. They work by pointing to the original HFiles.
We can use these tests to verify correctness in the file system and test
closure under deletion of the original table.

2. HBase Snapshots HTable Descriptor Test
This test class verifies proper behavior with regard to changes to the
information about the table itself before and after snapshotting in the
'before' table and the 'after' table.

3. HBase Snapshots HFileLink Test
This test class inspects the correctness of the HFileLink files. It looks
into their permissions, their naming convention, and how they respond to
events. Events may include an HFile being deleted or moved.

4. HBase Snapshots Table Dimensions Test
This test class inspects operations on tables that are empty, have only one
row, have one or two CFs, etc. Basically, if there is an edge scenario in
what the table looks like, that may affect the way it is snapshotted or

5. HBase Snapshots Independence Test
This test should verify that all aspects of table independence are
guaranteed between the original table and the restored snapshot/clone.
This includes things like data mutations, compactions, splits, etc. It also
includes metadata changes.

6. HBase Snapshots Aborted or Failed Snapshot Cleanup
Verifies that no cruft is left over after an attempt to snapshot a table
fails or is aborted. We should be able to account for every file in the
file system before and after.

7. HBase Snapshots HFile Archive Test
This test task is to fill in any gaps in testing of archiving as it relates
to snapshots. The snapshots feature relies on the HFileArchiver/LogArchiver with
two new cleaners (SnapshotHFile/SnapshotLog Cleaners), so we'd need to go
through and find out what needs to be tested between them.

8. HBase Snapshots Export Test
This test should verify that export of a snapshot to another cluster works.
Implemented as: mvn clean test -PlocalTests
However, we need to add more tests around chmod, chown, and checksums.

9. HBase Snapshots Concurrent Snapshots Test
This test class will enforce proper behavior in situations where race
conditions can occur. For example, if one process attempts to restore a
table and another one tries to do so simultaneously, what happens? We need
to know how dangerous this could be and whether it is possible for data to
be lost.
Covered in HBASE-7536.
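
To make the race in item 9 concrete, here is a minimal sketch in Python. It
is not HBase code: FakeTable, RestoreInProgress, race_two_restores, and the
hold_until test hook are all hypothetical names, and it assumes, purely for
illustration, that the desired behavior is for a second simultaneous restore
to be rejected rather than interleaved with the first.

```python
import threading


class RestoreInProgress(Exception):
    """Raised when a restore is requested while another one is running."""


class FakeTable:
    """Hypothetical in-memory stand-in for a table; not an HBase class."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self._restore_lock = threading.Lock()

    def restore(self, snapshot_rows, hold_until=None):
        # Non-blocking acquire: a rival restore is rejected immediately
        # rather than interleaving its writes with the one in progress.
        if not self._restore_lock.acquire(blocking=False):
            raise RestoreInProgress()
        try:
            if hold_until is not None:
                # Test hook: keep the lock held until the rival has tried.
                hold_until.wait(timeout=5)
            self.rows = dict(snapshot_rows)
        finally:
            self._restore_lock.release()


def race_two_restores(table, snapshot_a, snapshot_b):
    """Start two restores at once and report what each one did."""
    outcomes = {}
    rival_rejected = threading.Event()

    def worker(name, rows):
        try:
            table.restore(rows, hold_until=rival_rejected)
            outcomes[name] = "restored"
        except RestoreInProgress:
            outcomes[name] = "rejected"
            rival_rejected.set()

    threads = [
        threading.Thread(target=worker, args=("a", snapshot_a)),
        threading.Thread(target=worker, args=("b", snapshot_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

A test built this way would assert that exactly one restore succeeded and
that the table contents match the winning snapshot in full, which is the
"no data lost" property in miniature.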

Unit testing - Lightly tested so far, or tests we are hoping to write soon:

1. HBase Snapshots File System Correctness Tests -

This test class verifies proper behavior with regards to what the file
system looks like. What the file system contains should be predictable
after certain events, both snapshot-specific and environment-specific.
For example, after a snapshot, we should expect there to be files in the
/hbase/.snapshot/ folder. Also, after a split occurs on the base table and
the underlying HFiles go through flux, we should be able to know beforehand
where files move. In particular, this is important to test after repeated
deletions and modifications. Also -- we want to make sure no cruft remains
after various operations occur.

2. HBase Snapshots (Re)Naming Test [Note: Renaming snapshots is not
supported yet!]

These tests should verify valid/invalid names for snapshots. In particular,
they should use the rename_snapshot command to attempt to rename a snapshot to
the name of a table that already exists, or of a snapshot that already exists
(or had existed but was deleted).
Things like special characters or semantically-meaningful characters are
important as well. Other things that need to be tested are what happens if
a snapshot is created, deleted, the underlying table is modified, and then
another snapshot is taken. The snapshot should contain the most recent data.

3. Snapshots logline test:
Verifies that the proper loglines are generated for events.
Manual testing for this might include making sure that spurious,
misleading, or unnecessary log lines are not present.

4. HBase Snapshots Aborted or Failed Clone or Restore

Verifies that no cruft is left over after an attempt to restore or clone a
snapshotted table fails or is aborted and that further snapshots can take
place. This may be tricky and could require writing some additional
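
The "no cruft left over" checks in this list (and in the cleanup test
earlier) come down to comparing a listing of the file system before and
after the failed operation. A rough sketch of that comparison in Python,
run against a local temporary directory rather than HDFS, might look like
the following. The helper names, the ".tmp" directory, and the two simulated
clone operations are hypothetical and only illustrate the idea.

```python
import os
import tempfile


def list_entries(root):
    """Return every file and directory under root, relative to root."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def leftover_entries(root, operation):
    """Run operation(root), tolerate its failure, and report the residue."""
    before = list_entries(root)
    try:
        operation(root)
    except Exception:
        pass  # the operation is expected to fail; only the residue matters
    after = list_entries(root)
    return after - before, before - after  # (added, removed)


def failing_clone_with_cleanup(root):
    """Simulated clone that fails but removes its working files."""
    workdir = os.path.join(root, ".tmp")
    marker = os.path.join(workdir, "clone-in-progress")
    os.makedirs(workdir)
    open(marker, "w").close()
    try:
        raise RuntimeError("simulated failure")
    finally:
        os.remove(marker)
        os.rmdir(workdir)


def failing_clone_leaky(root):
    """Simulated clone that fails and leaves its working files behind."""
    workdir = os.path.join(root, ".tmp")
    os.makedirs(workdir)
    open(os.path.join(workdir, "clone-in-progress"), "w").close()
    raise RuntimeError("simulated failure")


def demo():
    """Compare the residue of the clean and the leaky failure."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "table"))
        open(os.path.join(root, "table", "hfile1"), "w").close()
        clean = leftover_entries(root, failing_clone_with_cleanup)
        leaky = leftover_entries(root, failing_clone_leaky)
    return clean, leaky
```

In a real test the listing would come from the cluster's file system, and
the assertion would be that both the added and removed sets are empty.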

Non-unit testing:

This area of testing is less straightforward and more exploratory in
nature. It's open-ended but with some direction. Particularly, we want to
test a lot of "what if this happens when we do something related to
snapshots". By "this happens", I mean compactions, splits, processes dying,
master failing over to backup master, etc. By "something related to
snapshots", that could mean taking a snapshot, restoring a snapshot, or
cloning a snapshot, among other things. In addition, we can see what
happens as scaling factors (e.g., the number of regions, amount of data per
node, duration of test, and frequency of compactions/splits) increase.
Finally, we should benchmark the time it takes to take/restore/clone a
snapshot and see how it changes with scale factors.
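
One way to organize the "this happens" versus "something related to
snapshots" combinations is as a simple matrix of trials, each one timed.
The sketch below is in Python and only shows the bookkeeping: run_matrix and
the run_trial callback are hypothetical, and the labels are taken from the
examples above rather than from any real test harness.

```python
import itertools
import time

# Labels mirror the examples in the paragraph above; they are not API names.
DISRUPTIONS = ["compaction", "split", "process_death", "master_failover"]
OPERATIONS = ["take_snapshot", "restore_snapshot", "clone_snapshot"]


def run_matrix(run_trial, operations=OPERATIONS, disruptions=DISRUPTIONS):
    """Call run_trial(operation, disruption) for every pair and time it."""
    results = []
    for operation, disruption in itertools.product(operations, disruptions):
        start = time.perf_counter()
        ok = run_trial(operation, disruption)
        elapsed = time.perf_counter() - start
        results.append(
            {
                "operation": operation,
                "disruption": disruption,
                "ok": bool(ok),
                "seconds": elapsed,
            }
        )
    return results


def failures(results):
    """Return the (operation, disruption) pairs whose trial did not pass."""
    return [(r["operation"], r["disruption"]) for r in results if not r["ok"]]
```

Recording the elapsed time per trial also gives a starting point for the
benchmarking mentioned above: rerunning the same matrix at larger scale
factors shows how the timings change.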

We are testing some of these combinations internally. When we see something
go awry, we fix it and rerun the trial, with the expectation that the feature
becomes more stable and reliable.

Some of the things we have tried:
-Long running tests: Run repeated snapshots while verifying that all is

-Meanness tests:
1. Killing the master
2. Performing a compaction
3. Table enable/disable

Feel free to follow up with questions.

Best Regards,

Aleks Shulman
