hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-12782) ITBLL fails for me if generator does anything but 5M per maptask
Date Fri, 30 Jan 2015 20:45:35 GMT

     [ https://issues.apache.org/jira/browse/HBASE-12782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-12782:
    Attachment: 12782v2.txt

Looks like this fix helps a lot. I ran my rig and it passed (nine times out of ten it does not).
I then doubled the counts so we did 250M instead of 125M, and again it passed. Will run
some bigger tests over the weekend.

Here is the patch I'd like to apply. It has the fix, an obnoxious unit test to verify the
fix, and the tooling I used to find the issue. The patch is fat because it includes
a big data file of recovered.edits to replay in the unit test.

The patch changes ITBLL to add better logging with more data around missing rows. It also amends
the verify step in ITBLL to emit the missing key in binary along with the type of the missing data.
This output is then usable by a new search tool, which takes the missing rows from verify
and goes off to search WALs and oldWALs for them. This tool was good for figuring out where
the edits had gone missing (before or after the WAL).

The search tool emits a line each time it finds a key. This was useful for narrowing in on the WALs
that had the rows that were missing.
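
To make the search idea concrete, here is a minimal sketch in plain Java. It is not the code in the patch: the class name MissingRowSearch, the method search, and the in-memory map of WAL name to row keys are all hypothetical stand-ins for reading real WAL files. It only shows the matching step, reporting which WAL each missing key turns up in.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the search step; not the code in the patch. */
class MissingRowSearch {
    /**
     * Returns, per WAL name, the missing row keys found in that WAL.
     * walContents is an assumed in-memory model: WAL name -> row keys of its edits.
     */
    static Map<String, List<byte[]>> search(List<byte[]> missingRows,
                                            Map<String, List<byte[]>> walContents) {
        // byte[] compares by identity, so wrap keys in ByteBuffer to compare by content.
        Set<ByteBuffer> wanted = new HashSet<>();
        for (byte[] row : missingRows) {
            wanted.add(ByteBuffer.wrap(row));
        }
        Map<String, List<byte[]>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, List<byte[]>> wal : walContents.entrySet()) {
            for (byte[] row : wal.getValue()) {
                if (wanted.contains(ByteBuffer.wrap(row))) {
                    // Record the hit under the WAL it was found in.
                    hits.computeIfAbsent(wal.getKey(), k -> new ArrayList<>()).add(row);
                }
            }
        }
        return hits;
    }
}
```

A WAL that shows up in the result is one whose provenance is worth checking; a missing key that shows up in no WAL at all points at a loss before the WAL.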

I'd then take the name of the WAL that had the edits and then go look at its provenance. 
In this case, the WALs were opened just before a crash and no flush had happened.  The WALs
would then be split to produce recovered.edits.

The patch includes a means of having recovered.edits files moved to archive when done rather
than deleted (this is a change in HRegion). This was useful for checking whether the WAL split
had actually moved the missing edits from WAL to recovered.edits. It had in this case, so
the replay of edits was then suspect (of note, the recovered.edits files can be viewed with
the WALPrettyPrinter -- which also has some improvements courtesy of this patch).
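
The archive-rather-than-delete behaviour can be sketched roughly as below. This is an illustration only: it uses java.nio.file on a local filesystem rather than the Hadoop FileSystem that HRegion works against, and the class RecoveredEditsCleanup, the method finish, and the archive flag are hypothetical names, not what the patch adds.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical local-filesystem sketch; the real change lives in HRegion. */
class RecoveredEditsCleanup {
    /**
     * Disposes of a replayed edits file. With archive=true the file is moved
     * into archiveDir and the new path is returned; otherwise it is deleted
     * and null is returned.
     */
    static Path finish(Path editsFile, Path archiveDir, boolean archive) throws IOException {
        if (!archive) {
            Files.deleteIfExists(editsFile);
            return null;
        }
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(editsFile.getFileName());
        // Keep the file around so it can be inspected after the replay.
        return Files.move(editsFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Keeping the file is what makes it possible to confirm afterwards that the split did write the missing edits into recovered.edits.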

WALPlayer is used by the search tool in ITBLL. Added a filter method so I could use the WALPlayer
nearly directly when searching.

Made the logging of file removal (from archive or wherever) DEBUG level rather than TRACE.

Made a minor improvement to recovered edits replay: it now checks at the WALEdit level whether
the edit is for THIS region rather than doing the check per Cell. It will help some with the likes
of the recovered.edits files I was seeing in my cluster testing, where a single WALEdit had
hundreds of Cells in it.
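
A rough sketch of why hoisting the check helps, assuming each edit carries one region identifier that covers all of its cells. The types here (Edit, with a regionName and a list of cells as strings) are hypothetical simplifications, not the HBase WALEdit or Cell classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simplification of replay filtering; not the HBase classes. */
class ReplayFilterSketch {
    /** Stand-in for one WAL entry: a single region name and many cells. */
    static final class Edit {
        final String regionName;
        final List<String> cells;

        Edit(String regionName, List<String> cells) {
            this.regionName = regionName;
            this.cells = cells;
        }
    }

    /** Collects the cells to replay, testing the region once per edit. */
    static List<String> cellsToReplay(String thisRegion, List<Edit> edits) {
        List<String> out = new ArrayList<>();
        for (Edit edit : edits) {
            // One comparison rejects or accepts every cell in the edit,
            // instead of repeating the same comparison for each cell.
            if (!thisRegion.equals(edit.regionName)) {
                continue;
            }
            out.addAll(edit.cells);
        }
        return out;
    }
}
```

With hundreds of cells per edit, this turns hundreds of identical comparisons into one per edit.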

The actual fix in HRegion was a simple one-liner (see above).

> ITBLL fails for me if generator does anything but 5M per maptask
> ----------------------------------------------------------------
>                 Key: HBASE-12782
>                 URL: https://issues.apache.org/jira/browse/HBASE-12782
>             Project: HBase
>          Issue Type: Bug
>          Components: integration tests
>    Affects Versions: 1.0.0
>            Reporter: stack
>            Priority: Critical
>             Fix For: 1.0.1
>         Attachments: 12782.fix.txt, 12782.search.plus.archive.recovered.edits.txt, 12782.search.plus.txt, 12782.search.txt, 12782.unit.test.and.it.test.txt, 12782.unit.test.writing.txt, 12782v2.txt
>
> Anyone else seeing this?  If I do an ITBLL with generator doing 5M rows per maptask,
> all is good -- verify passes. I've been running 5 servers and had one slot per server.  So
> below works:
> HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase classpath`" ./hadoop/bin/hadoop \
>   --config ~/conf_hadoop org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey \
>   serverKilling Generator 5 5000000 g1.tmp
> or if I double the map tasks, it works:
> HADOOP_CLASSPATH="/home/stack/conf_hbase:`/home/stack/hbase/bin/hbase classpath`" ./hadoop/bin/hadoop \
>   --config ~/conf_hadoop org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey \
>   serverKilling Generator 10 5000000 g2.tmp
> ...but if I change the 5M to 50M or 25M, Verify fails.
> Looking into it.

This message was sent by Atlassian JIRA
