hbase-user mailing list archives

From Erdem Agaoglu <erdem.agao...@gmail.com>
Subject Re: Node failure causes weird META data?
Date Fri, 05 Nov 2010 18:45:44 GMT
Previous problems probably correlate. When the log split and replay fail
[because of the empty log], the master thinks the cluster is in a clean state
and assigns a new .META. on some regionserver. It does not know about the
previous assignment of .META. and thus does not split any logs for it. When
the regionserver opens the new .META., it looks at HDFS, finds some persisted
files for it, and since there is no recovered.edits it starts to serve.
Depending on the state of those files, two things can happen:

1. If the cluster is really new (like an hour old), all the .META. data will
still be in the previous RS's memstore, so there will be no accessible
regions = lost tables
2. If the cluster has been running for a while or was gracefully restarted
before, the .META. data from the previous run will have been persisted, so
the regions from that previous run will be accessible but not the newly
created or re-assigned regions = unfinishing scans or corrupt data
*. No matter what, since the master thinks the .META. is new, it will not
know about or unassign the original regions = double assignment

So if we are right about those assumptions, it seems it's the same problem.
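
For reference, a minimal sketch of counting the rows in .META. and printing
each region's assignment from the client side -- which is how a silently
shrunken .META. shows up -- assuming the 0.90-era Java client API; the table
and column names are the standard catalog ones, and error handling is omitted:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class MetaRowCount {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable meta = new HTable(conf, ".META.");   // the catalog table
      ResultScanner scanner = meta.getScanner(new Scan());
      int rows = 0;
      try {
        for (Result r : scanner) {
          // info:server holds the regionserver currently carrying the region
          byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
          System.out.println(Bytes.toString(r.getRow()) + " -> "
              + (server == null ? "(unassigned)" : Bytes.toString(server)));
          rows++;
        }
      } finally {
        scanner.close();
      }
      System.out.println("rows in .META.: " + rows);
    }
  }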

We've learned a lot about HBase this week, and we owe it to your pointers,
thanks. I am grateful to have the chance to contribute to the community in
any way, at last :).

On Fri, Nov 5, 2010 at 7:43 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Very interesting finding Erdem! I tried it on a local empty instance
> and I set the roll period to 6 secs, here's what I saw:
>
> org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 6000ms elapsed
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
> org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977933829, entries=1, filesize=295. New hlog /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977943913
> org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 1; oldest outstanding sequenceid is 270055 from region -ROOT-,,0
> org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase-89-su/.logs/hbasedev,60020,1288977933643/10.10.1.177%3A60020.1288977933829 whose highest sequenceid is 270054 to /hbase-89-su/.oldlogs/10.10.1.177%3A60020.1288977933829
>
> This looks like a very bad bug, I opened
> https://issues.apache.org/jira/browse/HBASE-3198
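
For anyone who wants to reproduce this locally, a minimal sketch of the
relevant setting: hbase.regionserver.logroll.period is the roll-interval
property behind the LogRoller line above (default 3600000 ms), and the helper
class here is only illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class ShortRollPeriod {
    // Returns a configuration with the WAL roll interval dropped from the
    // default hour to 6 seconds, so a roll -- and any premature archival
    // that follows it -- happens almost immediately on a quiet instance.
    public static Configuration create() {
      Configuration conf = HBaseConfiguration.create();
      conf.setLong("hbase.regionserver.logroll.period", 6000L);
      return conf;
    }
  }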
>
> Can you correlate this with the previous tests you did? Hopefully it's
> the same problem.
>
> Thanks a lot for your patience and investigative work!
>
> J-D
>
> On Fri, Nov 5, 2010 at 9:28 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com>
> wrote:
> > We have found the problem, again. But this time, we dug deeper and extracted more info.
> >
> > First, the reason that we were unable to reproduce was trying to kill the RS that holds -ROOT- too soon. That way,
> > 1. log split goes fine,
> > 2. -ROOT- gets assigned on somewhere else,
> > 3. log replaying for that RS goes smoothly
> > 4. .META. does not change.
> > 5. everything is normal
> >
> > But if we try after a while of running the cluster,
> > 1. log split fails with java.io.EOFException for the only hlog file,
> > 2. -ROOT- gets assigned on somewhere else,
> > 3. no log is replayed since none split
> > 4. .META. is invalid, a new .META. is designated
> > 5. data corrupted
> >
> > Before killing the RS that holds -ROOT-, we checked its only hlog file and saw that, in the first scenario, it contains the updates for the location of .META.. But in the second, failing scenario, the file was empty. We found that the log rolling process, which happens once an hour by default, somehow thinks that the hlog is old and cleans it up.
> >
> > # LogRoller: Hlog roll period 3600000ms elapsed
> > # SequenceFileLogWriter: Using syncFs -- HDFS-200
> > # HLog: Roll /hbase/.logs/A/blahfile0316, entries=1, filesize=312. New hlog /hbase/.logs/A/blahfile0476
> > # HLog: Found 1 hlogs to remove  out of total 1; oldest outstanding
> > sequenceid is 4 from region -ROOT-,,0
> > # HLog: moving old hlog file /hbase/.logs/A/blahfile0316 whose highest
> > sequenceid is 3 to /hbase/.oldlogs/blahfile0316
> >
> > In that instance, blahfile0316 contains some rows while blahfile0476 is empty. It very much seems like the problem is here, as it treats the data in the memstore as persisted while it is not. This seemed like a bug to us, but it might as well be a misconfiguration. Any idea is greatly appreciated.
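
To spell out the reasoning, a simplified illustration (not the actual HLog
code) of the archival decision those log lines imply:

  // Simplified sketch only. A rolled log is treated as removable when its
  // highest sequenceid is below the oldest sequenceid still outstanding in
  // some memstore. The excerpt above shows that condition holding (highest
  // 3 < outstanding 4) even though the sequenceid-3 edit -- -ROOT-'s pointer
  // to .META. -- was never flushed, which is exactly the "thinks the memstore
  // data is persisted" suspicion and what HBASE-3198 tracks.
  final class OldLogCheck {
    static boolean looksSafeToArchive(long highestSeqIdInRolledLog,
                                      long oldestOutstandingSeqId) {
      return highestSeqIdInRolledLog < oldestOutstandingSeqId;
    }
  }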
> >
> >
> > On Wed, Nov 3, 2010 at 4:36 PM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> >
> >> We suspect some misconfiguration too but are unsure what that might be. Almost all of the configuration parameters are at their defaults, along with dfs.support.append, which is true in the append branch. We checked the RS logs and couldn't find any Exceptions; everything seems normal, so it is unlikely to be a bug. The only WARN was about the last hlog file being empty, but I guess this is normal.
> >>
> >> To confirm whether it is an append-configuration-related problem, we killed the RS that holds -ROOT- right after starting the cluster and checked /hbase/.oldlogs/. There was a file containing a line for -ROOT-, so I think append works correctly. The file got removed after a while.
> >>
> >> BTW, we tried to reproduce it with smaller data (to control manually, like 2 rows in 1 table, newly put) but didn't have any problems. I am starting to think the problem just disappeared. I'll increase the rows to hundreds and check again.
> >>
> >> Until I get the problem again, thanks for your time.
> >>
> >> On Tue, Nov 2, 2010 at 7:28 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>
> >>> This confirms my suspicion that -ROOT-'s latest data was lost and that
> >>> .META. got double assigned, which easily explains all the weirdness
> >>> you're seeing after the log replay.
> >>>
> >>> The most probable reason would be misconfiguration, since the append
> >>> feature is very well tested and a simple kill -9 wouldn't cause all
> >>> sorts of issues. Else, it could be caused by an HBase bug, in which
> >>> case exceptions would probably come out either during log splitting or
> >>> during log replay (which happens when the region is reassigned to
> >>> another region server). In your case you could take a look at the RS
> >>> log where -ROOT- ends up and see if there's anything weird.
> >>>
> >>> To test if append works correctly, in order to rule that out, just
> >>> bring up your cluster and kill -9 right away the RS that's holding
> >>> ROOT. Once the master did the log splitting, do a lookup directly in
> >>> hadoop (bin/hadoop fs -cat) on that log file which should be under
> >>> /hbase/.oldlogs/serverstring/somelog and look for that -ROOT- entry.
> >>> If it's empty or missing, it's probably a misconfiguration.
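
A rough programmatic equivalent of that check, as a sketch (it assumes the
plain Hadoop FileSystem API and a /hbase root; substitute your hbase.rootdir
and the actual server directory):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListOldLogs {
    public static void main(String[] args) throws Exception {
      // List archived logs and their sizes so an unexpectedly empty (0-byte)
      // file stands out; pass a deeper path to inspect a single server's logs.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      for (FileStatus stat : fs.listStatus(new Path("/hbase/.oldlogs"))) {
        System.out.println(stat.getPath() + "  " + stat.getLen() + " bytes");
      }
    }
  }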
> >>>
> >>> J-D
> >>>
> >>> On Tue, Nov 2, 2010 at 9:02 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> >>> > We reproduced the problem by kill -9ing the region server that hosts -ROOT- again. Results are the same as what I explained before. After log splitting, metaScanner complained that .META. is not valid and reassigned it somewhere else. As a result, the number of rows in .META. dropped from 24 to 16, meaning we lost access to some regions and some table definitions. Supporting what you said about the data in the memstore, we seem to have lost only recent regions.
> >>> >
> >>> > status 'detailed' showed no regionsInTransition and showed a total of 39 regions assigned to region servers, but as I've said before, we have access to only 16 of them. Needless to say, our queries fail.
> >>> >
> >>> > Any idea we could try is greatly appreciated. Thanks in advance.
> >>> >
> >>> > PS. Something probably irrelevant happened this time: interface 60010 lists all our tables while the 'list' command in the shell lists only a few. We get TableNotFoundExceptions on tables that are shown in the web interface but missing from the 'list' output. It seems the web interface is able to list inaccessible tables.
> >>> >
> >>> > On Tue, Nov 2, 2010 at 9:56 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> >>> >
> >>> >> By "lost tables" I mean no user table definitions were listed in interface 60010 and my queries got me TableNotFoundExceptions. The routine BaseScanner logged something like
> >>> >>
> >>> >> # RegionManager.metaScanner scan of 0 row(s) of meta region ... complete
> >>> >>
> >>> >> so I guess my .META. was empty. But unfortunately I don't know if any regions were stuck in transition; I'll be sure to check this while I try to reproduce.
> >>> >>
> >>> >> I rechecked the logs, specifically after the splitting completed, and they contain lines like "Current assignment is not valid..." so I guess this is something unexpected. But in hopes that you may spot some configuration error, the whole log goes like this (stripped to be more readable):
> >>> >>
> >>> >> # hlog file splitting completed in 80372 millis
> >>> >> # Log split complete, meta reassignment and scanning:
> >>> >> # ProcessServerShutdown reassigning ROOT region
> >>> >> # -ROOT- region unset (but not set to be reassigned)
> >>> >> # ROOT inserted into regionsInTransition
> >>> >> # Assigning for serverName=C...  total nregions to assign=1, regions to give other servers than this=0, isMetaAssign=true
> >>> >> # Assigning serverName=C... 1 regions
> >>> >> # Assigning region -ROOT-... to C...
> >>> >> # Processing MSG_REPORT_OPEN: -ROOT-... from C...; 1 of 1
> >>> >> # RegionManager.rootScanner scanning meta region {server: C ...
> >>> >> # Current assignment of .META.... is not valid;  serverAddress=A,
> >>> >> startCode=blah unknown; checking once more!
> >>> >> # Current assignment of .META.... is not valid;  serverAddress=A,
> >>> >> startCode=blah unknown.
> >>> >> # RegionManager.rootScanner scan of 1 row(s) of meta region {server: C ... complete
> >>> >> # RegionManager.rootScanner scanning meta region {server: C ...
> >>> >> # RegionManager.rootScanner scan of 1 row(s) of meta region {server: C ... complete
> >>> >> # Assigning for serverName=E...  total nregions to assign=-1, regions to give other servers than this=2, isMetaAssign=true
> >>> >> # Assigning serverName=E... -1 regions
> >>> >> # Assigning region .META.... to E...
> >>> >> # Processing MSG_REPORT_OPEN: .META.... from E...; 1 of 1
> >>> >> # Processing todo: PendingOpenOperation from E...
> >>> >> # .META.... open on E
> >>> >> # Updated row .META.... in region -ROOT- ...
> >>> >> # Adding to onlineMetaRegions: {server: E ...
> >>> >> # RegionManager.metaScanner scanning meta region {server: E ...
> >>> >> # RegionManager.metaScanner scan of 0 row(s) of meta region {server: E ...
> >>> >> # All 1 .META. region(s) scanned
> >>> >> 10 secs later
> >>> >> # Processing todo: ProcessServerShutdown of F
> >>> >> # Process shutdown of server F...: logSplit: true, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> >>> >> # Log split complete, meta reassignment and scanning
> >>> >> # Process server shutdown scanning root region on C
> >>> >> # Process server shutdown scanning root region on C finished master
> >>> >> # process server shutdown scanning .META.,,1 on E
> >>> >> # process server shutdown finished scanning .META.,,1 on E
> >>> >> # Removed F... from deadservers Map
> >>> >>
> >>> >> Thanks again.
> >>> >>
> >>> >> On Mon, Nov 1, 2010 at 6:58 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>> >>
> >>> >>> The fact that the tables are "revived" is a clue here IMO, but let's go back to more basic questions...
> >>> >>>
> >>> >>> So when you say that during step 1 you lost tables, what do you mean by "lost"? Were the rows of those tables still in .META.? Were the regions stuck in transition (in the shell, do: status 'detailed')? Or when you tried to query them did you just get a TableNotFoundException?
> >>> >>>
> >>> >>> Also, the fact that only -ROOT- and not .META. was on this region server means that if there was any data lost, it would be .META.'s location, and it would have been assigned somewhere else (E) but still stayed assigned on A. Since the data is in the memstore, recent data couldn't be read by this second assignment of .META.... but it could also be reassigned for a "normal" reason like rebalancing. The best way to confirm that is when the -ROOT- region gets reassigned at the end of step 1 (so this is after the message that goes like "...file splitting completed in 80372..."), do you see something like this in the master's log: "Current assignment of .META.,,some_timestamp is not valid; serverAddress=blah, startcode=blah unknown."? If so, then it seems that data was lost and this is really unexpected.
> >>> >>>
> >>> >>> J-D
> >>> >>>
> >>> >>> On Mon, Nov 1, 2010 at 1:36 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> >>> >>> > Hi again,
> >>> >>> >
> >>> >>> > I have re-checked our configuration to confirm that we have dfs.support.append enabled and saw "Using syncFs -- HDFS-200" in the logs. I inspected the logs around log splits to find something, but I'm not sure that's what we are looking for. In the first step of the scenario I mentioned before (where we kill -9ed everything on the node that hosts the ROOT region), HLog says (stripping hdfs:// prefixes and hostnames for clarity)
> >>> >>> >
> >>> >>> > # Splitting 7 hlog(s) in .logs/F,60020,1287491528908
> >>> >>> >
> >>> >>> > Then it goes over every single one like
> >>> >>> >
> >>> >>> > # Splitting hlog 1 of 7
> >>> >>> > # Splitting hlog 2 of 7
> >>> >>> > # ...
> >>> >>> > # Splitting hlog 7 of 7
> >>> >>> >
> >>> >>> > On the 7th hlog it WARNs with two lines
> >>> >>> >
> >>> >>> > # File .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 might be still open, length is 0
> >>> >>> > # Could not open .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 for reading. File is empty java.io.EOFException
> >>> >>> >
> >>> >>> > And completes with
> >>> >>> >
> >>> >>> > # log file splitting completed in 80372 millis for
> >>> >>> > .logs/F,60020,1287491528908
> >>> >>> >
> >>> >>> > This might be it, but on the sixth step (where we kill -9ed the RegionServer that hosts the only META region), it splits 2 hlogs without any empty-file problems or any log above INFO, yet as I told before, our testtable still got lost.
> >>> >>> >
> >>> >>> > I'll try to reproduce the problem in a cleaner way, but in the meantime, any kind of pointer to any problems we might have is greatly appreciated.
> >>> >>> >
> >>> >>> >
> >>> >>> > On Fri, Oct 29, 2010 at 9:25 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> >>> >>> >
> >>> >>> >> Thanks for the answer.
> >>> >>> >>
> >>> >>> >> I am pretty sure we have dfs.support.append enabled. I remember both the conf file and the logs, and I don't recall seeing any errors on 60010. I crawled through the logs all of yesterday and don't remember anything indicating a specific error either. But I'm not sure about that. Let me check that and get back here on Monday.
> >>> >>> >>
> >>> >>> >> On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >>> >>> >>
> >>> >>> >>> First thing I'd check is if your configuration has dfs.support.append; you can confirm this by looking at your region server logs. When a RS starts, it creates an HLog and will print out "Using syncFs -- HDFS-200" if it's configured, else you'll see "syncFs -- HDFS-200 -- not available, dfs.support.append=false". Also, the master web ui (on port 60010) will print an error message regarding that.
> >>> >>> >>>
> >>> >>> >>> If it's all ok, then you should take a look at the master log when it does the log splitting and see if it contains any obvious errors.
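
A tiny sketch of the same flag check done from code, in case that is handier
than grepping logs (dfs.support.append is the flag named above; the class
itself is only illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class CheckAppendFlag {
    public static void main(String[] args) {
      // Prints whether the configuration loaded from the classpath has the
      // append flag on; the "Using syncFs -- HDFS-200" log line remains the
      // authoritative confirmation on the region server side.
      Configuration conf = HBaseConfiguration.create();
      System.out.println("dfs.support.append = "
          + conf.getBoolean("dfs.support.append", false));
    }
  }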
> >>> >>> >>>
> >>> >>> >>> J-D
> >>> >>> >>>
> >>> >>> >>> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> >>> >>> >>> > Hi all,
> >>> >>> >>> >
> >>> >>> >>> > We have a testing cluster of 6 nodes which we try to run an HBase/MapReduce application on. In order to simulate a power failure, we kill -9ed all things hadoop-related on one of the slave nodes (DataNode, RegionServer, TaskTracker, ZK quorum peer, and I think SecondaryNameNode was on this node too). We were expecting a smooth transition on all services but were unable to get one on the HBase end. While our regions seemed intact (not confirmed), we lost table definitions, which pointed to some kind of META region failure. So our application failed with several TableNotFoundExceptions. The simulation was conducted with no load and extremely small data (like 10 rows in 3 tables).
> >>> >>> >>> >
> >>> >>> >>> > On our setup, HBase is 0.89.20100924, r1001068, while Hadoop runs 0.20.3-append-r964955-1240, r960957. Most of the configuration parameters are at their defaults.
> >>> >>> >>> >
> >>> >>> >>> > If we did something wrong up to this point, please ignore the rest of the message, as I'll try to explain what we did to reproduce it and it might be irrelevant.
> >>> >>> >>> >
> >>> >>> >>> > Say the machines are named A, B, C, D, E, F, where A is the master-like node, the others are slaves, and the power failure is on F. Since we have little data, we have one ROOT and only one META region. I'll try to sum up the whole scenario.
> >>> >>> >>> >
> >>> >>> >>> > A: NN, DN, JT, TT, HM, RS
> >>> >>> >>> > B: DN, TT, RS, ZK
> >>> >>> >>> > C: DN, TT, RS, ZK
> >>> >>> >>> > D: DN, TT, RS, ZK
> >>> >>> >>> > E: DN, TT, RS, ZK
> >>> >>> >>> > F: SNN, DN, TT, RS, ZK
> >>> >>> >>> >
> >>> >>> >>> > 0. Initial state -> ROOT: F, META: A
> >>> >>> >>> > 1. Power fail on F -> ROOT: C, META: E -> lost tables (we waited for about half an hour to get nothing, BTW)
> >>> >>> >>> > 2. Put F back online -> No effect
> >>> >>> >>> > 3. Create a table 'testtable' to see if we lose it
> >>> >>> >>> > 4. Kill -9ed DataNode on F -> No effect -> Start it again
> >>> >>> >>> > 5. Kill -9ed RegionServer on F -> No effect -> Start it again
> >>> >>> >>> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable' but got our tables from before the simulation back. It seemed like because A had META before the simulation, the table definitions were revived.
> >>> >>> >>> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of our original 6 tables, 'testtable' revived. That small data seems corrupted too, as our Scans don't finish.
> >>> >>> >>> > 8. Run to mailing-list.
> >>> >>> >>> >
> >>> >>> >>> > First of all, thanks for reading up to this point. From where we are now, we are not even sure if this is the expected behavior (i.e. if a ROOT or META region dies, we lose data and must do something like hbck), or if we are missing a configuration, or if this is a bug. No need to mention that we are relatively new to HBase, so the last possibility is that we didn't understand it at all.
> >>> >>> >>> >
> >>> >>> >>> > Thanks in advance for any ideas.
> >>> >>> >>> >
> >>> >>> >>> > --
> >>> >>> >>> > erdem agaoglu
> >>> >>> >>> >
> >>> >>> >>>
> >>> >>> >>
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> --
> >>> >>> >> erdem agaoglu
> >>> >>> >>
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> > --
> >>> >>> > erdem agaoglu
> >>> >>> >
> >>> >>>
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> erdem agaoglu
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > erdem agaoglu
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> erdem agaoglu
> >>
> >
> >
> >
> > --
> > erdem agaoglu
> >
>



-- 
erdem agaoglu
