hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Node failure causes weird META data?
Date Mon, 01 Nov 2010 16:58:57 GMT
The fact that the tables are "revived" is a clue here IMO, but let's
go back to more basic questions...

So when you say that during step 1 you lost tables, what do you mean
by "lost"? Were the rows of those tables still in .META.? Were the
regions stuck in transition (in the shell, do: status 'detailed')? Or
when you tried to query them you just got a TableNotFoundException?
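
For reference, a rough client-side version of those checks, as a sketch
against the 0.89/0.90 Java client API ("testtable" below is just a
hypothetical stand-in for one of the lost tables), could look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableNotFoundException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Dump the row keys still present in .META.; the region rows of a
    // "lost" table should still show up here if only its location is stale.
    HTable meta = new HTable(conf, ".META.");
    ResultScanner rows = meta.getScanner(new Scan());
    for (Result r : rows) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    rows.close();

    // Opening the table tells us whether the client can find it at all.
    try {
      new HTable(conf, "testtable");  // hypothetical table name
      System.out.println("testtable is visible to the client");
    } catch (TableNotFoundException e) {
      System.out.println("testtable not found in .META.");
    }
  }
}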

Also, the fact that only -ROOT- and not .META. was on this region
server means that if any data was lost, it would be .META.'s location:
.META. would have been assigned somewhere else (E) while still staying
assigned on A. Since that data is in the memstore, recent data
couldn't be read by this second assignment of .META.... but it could
also have been reassigned for a "normal" reason like rebalancing. The
best way to confirm that is, when the -ROOT- region gets reassigned at
the end of step 1 (so after the message that goes like "...file
splitting completed in 80372..."), do you see something like this in
the master's log: "Current assignment of .META.,,some_timestamp is
not valid; serverAddress=blah, startcode=blah unknown."? If so, then
it seems that data was lost and this is really unexpected.
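
One way to cross-check from the client side (again only a sketch against
the 0.89/0.90 Java client API, not the only way to do this) is to ask
where .META.'s single region is currently assigned and compare that with
what the master log says:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WhereIsMeta {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Ask the client where it currently finds .META.'s (single) region;
    // the lookup goes through -ROOT-, so it reflects the live assignment.
    HTable meta = new HTable(conf, ".META.");
    HRegionLocation loc = meta.getRegionLocation(Bytes.toBytes(""));
    System.out.println(".META. is being served by " + loc.getServerAddress());
  }
}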

J-D

On Mon, Nov 1, 2010 at 1:36 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
> Hi again,
>
> I have re-checked our configuration to confirm that we have
> dfs.support.append enabled and saw "Using syncFs -- HDFS-200" in logs. I
> inspected logs around log splits to find something, but I'm not sure that's
> what we are looking for. In the first step of the scenario I mentioned
> before (where we kill -9ed everything on the node that hosts the ROOT
> region), HLog says (stripping hdfs:// prefixes and hostnames for clarity)
>
> # Splitting 7 hlog(s) in .logs/F,60020,1287491528908
>
> Then it goes over every single one like
>
> # Splitting hlog 1 of 7
> # Splitting hlog 2 of 7
> # ...
> # Splitting hlog 7 of 7
>
> On the 7th hlog it WARNs with two lines
>
> # File .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546 might
> be still open, length is 0
> # Could not open .logs/F,60020,1287491528908/10.1.10.229%3A60020.1288021443546
> for reading. File is empty: java.io.EOFException
>
> And completes with
>
> # log file splitting completed in 80372 millis for
> .logs/F,60020,1287491528908
>
> This might be it, but on the sixth step (where we kill -9ed the RegionServer
> that hosted the only META region), it split 2 hlogs without any empty-file
> problems or any log above INFO, but as I told before, our testtable still
> got lost.
>
> I'll try to reproduce the problem in a cleaner way, but in the meantime, any
> pointers to problems we might have are greatly appreciated.
>
>
> On Fri, Oct 29, 2010 at 9:25 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com> wrote:
>
>> Thanks for the answer.
>>
>> I am pretty sure we have dfs.support.append enabled. I remember seeing it
>> in both the conf file and the logs, and don't recall seeing any errors on
>> 60010. I crawled through logs all yesterday but don't remember anything
>> indicating a specific error either. But I'm not sure about that. Let me
>> check and get back here on Monday.
>>
>> On Thu, Oct 28, 2010 at 7:30 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>
>>> First thing I'd check is if your configuration has dfs.support.append,
>>> you can confirm this by looking at your region server logs. When a RS
>>> starts, it creates an HLog and will print out: "Using syncFs --
>>> HDFS-200" if it's configured, else you'll see "syncFs -- HDFS-200 --
>>> not available, dfs.support.append=false". Also the master web ui (on
>>> port 60010) will print an error message regarding that.
>>>
>>> If it's all ok, then you should take a look at the master log when it
>>> does the log splitting and see if it contains any obvious errors.
>>>
>>> J-D
>>>
>>> On Thu, Oct 28, 2010 at 12:58 AM, Erdem Agaoglu <erdem.agaoglu@gmail.com>
>>> wrote:
>>> > Hi all,
>>> >
>>> > We have a testing cluster of 6 nodes which we try to run an
>>> > HBase/MapReduce application on. In order to simulate a power failure we
>>> > kill -9ed all things hadoop related on one of the slave nodes (DataNode,
>>> > RegionServer, TaskTracker, ZK quorum peer and I think SecondaryNameNode
>>> > was on this node too). We were expecting a smooth transition on all
>>> > services but were unable to get one on the HBase end. While our regions
>>> > seemed intact (not confirmed), we lost table definitions, which points to
>>> > some kind of META region failure. So our application failed with several
>>> > TableNotFoundExceptions. The simulation was conducted with no load and
>>> > extremely small data (like 10 rows in 3 tables).
>>> >
>>> > On our setup, HBase is 0.89.20100924, r1001068 while Hadoop
>>> > runs 0.20.3-append-r964955-1240, r960957. Most of the configuration
>>> > parameters are at their defaults.
>>> >
>>> > If we did something wrong up to this point, please ignore the rest of
>>> > the message, as I'll try to explain what we did to reproduce it and it
>>> > might be irrelevant.
>>> >
>>> > Say the machines are named A, B, C, D, E, F; where A is the master-like
>>> > node, the others are slaves, and the power failure is on F. Since we have
>>> > little data, we have one ROOT and only one META region. I'll try to sum
>>> > up the whole scenario.
>>> >
>>> > A: NN, DN, JT, TT, HM, RS
>>> > B: DN, TT, RS, ZK
>>> > C: DN, TT, RS, ZK
>>> > D: DN, TT, RS, ZK
>>> > E: DN, TT, RS, ZK
>>> > F: SNN, DN, TT, RS, ZK
>>> >
>>> > 0. Initial state -> ROOT: F, META: A
>>> > 1. Power fail on F -> ROOT: C, META: E -> lost tables, waited for about
>>> > half an hour to get nothing BTW
>>> > 2. Put F back online -> No effect
>>> > 3. Create a table 'testtable' to see if we lose it
>>> > 4. Kill -9ed DataNode on F -> No effect -> Start it again
>>> > 5. Kill -9ed RegionServer on F -> No effect -> Start it again
>>> > 6. Kill -9ed RegionServer on E -> ROOT: C, META: A -> We lost 'testtable'
>>> > but got our tables from before the simulation back. It seems that because
>>> > A had META before the simulation, the table definitions were revived.
>>> > 7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 out of
>>> > our original 6 tables, 'testtable' revived. That small data seems
>>> > corrupted too as our Scans don't finish.
>>> > 8. Run to mailing-list.
>>> >
>>> > First of all, thanks for reading up to this point. From where we are now,
>>> > we are not even sure if this is the expected behavior (i.e. if the ROOT
>>> > or META region dies we lose data and must do something like hbck), or if
>>> > we are missing a configuration, or if this is a bug. Needless to say, we
>>> > are relatively new to HBase, so the last possibility is that we didn't
>>> > understand it at all.
>>> >
>>> > Thanks in advance for any ideas.
>>> >
>>> > --
>>> > erdem agaoglu
>>> >
>>>
>>
>>
>>
>> --
>> erdem agaoglu
>>
>
>
>
> --
> erdem agaoglu
>
