hbase-user mailing list archives

From Bill Graham <billgra...@gmail.com>
Subject Re: Lost .META., lost tables
Date Mon, 24 Jan 2011 07:08:06 GMT
Thanks for all the pointers Stack. I've since re-initialized HBase so
many of the diagnostic steps you've suggested no longer apply, but
they've got me better armed for trouble next time I need them.

I've attached the master logs from the time I created mytable to when I
finally shut down the cluster to wipe out HDFS and start over. I've
used sed heavily to replace sensitive info with dummy placeholders, so
let me know if some things don't make sense.

I actually still have the old region data saved aside. Just curious,
is there an easy way to import that into the new table without writing
a MR job? If it's easy to save the data I will, but I can survive
without it.

More comments below.

thanks,
Bill


On Sun, Jan 23, 2011 at 11:16 AM, Stack <stack@duboce.net> wrote:
> On Sat, Jan 22, 2011 at 10:27 AM, Bill Graham <billgraham@gmail.com> wrote:
>> Hi,
>>
>> Last night while experimenting with getting lzo set up I managed to
>> somehow lose all .META. data and all my tables. My regions still exist
>> in HDFS, but the shell tells me I have no tables.
>
> If you scan .META., whats it say?
>
> hbase> scan '.META.'
>
>
> Is it empty?

That's a great question, I forgot that .META. is just a table that I
could scan. Next time I'll try that.

>
> You are running NTP on your cluster and all machines are close in
> time? (Edits are set with server's local time; perhaps .META. region
> moved to a machine whose clock was way behind?)
>

Yes, NTP is running and clocks are in sync.

>
>
>> At this point I'm
>> pretty sure I need to reinstall HBase clean-slate on HDFS, hence
>> losing all data, but I'm sharing my story in case there are JIRAs to
>> be created or lessons to be learned.
>>
>
> In 0.89.x and previous, bin/add_table.rb would rebuild your .META.
> You could try it.  You will probably have to restart your cluster
> after its done to have hbase assign tables (the regular process that
> would do this on a live cluster has been removed in 0.90, replaced w/
> different mechanism -- script needs updating or rather replacing but
> not done yet).
>
>
>> Specifics:
>> - 4 Node cluster running 0.90.0.rc1
>> - 1 table of a few GBs and 24 regions, let's call it TableA
>> - CDH3b2
>>
>
> Append was enabled on this cluster?  (I'm not sure if CDH enables
> append by default. Here is the flag:
>
> <property>
>  <name>dfs.support.append</name>
>  <value>true</value>
>  <description>This branch of HDFS supports reliable append/sync.
>  </description>
> </property>
>
> )
>
> If not enabled, then on crash edits to .META. may have been lost?

Yes, append is enabled.
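
For anyone who wants to double-check the same flag on their own cluster,
here is a minimal sketch that parses a Hadoop-style config file and
reports whether dfs.support.append is set to true. The function name and
the SAMPLE text are made up for illustration, and treating a missing
property as "not enabled" is an assumption; the real default depends on
the Hadoop build in use.

```python
import xml.etree.ElementTree as ET


def append_enabled(xml_text):
    """Return True if dfs.support.append is explicitly set to true."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip().lower()
        if name == "dfs.support.append":
            return value == "true"
    # Property absent: reported as not enabled (assumption, see above).
    return False


# Illustrative config text, not taken from a real cluster.
SAMPLE = """<configuration>
<property>
 <name>dfs.support.append</name>
 <value>true</value>
</property>
</configuration>"""

print(append_enabled(SAMPLE))
```

In practice the text would be read from the hdfs-site.xml (or
hbase-site.xml) that the daemons actually load.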

>
>
>> 1. Just for kicks I decided to issue an alter table command to change
>> COMPRESSION to 'lzo' for TableA to see what would happen. I hadn't yet
>> taken any steps to install the native lzo libs in HBase (they exist in
>> HDFS), so this was probably a stupid thing to do. After issuing the
>> command I wasn't able to re-enable the table, nor could I fully
>> disable it. I was in a state somewhere in between the two, as
>> described in a thread earlier this week.
>
> Yeah. Sounds like the "Wayne-scenario".  If LZO libs are not properly
> installed, regions won't deploy.  They fail in messy way.  You can add
> some insurance with something like this facility
> http://hbase.apache.org/hbase.regionserver.codecs.html.  There is also
> a tool to test for proper LZO install in hbase (See 'Testing
> Compression is enabled' in
> 'http://wiki.apache.org/hadoop/UsingLzoCompression').
>
>> The shell said enabled, the
>> master.jsp said disabled. Calls to do either would time out. The
>> master server was logging the same exceptions as in HBASE-3406 ad
>> infinitum. hbck -fix wasn't doing anything. After bouncing the entire
>> cluster a few times (master, RSs, zookeepers), I was able to finally
>> get back to normal state, with COMPRESSION set to 'none'  with hbck
>> -fix.
>>
>
> Sorry for pain caused.  Enable/disable is flakey in 0.89.x and previous.
>
> Should be better in 0.90.0.
>
>
>> Besides HBASE-3406, maybe there's another JIRA here where the shell
>> permits setting COMPRESSION => 'lzo' when lzo isn't set up and leaves
>> the table in a nasty state.
>>
>
> Please add a comment to hbase-3406 and any substantiating evidence if
> you can since that issue is a little ungrounded at the mo.

Will do. It should be at least easy to reproduce now.

>
>
>> At this point I should have been grateful and called it a night, but
>> noooooo... Instead I shut down the cluster again and symlinked
>> lib/native to the same dir in my hadoop home, which is lzo-enabled and
>> I restarted the cluster. All seemed ok.
>>
>
> OK.  Serves you right for sticking with it (smile).

I know, I've learned this lesson before. When I'm working too late in
the evening, trying just one more thing while the eyes/brain are
groggy, bad things happen.

>
>
>> 2. At this point I decided to experiment with a new table after
>> reading http://wiki.apache.org/hadoop/UsingLzoCompression more
>> closely. After creating 'mytable' with lzo enabled, I saw similar
>> behavior as I did in 1. so I used the same techniques to just try to
>> just drop the table. After bouncing the cluster and issuing a hbck
>> -fix, the shell reported that HBase had no tables at all. It seemed
>> like all the .META. data was wiped out but I still had all of my
>> orphaned regions in HDFS. This was very bad.
>>
>
> Yeah.  You have that master log?  You think hbck -fix really 'fixed'
> your cluster?

I can't say for sure, but I recall that neither drop table nor hbck
-fix would work, so I restarted the cluster. Then hbck -summary
still showed an inconsistent state, so I ran hbck -fix and things seemed
ok. The drop table command then succeeded, but 'list' showed no
tables. I didn't run 'list' before the drop table command, so it's
possible they were gone before I ran drop. Actually, they could have
been gone after the bounce but before the hbck -fix command.


>
>
>> It was clear that these tables weren't coming back so in a last ditch
>> effort I stopped the HBase cluster, the SNN and the NN and I restored
>> HDFS from the checkpoint taken about an hour before.
>
> Checkpoint?  A distcp or something?

Restore to a SNN checkpoint, which now that I think of it makes no
sense. That just restores the HDFS file metadata to the last
checkpoint, not the file contents.

>
>> Now everything
>> was out of whack and HBase wouldn't even come up and -ROOT- couldn't
>> be located, .log/ files weren't being read properly and things were a
>> mess.
>>
>
> Hmm.  You think you didn't get the edits up in RS memory?  You didn't
> flush all regions before checkpointing?

No, I didn't flush the regions. I think the cluster couldn't start
because it wasn't able to read files in the .log/ directory in HDFS.
There were alerts in the logs about trying to split logs and not being
able to find some. In HDFS, all the files under the .log/ dir were
empty.
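
A rough sketch of how one could spot such empty files, assuming the
usual eight-column output of `hadoop fs -ls` (permissions, replication,
owner, group, size, date, time, path). The helper name and the sample
listing, including its paths, are invented for illustration and are not
from my cluster.

```python
def empty_files(ls_output):
    """Return paths of zero-length regular files in `hadoop fs -ls` text."""
    empty = []
    for line in ls_output.splitlines():
        fields = line.split(None, 7)
        # Skip the "Found N items" header, directories and odd lines.
        if len(fields) != 8 or not fields[0].startswith("-"):
            continue
        if fields[4] == "0":  # fifth column is the size in bytes
            empty.append(fields[7])
    return empty


# Invented listing for illustration only.
SAMPLE_LS = """Found 3 items
-rw-r--r--   3 hbase supergroup          0 2011-01-22 03:10 /hbase/.logs/rs1/wal.1
-rw-r--r--   3 hbase supergroup       4096 2011-01-22 03:11 /hbase/.logs/rs1/wal.2
drwxr-xr-x   - hbase supergroup          0 2011-01-22 03:12 /hbase/.logs/rs2
"""

print(empty_files(SAMPLE_LS))
```

Directories also show a size of 0 in the listing, which is why the
sketch only looks at lines whose permission string starts with "-".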

>
>
>> One could make the argument that I was beating on HBase a bit and
>> maybe even trying to break things, but it didn't take a lot of effort
>> to get to a pretty dire state.
>>
>
> Not good.  If you can figure a damaging sequence of steps, stick them
> in an issue and I'll try over here.  Enabling LZO w/o support messing
> stuff up is sort of known issue though we should handle it more
> gracefully for sure.

Unfortunately I can't recall the specifics and sequence of all the
things I was trying with enough confidence to make a clear JIRA. It
was some combination of disabling/enabling/deleting a table, hbck -fix
and restarting the cluster that did it.

>
> St.Ack
>
