hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Lost .META., lost tables
Date Sun, 23 Jan 2011 19:16:49 GMT
On Sat, Jan 22, 2011 at 10:27 AM, Bill Graham <billgraham@gmail.com> wrote:
> Hi,
>
> Last night while experimenting with getting lzo set up I managed to
> somehow lose all .META. data and all my tables. My regions still exist
> in HDFS, but the shell tells me I have no tables.

If you scan .META., what does it say?

hbase> scan '.META.'


Is it empty?

Are you running NTP on your cluster, and are all the machines close in
time? (Edits are timestamped with the server's local time; perhaps the
.META. region moved to a machine whose clock was way behind?)



> At this point I'm
> pretty sure I need to reinstall HBase clean-slate on HDFS, hence
> losing all data, but I'm sharing my story in case there are JIRAs to
> be created or lessons to be learned.
>

In 0.89.x and earlier, bin/add_table.rb would rebuild your .META. for
you. You could try it. You will probably have to restart your cluster
after it's done so that HBase assigns the tables (the regular process
that would do this on a live cluster has been removed in 0.90 and
replaced with a different mechanism -- the script needs updating, or
rather replacing, but that's not done yet).
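If memory serves, the invocation is along these lines (check the usage
note at the top of the script for the exact form; the HDFS path below
is only an example pointing at your TableA directory):

  # rebuild the .META. rows for one table from its directory in HDFS
  $ cd ${HBASE_HOME} && ./bin/hbase org.jruby.Main bin/add_table.rb hdfs://namenode:8020/hbase/TableA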


> Specifics:
> - 4 Node cluster running 0.90.0.rc1
> - 1 table of a few GBs and 24 regions, let's call it TableA
> - CDH3b2
>

Append was enabled on this cluster?  (I'm not sure if CDH enables
append by default. Here is the flag:

<property>
  <name>dfs.support.append</name>
  <value>true</value>
  <description>This branch of HDFS supports reliable append/sync.
  </description>
</property>

)

If it was not enabled, then edits to .META. may have been lost on the
crash?
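A crude way to see what your site config says (this only greps the
site file, so a value coming in from the defaults won't show up):

  $ grep -A 1 dfs.support.append $HADOOP_HOME/conf/hdfs-site.xml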


> 1. Just for kicks I decided to issue an alter table command to change
> COMPRESSION to 'lzo' for TableA to see what would happen. I hadn't yet
> taken any steps to install the native lzo libs in HBase (they exist in
> HDFS), so this was probably a stupid thing to do. After issuing the
> command I wasn't able to re-enable the table, nor could I fully
> disable it. I was in a state somewhere in between the two, as
> described in a thread earlier this week.

Yeah. Sounds like the "Wayne scenario". If the LZO libs are not
properly installed, regions won't deploy, and they fail in a messy way.
You can add some insurance with the hbase.regionserver.codecs facility:
http://hbase.apache.org/hbase.regionserver.codecs.html. There is also a
tool to test for a proper LZO install in HBase (see 'Testing
Compression is enabled' in
http://wiki.apache.org/hadoop/UsingLzoCompression).
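For example, the compression test run looks roughly like this (the
HDFS path is only an example; see the pages above for exact usage):

  # verify the lzo codec loads and can write/read a test file
  $ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://namenode:8020/tmp/testfile lzo

And setting hbase.regionserver.codecs to 'lzo' in hbase-site.xml makes
a regionserver refuse to start if it cannot load the codec, which is
the insurance mentioned above.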

> The shell said enabled, the
> master.jsp said disabled. Calls to do either would time out. The
> master server was logging the same exceptions as in HBASE-3406 ad
> infinitum. hbck -fix wasn't doing anything. After bouncing the entire
> cluster a few times (master, RSs, zookeepers), I was finally able to
> get back to a normal state, with COMPRESSION set to 'none', via hbck
> -fix.
>

Sorry for the pain caused. Enable/disable is flaky in 0.89.x and earlier.

Should be better in 0.90.0.


> Besides HBASE-3406, maybe there's another JIRA here where the shell
> permits setting COMPRESSION => 'lzo' when lzo isn't set up and leaves
> the table in a nasty state.
>

Please add a comment to HBASE-3406, along with any substantiating
evidence you can, since that issue is a little ungrounded at the moment.


> At this point I should have been grateful and called it a night, but
> noooooo... Instead I shut down the cluster again, symlinked
> lib/native to the same dir in my lzo-enabled hadoop home, and
> restarted the cluster. All seemed OK.
>

OK.  Serves you right for sticking with it (smile).


> 2. At this point I decided to experiment with a new table after
> reading http://wiki.apache.org/hadoop/UsingLzoCompression more
> closely. After creating 'mytable' with lzo enabled, I saw similar
> behavior as I did in 1., so I used the same techniques to just try
> to drop the table. After bouncing the cluster and issuing an hbck
> -fix, the shell reported that HBase had no tables at all. It seemed
> like all the .META. data was wiped out but I still had all of my
> orphaned regions in HDFS. This was very bad.
>

Yeah.  You have that master log?  You think hbck -fix really 'fixed'
your cluster?
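If you still have a cluster in that state, running hbck without -fix
just reports inconsistencies, which would be a useful first look:

  $ ./bin/hbase hbck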


> It was clear that these tables weren't coming back, so in a
> last-ditch effort I stopped the HBase cluster, the SNN, and the NN,
> and restored HDFS from the checkpoint taken about an hour before.

Checkpoint?  A distcp or something?

> Now everything was out of whack: HBase wouldn't even come up, -ROOT-
> couldn't be located, .log/ files weren't being read properly, and
> things were a mess.
>

Hmm. Do you think edits that were still only up in RegionServer memory
didn't make it into the checkpoint? You didn't flush all regions before
checkpointing?
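For what it's worth, flushing everything from the shell before taking
the checkpoint would look something like this (your table name from
above):

hbase> flush 'TableA'
hbase> flush '.META.'
hbase> flush '-ROOT-'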


> One could make the argument that I was beating on HBase a bit and
> maybe even trying to break things, but it didn't take a lot of effort
> to get to a pretty dire state.
>

Not good. If you can figure out a damaging sequence of steps, stick it
in an issue and I'll try it over here. Enabling LZO without the native
libs in place messing stuff up is a sort-of known issue, though we
should handle it more gracefully for sure.

St.Ack
