hbase-dev mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: OfflineMetaRepair?
Date Fri, 06 Jan 2012 17:19:55 GMT
I suspect problems in region splitting are the root of META holes, especially
for onesy-twosy hbck region consistency problems.

Jon.

On Thu, Jan 5, 2012 at 8:40 PM, Vladimir Rodionov
<vrodionov@carrieriq.com> wrote:

> Jon,
>
> My question was about "orphaned" data in hdfs in the first place. It looks
> like either region splits or table deletes (or both) are not executed
> correctly (with old data not being removed completely).
>
> Our original issue was related to .META. inconsistency (region holes) for
> one of our internal system tables.
> How it occurred is beyond my comprehension, therefore I can't say for sure
> what the reason was.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Jonathan Hsieh [jon@cloudera.com]
> Sent: Thursday, January 05, 2012 5:58 PM
> To: dev@hbase.apache.org
> Subject: Re: OfflineMetaRepair?
>
> Vlad,
>
> If it is a deleted table, you can just delete those dirs out of hdfs
> directly.
>
> The workflow for this first cut of the tool is cautious and requires the
> user to decide what to do with orphaned data and to handle it manually.
> Basically, at the time, I had only encountered this kind of problem a few
> times, I didn't want the tool to delete any data, and I wanted to push
> that decision to the user.
>
> The problem that triggered me to write this tool was a situation where 2300
> meta rows were bad and 3 hdfs regiondirs were missing .regioninfo files.
> Manually repairing meta was out of the question.  The likely cause in that
> situation was that the hdfs nn died under hbase and hbase likely got
> confused during recovery.
>
> Other cases where I've encountered similar problems generally have to do
> with regionsplits that failed to complete successfully and failed to
> rollback properly.
>
> Did you encounter any of these kinds of events that could have triggered
> your problems?
>
> FWIW, I'm in the process of debugging a new version (HBASE-5128) of the
> tool that tries to automatically restore data while online.  Hopefully
> this can repair bad region splits in a relatively painless manner.
> Currently the test cases pass and I'm testing against a real cluster that
> I'm intentionally corrupting.  Hopefully I should have a patch for 0.90.5
> ready in a few days (but there may be limitations).
>
> Jon.
>
> On Thu, Jan 5, 2012 at 5:37 PM, Vladimir Rodionov
> <vrodionov@carrieriq.com> wrote:
>
> > I cp'ed hdfs-site.xml into HBASE_CONF_DIR and was able to run the tool.
> >
> > The tool found a lot of abandoned regions:
> >
> > like this one:
> >
> > 12/01/06 01:18:15 ERROR util.HBaseFsck: Bailed out due to:
> > org.apache.hadoop.hbase.util.HBaseFsck$RegionInfoLoadException: Unable to
> > load region info for table TRIAL-DIMENSIONS-1324576713641!  It may be an
> > invalid format or version file.  You may want to remove
> > hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase/TRIAL-DIMENSIONS-1324576713641/ff6031e6472d10bac8517314179acb33
> > region from hdfs and retry.
> >        at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:292)
> >        at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
> >        at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRepair.java:90)
> >
> > There are hundreds of such regions, literally.
> >
> > Region directories contain only .tmp subdir, like this one:
> >
> > /hbase/M2M-INTEGRATION-MM_ERRORS-1324575562966/fd480b2c39f7d3333308bf1d9a304510/.tmp
> >
> > No .regioninfo
> >
> > These dirs are leftovers of tables which have already been deleted, and
> > they confuse this tool.  If we delete a table we should wipe out the whole
> > directory, shouldn't we?
> > Is there any scenario which can explain this?
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: vrodionov@carrieriq.com
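
The symptom described above (region directories holding only a .tmp subdir
and no .regioninfo) can be checked on a local filesystem copy of the HBase
root before deciding what to delete. The following is a minimal sketch in
plain Java, not part of hbck or OfflineMetaRepair. The class name
MissingRegionInfoScan, the assumed layout <root>/<table>/<region>/.regioninfo,
and the skipped directory names .logs and .oldlogs are assumptions made for
illustration. It only reports directories and deletes nothing.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not an HBase class): reports region directories that
// have no .regioninfo file in a LOCAL copy of the HBase root directory.
class MissingRegionInfoScan {

    // Assumed layout: <hbaseRoot>/<table>/<encoded-region-name>/.regioninfo
    static List<Path> findRegionDirsWithoutRegionInfo(Path hbaseRoot) throws IOException {
        List<Path> missing = new ArrayList<>();
        try (DirectoryStream<Path> tables = Files.newDirectoryStream(hbaseRoot)) {
            for (Path table : tables) {
                String name = table.getFileName().toString();
                // Assumption: .logs and .oldlogs hold WALs, not table data.
                if (!Files.isDirectory(table) || name.equals(".logs") || name.equals(".oldlogs")) {
                    continue;
                }
                try (DirectoryStream<Path> regions = Files.newDirectoryStream(table)) {
                    for (Path region : regions) {
                        // Dot-prefixed entries directly under a table dir are not treated as regions.
                        boolean dotEntry = region.getFileName().toString().startsWith(".");
                        if (Files.isDirectory(region) && !dotEntry
                                && !Files.exists(region.resolve(".regioninfo"))) {
                            missing.add(region);
                        }
                    }
                }
            }
        }
        Collections.sort(missing); // deterministic order for reporting
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: MissingRegionInfoScan <local-copy-of-hbase-root>");
            return;
        }
        for (Path p : findRegionDirsWithoutRegionInfo(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Whether a reported directory is safe to remove remains a manual decision, as
discussed elsewhere in this thread; the sketch makes no attempt to tell a
deleted table from a failed split.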
> >
> > ________________________________________
> > From: Todd Lipcon [todd@cloudera.com]
> > Sent: Thursday, January 05, 2012 4:50 PM
> > To: dev@hbase.apache.org
> > Subject: Re: OfflineMetaRepair?
> >
> > Are you sure you have fs.default.name set properly to hdfs://yournn/
> > in your hbase-site.xml?
> >
> > You shouldn't *have* to do this, but I bet it will fix the issue.
> >
> > -Todd
> >
> > On Thu, Jan 5, 2012 at 4:26 PM, Jonathan Hsieh <jon@cloudera.com> wrote:
> > > Hey Vlad,
> > >
> > > I wrote the tool -- and I've used it to repair a fairly messed up META
> > > table.  I must have used it on a local filesystem copy of META (just got
> > > all the .regioninfo files in their directory paths), and then shipped the
> > > repaired version of the .META. dir to the customer.
> > >
> > > This is definitely a bug.  File the JIRA and I'll try to fix it in the
> > > next few days.
> > >
> > > Jon.
> > >
> > > On Thu, Jan 5, 2012 at 4:16 PM, Vladimir Rodionov
> > > <vrodionov@carrieriq.com> wrote:
> > >
> > >> Ted,
> > >>
> > >> "fs.default.name" is a standard config property name which is
> > >> described here:
> > >> http://hadoop.apache.org/common/docs/current/core-default.html
> > >>
> > >> It is not CDH-specific. If you are right, then this tool has never
> > >> been tested.
> > >>
> > >> Best regards,
> > >> Vladimir Rodionov
> > >> Principal Platform Engineer
> > >> Carrier IQ, www.carrieriq.com
> > >> e-mail: vrodionov@carrieriq.com
> > >>
> > >> ________________________________________
> > >> From: Ted Yu [yuzhihong@gmail.com]
> > >> Sent: Thursday, January 05, 2012 4:06 PM
> > >> To: dev@hbase.apache.org
> > >> Subject: Re: OfflineMetaRepair?
> > >>
> > >> Vlad:
> > >> In the future, please drop unrelated discussion from bottom of your
> > >> email.
> > >>
> > >> I think what you saw was caused by FS default name not being set
> > >> correctly.
> > >> In hbck:
> > >>        conf.set("fs.defaultFS", conf.get(HConstants.HBASE_DIR));
> > >> But cdh3 uses:
> > >>    conf.set("fs.default.name", "hdfs://localhost:0");
> > >> ./src/test/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
> > >>
> > >> You can try adding the following line after line 77 of
> > >> OfflineMetaRepair.java:
> > >>    conf.set("fs.default.name", path);
> > >> and rebuilding hbase 0.90.6 (tip of 0.92 branch)
> > >>
> > >> If the above works, please file a JIRA.
> > >>
> > >> Thanks
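
To illustrate why a default filesystem of file:/// rejects an hdfs:// path
with "Wrong FS", here is a simplified stand-in for that kind of check. It is
not Hadoop's code: the class WrongFsSketch and its comparison rules are
assumptions for illustration, and the host name yournn is a placeholder. It
uses only java.net.URI so it runs without Hadoop on the classpath.

```java
import java.net.URI;

// Simplified model of a "does this path belong to this filesystem?" check.
// NOT Hadoop's implementation; the rules below are assumptions for illustration.
class WrongFsSketch {

    // Throws IllegalArgumentException when the path's scheme or authority
    // does not match the filesystem URI the caller is working against.
    static void checkPath(URI fsUri, URI path) {
        if (path.getScheme() == null) {
            return; // scheme-less paths are taken as relative to fsUri
        }
        boolean schemeOk = path.getScheme().equalsIgnoreCase(fsUri.getScheme());
        boolean authorityOk = path.getAuthority() == null
                || path.getAuthority().equalsIgnoreCase(fsUri.getAuthority());
        if (!schemeOk || !authorityOk) {
            throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    public static void main(String[] args) {
        URI regionInfo = URI.create("hdfs://yournn:9000/hbase/t1/r1/.regioninfo");

        // Default filesystem pointing at HDFS: the path is accepted.
        checkPath(URI.create("hdfs://yournn:9000/"), regionInfo);

        // Default filesystem left at the local default: the path is rejected.
        try {
            checkPath(URI.create("file:///"), regionInfo);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model, the error goes away once the configured default filesystem
names the same scheme and authority as the -base path, which is consistent
with the suggestions in this thread to set fs.default.name.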
> > >>
> > >> On Thu, Jan 5, 2012 at 3:30 PM, Vladimir Rodionov
> > >> <vrodionov@carrieriq.com> wrote:
> > >>
> > >> > 0.90.5
> > >> >
> > >> > I am trying to repair .META. table using this tool
> > >> >
> > >> > 1.  HBase cluster was shut down
> > >> >
> > >> > Then I ran:
> > >> >
> > >> > 2. [name01 bin]$ hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
> > >> > -base hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase -details
> > >> >
> > >> >
> > >> > This is what I got:
> > >> >
> > >> > 12/01/05 23:23:15 INFO util.HBaseFsck: Loading HBase regioninfo from
> > >> > HDFS...
> > >> > 12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to:
> > >> > java.lang.IllegalArgumentException: Wrong FS:
> > >> > hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-1325190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo,
> > >> > expected: file:///
> > >> >        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
> > >> >        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
> > >> >        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
> > >> >        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
> > >> >        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:126)
> > >> >        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
> > >> >        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
> > >> >        at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256)
> > >> >        at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284)
> > >> >        at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
> > >> >        at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRepair.java:90)
> > >> >
> > >> >
> > >> > Q: What am I doing wrong?
> > >> >
> > >> > Best regards,
> > >> > Vladimir Rodionov
> > >> > Principal Platform Engineer
> > >> > Carrier IQ, www.carrieriq.com
> > >> > e-mail: vrodionov@carrieriq.com
> > >> >
> > >> >
> > >>
> > >> Confidentiality Notice:  The information contained in this message,
> > >> including any attachments hereto, may be confidential and is intended
> > to be
> > >> read only by the individual or entity to whom this message is
> > addressed. If
> > >> the reader of this message is not the intended recipient or an agent
> or
> > >> designee of the intended recipient, please note that any review, use,
> > >> disclosure or distribution of this message or its attachments, in any
> > form,
> > >> is strictly prohibited.  If you have received this message in error,
> > please
> > >> immediately notify the sender and/or Notifications@carrieriq.com and
> > >> delete or destroy any copy of this message and its attachments.
> > >>
> > >
> > >
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
