hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase Region always in transition + corrupt HDFS
Date Tue, 24 Feb 2015 02:35:01 GMT
Arinto:
Probably you should take a look at HBASE-12949.
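
Independent of that JIRA, the corrupt files and stuck regions described in
the thread can usually be enumerated with the stock HDFS and HBase tools of
that era (Hadoop 2.0 / HBase 0.94); a sketch, assuming default paths:

```shell
# List files that have corrupt or missing blocks:
hdfs fsck / -list-corruptfileblocks

# Inspect just the HBase root directory, showing per-file blocks and
# which DataNodes hold them:
hdfs fsck /hbase -files -blocks -locations

# Report HBase-level inconsistencies (e.g. regions stuck in transition,
# holes in the region chain):
hbase hbck -details
```

Cross-referencing the fsck output against the region directory named in a
NotServingRegionException is usually more reliable than working backwards
from client-side exceptions alone.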

Cheers

On Mon, Feb 23, 2015 at 5:25 PM, Arinto Murdopo <arinto@gmail.com> wrote:

> @JM:
> You mentioned deleting "the files": are you referring to HDFS files
> or files in HBase?
>
> Our cluster has 15 nodes. We used 14 of them as DNs. We tried to
> enable the remaining one as a DN (so that we would have 15 DNs), but then
> we disabled it (so now we have 14 again). Possibly our crawlers wrote some
> data to that additional DN without any replication. Maybe I could try to
> re-enable that DN.
>
> I don't have the list of corrupted files yet. I notice that when I try
> to Get some of the files, my HBase client code throws exceptions like this:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Mon Feb 23 17:49:32 SGT 2015,
> org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
>
> plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.
>
> Can I use these exceptions to determine the corrupted files?
> The files are media data (images or videos) obtained from the internet.
>
> @Michael Segel: Yup, 3 is the default and recommended value. We were
> overwhelmed by the amount of data, so we foolishly reduced our
> replication factor. We have learnt the lesson the hard way :).
>
> Fortunately it's okay to lose this data, i.e. we can easily recover it
> from our other data.
>
>
>
> Arinto
> www.otnira.com
>
> On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel <msegel@segel.com> wrote:
>
> > I’m sorry, but I implied checking the checksums of the blocks.
> > Didn’t think I needed to spell it out.  Next time I’ll be a bit more
> > precise.
> >
> > > On Feb 23, 2015, at 2:34 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> > >
> > > HBase/HDFS are maintaining block checksums, so presumably a corrupted
> > > block would fail checksum validation. Increasing the number of replicas
> > > increases the odds that you'll still have a valid block. I'm not an
> > > HDFS expert, but I would be very surprised if HDFS is validating a
> > > "questionable block" via byte-wise comparison over the network amongst
> > > the replica peers.
> > >
> > > On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel <msegel@segel.com>
> > > wrote:
> > >
> > >>
> > >> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo <arinto@gmail.com> wrote:
> > >>
> > >> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> > >> 2.0.0-cdh4.6.0).
> > >> For all of our tables, we set the replication factor to 1
> > >> (dfs.replication = 1 in hbase-site.xml). We set it to 1 because we
> > >> wanted to minimize HDFS usage (now we realize we should have set it
> > >> to at least 2, because "failure is a norm" in distributed systems).
> > >>
> > >>
> > >>
> > >> Sorry, but you really want this to be a replication value of at
> > >> least 3, not 2.
> > >>
> > >> Suppose you have corruption but not a lost block. Which of the two
> > >> copies is right?
> > >> With 3, you can compare the three and hopefully 2 of the 3 will match.
> > >>
> > >>
> >
> >
>
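
The majority argument in the last quoted message can be sketched in a few
lines of Python. This is a toy illustration only: real HDFS does not vote
replicas against each other over the network; as noted above, it validates
each replica against its own stored per-chunk checksum. The function name
and structure here are invented for illustration.

```python
import hashlib

def pick_valid_replica(replicas):
    """Return the content a strict majority of replicas agrees on,
    or None if no majority exists (toy model of 'compare the three
    and hopefully 2 of the 3 will match')."""
    counts = {}
    for data in replicas:
        digest = hashlib.sha256(data).hexdigest()
        counts.setdefault(digest, []).append(data)
    best = max(counts.values(), key=len)
    if len(best) * 2 > len(replicas):  # strict majority required
        return best[0]
    return None

good = b"block-bytes"
bad = b"block-bytez"  # one bit-rotted copy

# With 3 replicas, a single corrupt copy is outvoted:
assert pick_valid_replica([good, good, bad]) == good
# With only 2 replicas that disagree, neither can be trusted:
assert pick_valid_replica([good, bad]) is None
```

This is why replication factor 2 only protects against losing a block, not
against silently disagreeing copies, whereas 3 gives a tiebreaker.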
