hbase-user mailing list archives

From Arinto Murdopo <ari...@gmail.com>
Subject Re: HBase Region always in transition + corrupt HDFS
Date Tue, 24 Feb 2015 01:25:29 GMT
@JM:
You mentioned deleting "the files" — are you referring to files in HDFS, or
to HBase-level files?

Our cluster has 15 nodes. We use 14 of them as DataNodes (DNs). We actually
tried to enable the remaining one as a DN (so that we would have 15 DNs),
but then we disabled it (so now we have 14 again). Our crawlers probably
wrote some data to that additional DN without any replication. Maybe I
should try to re-enable that DN.

I don't have the list of corrupted files yet. I notice that when I try to
Get some of the data, my HBase client code throws exceptions like these:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
attempts=2, exceptions:
Mon Feb 23 17:49:32 SGT 2015,
org.apache.hadoop.hbase.client.HTable$3@11ff4a1c,
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
plr_sg_insta_media_live,\x0177998597896:953:5:a5:58786,1410771627251.6c323832d2dc77c586f1cf6441c7ef6e.

Can I use these exceptions to determine the corrupted files?
The files are media data (images or videos) obtained from the internet.
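A NotServingRegionException by itself usually means the region is offline or in transition, not necessarily that the underlying blocks are corrupt; the authoritative list of corrupt HDFS files comes from `hdfs fsck / -list-corruptfileblocks`. Still, the exception text does carry the affected region, which can be cross-checked against the fsck output. A minimal sketch (the regex and helper are my own, not part of the HBase client API):

```python
import re

# Hypothetical helper: pull the full region name out of a
# NotServingRegionException message. The region name has the form
# <table>,<start key>,<timestamp>.<encoded name>.
REGION_RE = re.compile(r"Region is not online:\s*(\S+)")

def region_from_exception(message):
    """Return the region name embedded in the exception text, or None."""
    match = REGION_RE.search(message)
    return match.group(1) if match else None

# Exception text as reported above (\x01 kept as a literal escape here).
msg = (
    "org.apache.hadoop.hbase.NotServingRegionException: "
    "Region is not online: "
    "plr_sg_insta_media_live,\\x0177998597896:953:5:a5:58786,"
    "1410771627251.6c323832d2dc77c586f1cf6441c7ef6e."
)

region = region_from_exception(msg)
print(region)                 # full region name
print(region.split(",")[0])   # table name: plr_sg_insta_media_live
```

The table name (first comma-separated component) tells you which directory under the HBase root in HDFS to compare against the fsck report.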

@Michael Segel: Yup, 3 is the default and recommended value. We were
overwhelmed by the amount of data, so we foolishly reduced our replication
factor. We have learnt the lesson the hard way :).

Fortunately it's okay to lose this data, i.e. we can easily recover it from
our other data sources.
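For reference, restoring the default is a one-property change (we set it in hbase-site.xml, though hdfs-site.xml is the usual place):

```xml
<!-- Default replication factor for newly written files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Note this only affects files written after the change; existing files keep their old factor until it is raised explicitly, e.g. with `hadoop fs -setrep -R 3 /hbase`.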



Arinto
www.otnira.com

On Tue, Feb 24, 2015 at 8:06 AM, Michael Segel <msegel@segel.com> wrote:

> I’m sorry, but I implied checking the checksums of the blocks.
> Didn’t think I needed to spell it out.  Next time I’ll be a bit more
> precise.
>
> > On Feb 23, 2015, at 2:34 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >
> > HBase/HDFS are maintaining block checksums, so presumably a corrupted
> block
> > would fail checksum validation. Increasing the number of replicas
> increases
> > the odds that you'll still have a valid block. I'm not an HDFS expert,
> but
> > I would be very surprised if HDFS is validating a "questionable block"
> via
> > byte-wise comparison over the network amongst the replica peers.
> >
> > On Mon, Feb 23, 2015 at 12:25 PM, Michael Segel <msegel@segel.com>
> wrote:
> >
> >>
> >> On Feb 23, 2015, at 1:47 AM, Arinto Murdopo <arinto@gmail.com> wrote:
> >>
> >> We're running HBase (0.94.15-cdh4.6.0) on top of HDFS (Hadoop
> >> 2.0.0-cdh4.6.0).
> >> For all of our tables, we set the replication factor to 1
> (dfs.replication
> >> = 1 in hbase-site.xml). We set to 1 because we want to minimize the HDFS
> >> usage (now we realize we should set this value to at least 2, because
> >> "failure is a norm" in distributed systems).
> >>
> >>
> >>
> >> Sorry, but you really want this to be a replication value of at least 3
> >> and not 2.
> >>
> >> Suppose you have corruption but not a lost block. Which copy of the two
> is
> >> right?
> >> With 3, you can compare the three and hopefully 2 of the 3 will match.
> >>
> >>
>
>
