hbase-user mailing list archives

From Ellimilial K <ellimil...@googlemail.com>
Subject Re: HBase all files corrupt / missing blocks
Date Tue, 03 Feb 2015 23:07:43 GMT
Hi Esteban,

I believe the upgrade went fine, i.e. the stack worked for a couple of days
until the main namenode died yesterday (possibly a GC timeout?). The backup
one then died (or did not roll), complaining about out-of-sync errors from
the journalnodes. When I restarted the journalnodes, both namenodes started
reporting no valid fsimage. At that point I tried namenode -recover, to no
avail. Finally I put in place a previously backed-up snapshot of the dfs name
directory from a couple of hours earlier, and at that point it started
reporting missing/corrupted blocks. Sorry for the non-HBase'y digression.
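
For the archives, the recovery attempts amounted to roughly this (the name
directory paths below are illustrative, not our exact layout):

    # stop the namenode, then try HDFS metadata recovery (interactive prompts)
    hdfs namenode -recover

    # when that failed: restore the backed-up snapshot of the name directory
    cp -a /backup/dfs/name/. /data/dfs/name/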

Thanks,
Mateusz

On 3 February 2015 at 22:45, Esteban Gutierrez <esteban@cloudera.com> wrote:

> Hi Mateusz,
>
> That's interesting, did you start the NN with the right fsimage after the
> upgrade? That might also explain this.
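>
> If you want to double-check which fsimage the NN actually loaded, the
> offline image viewer can dump it for inspection (the file name below is
> just an example):
>
>     hdfs oiv -p XML -i /data/dfs/name/current/fsimage_0000000000000000042 -o /tmp/fsimage.xml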
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Tue, Feb 3, 2015 at 2:26 PM, Ellimilial K <ellimilial@googlemail.com>
> wrote:
>
> > That's quite horrible, oh well, thanks for the help!
> >
> > Yes, positive, we started having issues with HA quorum a couple of days
> > after the migration, HBase has constantly been taking ~200 requests a
> > second via stargate, things seemed to work fine.
> >
> > Mateusz
> >
> > On 3 February 2015 at 22:11, Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> > wrote:
> >
> > > Those files and related data are most probably lost... I don't see any
> > > option other than deleting them.
> > >
> > > Are you sure those blocks were not missing before the migration? Did you
> > > have any crash during the migration process?
> > >
> > > JM
> > >
> > > 2015-02-03 13:14 GMT-08:00 Ellimilial K <ellimilial@googlemail.com>:
> > >
> > > > Thank you for the responses!
> > > >
> > > > @Jean-Marc
> > > > This comes from fsck /. I see a flood of these, at least in the
> > > > hundreds; for this particular region:
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/18c428413d7b4a89959911c9112a6eb9:
> > > > CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block
> > > > blk_1076062948
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/18c428413d7b4a89959911c9112a6eb9:
> > > > MISSING 1 blocks of total size 52243482 B..
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/49b265ba5c7942b0b8e2b788fd9d7362:
> > > > CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block
> > > > blk_1076077963
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/49b265ba5c7942b0b8e2b788fd9d7362:
> > > > MISSING 1 blocks of total size 6181 B...
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/ef3fc67a835b451aa7d18094ea141451:
> > > > CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block
> > > > blk_1076062891
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/ef3fc67a835b451aa7d18094ea141451:
> > > > MISSING 1 blocks of total size 11747149 B..
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/fedeb8062c454238bf1d1112b0f80b4b:
> > > > CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block
> > > > blk_1076077964
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/fedeb8062c454238bf1d1112b0f80b4b:
> > > > MISSING 1 blocks of total size 10431742 B..
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109:
> > > > CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block
> > > > blk_1076062900
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109:
> > > > MISSING 1 blocks of total size 929610 B...
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/bd41ca895f3749188c08dd2e540bc127:
> > > > CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block
> > > > blk_1076077966
> > > >
> > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/bd41ca895f3749188c08dd2e540bc127:
> > > > MISSING 1 blocks of total size 119139 B.........
> > > > (...) ending with:
> > > > ..........Status: CORRUPT
> > > >  Total size: 23155170955674 B (Total open files size: 1577 B)
> > > >  Total dirs: 21232
> > > >  Total files: 33311
> > > >  Total symlinks: 0 (Files currently being written: 61)
> > > >  Total blocks (validated): 199618 (avg. block size 115997409 B)
> > > >  (Total open file blocks (not validated): 19)
> > > >   ********************************
> > > >   CORRUPT FILES: 8245
> > > >   MISSING BLOCKS: 8245
> > > >   MISSING SIZE: 162010861748 B
> > > >   CORRUPT BLOCKS:  8245
> > > >   ********************************
> > > >  Minimally replicated blocks: 191373 (95.86961 %)
> > > >  Over-replicated blocks: 3241 (1.6236011 %)
> > > >  Under-replicated blocks: 0 (0.0 %)
> > > >  Mis-replicated blocks: 0 (0.0 %)
> > > >  Default replication factor: 3
> > > >  Average block replication: 2.916185
> > > >  Corrupt blocks: 8245
> > > >  Missing replicas: 0 (0.0 %)
> > > >  Number of data-nodes: 17
> > > >  Number of racks: 1
> > > >
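> > > > (For anyone replaying this later: the corrupt file list alone can be
> > > > pulled with the command below, which is far less noisy than the full
> > > > report.)
> > > >
> > > >     hdfs fsck / -list-corruptfileblocks
> > > >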
> > > > There are 8 files in directories within
> > > > hbase/data/default/table/ffa95306f599dbff99497e71841724fe, so I imagine
> > > > 6 of the 8 are affected.
> > > > The size of the missing blocks ranges from 2 KB up to ~70 MB. The table
> > > > concerned had ~3500 regions. All datanodes are up and appear to report
> > > > correctly, so unfortunately there is no replica lying around.
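> > > >
> > > > To rule out a stray replica I checked block locations per file, along
> > > > these lines (path shortened to the affected table):
> > > >
> > > >     hdfs fsck /hbase/data/default/table -files -blocks -locations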
> > > >
> > > > @esteban I double-checked: the volumes seem fine, and the total HDFS
> > > > size also looks unchanged. Datanodes look fine. It is a single cluster
> > > > (i.e. no cluster replication, if I'm answering the right question?),
> > > > freshly after an upgrade to 0.98 from 0.94 (or CDH 4.7 to 5.3), with
> > > > HDFS replication set to 3.
> > > >
> > > > Many thanks,
> > > > Mateusz
> > > >
> > > > On 3 February 2015 at 20:30, Esteban Gutierrez <esteban@cloudera.com>
> > > > wrote:
> > > >
> > > > > Hi Mateusz,
> > > > >
> > > > > As JMS mentioned, it is very likely the data is lost, but that type
> > > > > of corruption is usually due to some DNs being down or data volumes
> > > > > being removed for some reason. Have you tried to recover that data
> > > > > from those DNs first?
> > > > >
> > > > > From "for what looks like a continuous stream of regions" it sounds
> > > > > like you had a single replica configured for HBase. Is that the case?
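> > > > >
> > > > > (A quick way to check the replication factor of an individual file
> > > > > is something like the following, where the path would be one of the
> > > > > files from your fsck output:
> > > > >
> > > > >     hdfs dfs -stat %r /path/to/hfile
> > > > >
> > > > > or just read the second column of "hdfs dfs -ls" on it.)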
> > > > >
> > > > > esteban.
> > > > >
> > > > > --
> > > > > Cloudera, Inc.
> > > > >
> > > > >
> > > > > On Tue, Feb 3, 2015 at 12:04 PM, Jean-Marc Spaggiari <
> > > > > jean-marc@spaggiari.org> wrote:
> > > > >
> > > > > > Hi Mateusz,
> > > > > >
> > > > > > Data from this HFile is most probably lost. Is the block also
> > > > > > reported missing by fsck? Do you have any datanode down which might
> > > > > > contain this block? How big is this HFile? 929610 bytes only? If
> > > > > > so, one option might be just to delete this HFile.
> > > > > >
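> > > > > > If it comes to that, the deletion itself would just be along the
> > > > > > lines of (path taken from the fsck output above; hdfs dfs -mv to a
> > > > > > sideline directory would be the more cautious variant):
> > > > > >
> > > > > >     hdfs dfs -rm /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109
> > > > > >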
> > > > > > How many HFiles are within this region?
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2015-02-03 10:04 GMT-08:00 Ellimilial K <ellimilial@googlemail.com>:
> > > > > >
> > > > > > > We have recently experienced some issues with our namenodes in
> > > > > > > an HA arrangement and had to recreate the namenode metadata from
> > > > > > > a backup, while some new data had been pushed to the region
> > > > > > > servers in the meantime. We're on HBase 0.98.6.
> > > > > > >
> > > > > > > After launching the cluster again, we realised that we're missing
> > > > > > > ~8000/190000 blocks. Looking at the fsck output, we can see, for
> > > > > > > what looks like a continuous stream of regions:
> > > > > > >
> > > > > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109:
> > > > > > > MISSING 1 blocks of total size 929610 B...
> > > > > > >
> > > > > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/bd41ca895f3749188c08dd2e540bc127:
> > > > > > > CORRUPT blockpool BP-2037521063-<IP>-1418127576413 block
> > > > > > > blk_1076077966
> > > > > > >
> > > > > > > I did not want to run fsck -delete, and hbck complains because
> > > > > > > the files would not be allocated to region servers, reporting
> > > > > > > missing blocks.
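> > > > > > >
> > > > > > > For reference, the destructive fsck variants I was avoiding:
> > > > > > >
> > > > > > >     hdfs fsck / -move     # moves corrupt files to /lost+found
> > > > > > >     hdfs fsck / -delete   # deletes the corrupt files outright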
> > > > > > >
> > > > > > > The total size of this table is circa 22 TB on HDFS, and
> > > > > > > recreating it would be quite a drag (pushing it from our previous
> > > > > > > hbase cluster took about a month). Is there any known way of
> > > > > > > dealing with such a situation?
> > > > > > >
> > > > > > > Mateusz KaczyƄski
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
