hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: HBase all files corrupt / missing blocks
Date Tue, 03 Feb 2015 22:45:28 GMT
Hi Mateusz,

That's interesting. Did you start the NN with the right fsimage after the
upgrade? That might also explain this.

cheers,
esteban.


--
Cloudera, Inc.


On Tue, Feb 3, 2015 at 2:26 PM, Ellimilial K <ellimilial@googlemail.com>
wrote:

> That's quite horrible. Oh well, thanks for the help!
>
> Yes, positive. We started having issues with the HA quorum a couple of days
> after the migration. HBase has constantly been taking ~200 requests a
> second via Stargate, and things seemed to work fine.
>
> Mateusz
>
> On 3 February 2015 at 22:11, Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> wrote:
>
> > Those files and related data are most probably lost. I don't see any
> > other option than deleting them.
> >
> > Are you sure those blocks were not missing before the migration? Did you
> > have any crash during the migration process?
> >
> > JM
> >
> > 2015-02-03 13:14 GMT-08:00 Ellimilial K <ellimilial@googlemail.com>:
> >
> > > Thank you for the responses!
> > >
> > > @Jean-Marc
> > > This comes from fsck /. I see a flood of these, at least hundreds of
> > > them, for this particular region:
> > >
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/18c428413d7b4a89959911c9112a6eb9: CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block blk_1076062948
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/18c428413d7b4a89959911c9112a6eb9: MISSING 1 blocks of total size 52243482 B..
> > >
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/49b265ba5c7942b0b8e2b788fd9d7362: CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block blk_1076077963
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/extracted/49b265ba5c7942b0b8e2b788fd9d7362: MISSING 1 blocks of total size 6181 B...
> > >
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/ef3fc67a835b451aa7d18094ea141451: CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block blk_1076062891
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/ef3fc67a835b451aa7d18094ea141451: MISSING 1 blocks of total size 11747149 B..
> > >
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/fedeb8062c454238bf1d1112b0f80b4b: CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block blk_1076077964
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/pipeline/fedeb8062c454238bf1d1112b0f80b4b: MISSING 1 blocks of total size 10431742 B..
> > >
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109: CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block blk_1076062900
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109: MISSING 1 blocks of total size 929610 B...
> > >
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/bd41ca895f3749188c08dd2e540bc127: CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413 block blk_1076077966
> > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/bd41ca895f3749188c08dd2e540bc127: MISSING 1 blocks of total size 119139 B.........
> > > (...) ending with:
> > > ..........Status: CORRUPT
> > >  Total size: 23155170955674 B (Total open files size: 1577 B)
> > >  Total dirs: 21232
> > >  Total files: 33311
> > >  Total symlinks: 0 (Files currently being written: 61)
> > >  Total blocks (validated): 199618 (avg. block size 115997409 B) (Total open file blocks (not validated): 19)
> > >   ********************************
> > >   CORRUPT FILES: 8245
> > >   MISSING BLOCKS: 8245
> > >   MISSING SIZE: 162010861748 B
> > >   CORRUPT BLOCKS:  8245
> > >   ********************************
> > >  Minimally replicated blocks: 191373 (95.86961 %)
> > >  Over-replicated blocks: 3241 (1.6236011 %)
> > >  Under-replicated blocks: 0 (0.0 %)
> > >  Mis-replicated blocks: 0 (0.0 %)
> > >  Default replication factor: 3
> > >  Average block replication: 2.916185
> > >  Corrupt blocks: 8245
> > >  Missing replicas: 0 (0.0 %)
> > >  Number of data-nodes: 17
> > >  Number of racks: 1
> > >
> > > There are 8 files in the directories within
> > > hbase/data/default/table/ffa95306f599dbff99497e71841724fe, so I imagine
> > > 6/8 are affected.
> > > The size of the missing blocks ranges from 2 KB up to ~70 MB. The table
> > > concerned had ~3500 regions. All datanodes are up and look like they
> > > report correctly, so unfortunately there is no replica lying around.
> > >
> > > @esteban I double checked, the volumes seem fine, and the total HDFS size
> > > also looks unchanged. Datanodes look fine. It is a single cluster (i.e. no
> > > cluster replication, if I'm answering the question?), fresh after an
> > > upgrade to 0.98 from 0.94 (or CDH 4.7 to 5.3), with HDFS replication set
> > > to 3.
> > >
> > > Many thanks,
> > > Mateusz
> > >
> > > On 3 February 2015 at 20:30, Esteban Gutierrez <esteban@cloudera.com>
> > > wrote:
> > >
> > > > Hi Mateusz,
> > > >
> > > > As JMS mentioned, it is very likely the data is lost, but that type of
> > > > corruption is usually due to some DNs being down or data volumes being
> > > > removed for some reason. Have you tried to recover that data from those
> > > > DNs first?
> > > >
> > > > From "for what looks like a continuous stream of regions", it sounds
> > > > like you had a single replica configured for HBase. Is that the case?
> > > >
> > > > esteban.
> > > >
> > > > --
> > > > Cloudera, Inc.
> > > >
> > > >
> > > > On Tue, Feb 3, 2015 at 12:04 PM, Jean-Marc Spaggiari <
> > > > jean-marc@spaggiari.org> wrote:
> > > >
> > > > > Hi Mateusz,
> > > > >
> > > > > Data from this HFile is most probably lost. Is the block also reported
> > > > > as missing by fsck? Do you have any datanode down which might contain
> > > > > this block? How big is this HFile? 929610 bytes only? If so, one option
> > > > > might be to just delete this HFile.
> > > > >
> > > > > How many HFiles are within this region?
> > > > >
> > > > > JM
> > > > >
> > > > > 2015-02-03 10:04 GMT-08:00 Ellimilial K <ellimilial@googlemail.com>:
> > > > >
> > > > > > We have recently experienced some issues with our namenodes in an HA
> > > > > > arrangement and had to recreate namenode metadata from a backup, while
> > > > > > some new data had been pushed to the region servers in the meantime.
> > > > > > We're on HBase 0.98.6.
> > > > > >
> > > > > > After launching the cluster again, we realised that we're missing
> > > > > > ~8000/190000 blocks. Looking at fsck output, we can see, for what looks
> > > > > > like a continuous stream of regions:
> > > > > >
> > > > > >
> > > > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/35186fe43fed47989ddb4ace3648b109: MISSING 1 blocks of total size 929610 B...
> > > > > >
> > > > > > /hbase/data/default/table/ffa95306f599dbff99497e71841724fe/processed/bd41ca895f3749188c08dd2e540bc127: CORRUPT blockpool BP-2037521063-<IP>-1418127576413 block blk_1076077966
> > > > > >
> > > > > > I did not want to run fsck -delete, and hbck complains because the
> > > > > > files would not be allocated to region servers, reporting missing
> > > > > > blocks.
> > > > > >
> > > > > > The total size of this table is circa 22TB on HDFS and recreating it
> > > > > > would be quite a drag (pushing it from our previous hbase cluster took
> > > > > > about a month). Is there any known way of dealing with such a
> > > > > > situation?
> > > > > >
> > > > > > Mateusz Kaczyński
> > > > > >
> > > > >
> > > >
> > >
> >
>
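
Before deciding whether to delete the affected HFiles, it can help to see which files and regions the fsck report touches and how many bytes are involved. The following is a minimal sketch, not something posted in the thread: it assumes each fsck record is on a single line in the form quoted above (path, colon, then a CORRUPT or MISSING message), and that paths follow the /hbase/data/<namespace>/<table>/<region>/<family>/<hfile> layout seen here. The function names summarize_fsck and region_of are hypothetical, and other HDFS versions may format these lines differently.

```python
import re
from collections import defaultdict

# Patterns modeled on the fsck lines quoted in this thread (an assumption;
# they are not taken from any HDFS specification).
MISSING_RE = re.compile(
    r"^(?P<path>/\S+):\s+MISSING (?P<count>\d+) blocks of total size (?P<size>\d+) B"
)
CORRUPT_RE = re.compile(
    r"^(?P<path>/\S+):\s+CORRUPT blockpool (?P<pool>\S+) block (?P<block>blk_\d+)"
)


def summarize_fsck(lines):
    """Group MISSING/CORRUPT fsck records by file path."""
    report = defaultdict(
        lambda: {"missing_blocks": 0, "missing_bytes": 0, "corrupt_blocks": []}
    )
    for raw in lines:
        # Drop surrounding whitespace and any leading progress dots.
        line = raw.strip().lstrip(".")
        m = MISSING_RE.match(line)
        if m:
            entry = report[m.group("path")]
            entry["missing_blocks"] += int(m.group("count"))
            entry["missing_bytes"] += int(m.group("size"))
            continue
        m = CORRUPT_RE.match(line)
        if m:
            report[m.group("path")]["corrupt_blocks"].append(m.group("block"))
    return dict(report)


def region_of(path):
    """Return the region directory name, assuming the layout
    /hbase/data/<namespace>/<table>/<region>/<family>/<hfile>."""
    parts = path.strip("/").split("/")
    return parts[4] if len(parts) >= 7 else None


# Two records from the thread, with the mail-wrapped lines rejoined.
HFILE = (
    "/hbase/data/default/table/ffa95306f599dbff99497e71841724fe"
    "/processed/35186fe43fed47989ddb4ace3648b109"
)
SAMPLE = [
    HFILE + ": CORRUPT blockpool BP-2037521063-37.59.17.102-1418127576413"
    " block blk_1076062900",
    HFILE + ": MISSING 1 blocks of total size 929610 B...",
]

if __name__ == "__main__":
    for path, info in summarize_fsck(SAMPLE).items():
        print(region_of(path), path, info)
```

Feeding the saved output of fsck / through summarize_fsck gives one entry per affected HFile, which can then be grouped by region_of to see how many regions would lose data if the files were deleted.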
