hbase-user mailing list archives

From Wellington Chevreuil <wellington.chevre...@gmail.com>
Subject Re: TableSnapshotInputFormat failing to delete files under recovered.edits
Date Mon, 17 Jun 2019 19:54:48 GMT
It seems the mentioned "hiccup" caused RS crash(es), since you got RITs
and recovered edits under those regions' dirs. The fact that there was a
"recovered.edits" dir under some region dirs means that, when the snapshot
was taken, the crashed RS(es)' WALs had been split but not completely
replayed yet.
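
If you want to check which regions still have pending edits, something
like the sketch below would list them from the fs (just a sketch, assuming
the default hbase.rootdir layout; the table path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ListPendingRecoveredEdits {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path: <hbase.rootdir>/data/<namespace>/<table>
        Path tableDir = new Path("/hbase/data/default/my_table");
        for (FileStatus regionDir : fs.listStatus(tableDir)) {
          if (!regionDir.isDirectory()) continue;
          Path recovered = new Path(regionDir.getPath(), "recovered.edits");
          // A non-empty recovered.edits dir means split WAL edits not yet replayed
          if (fs.exists(recovered) && fs.listStatus(recovered).length > 0) {
            System.out.println("Pending recovered edits: " + regionDir.getPath().getName());
          }
        }
      }
    }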

Since you are facing an error when reading from the table snapshot, and
the stack trace shows TableSnapshotInputFormat is using the
"HRegion.openHRegion" code path to read the snapshotted data, it will
basically do the same thing an RS does when assigning a region. In this
case it finds a "recovered.edits" folder under the region dir, so it will
replay the edits there. This looks like a problem with
TableSnapshotInputFormat: it seems weird that it tries to delete edits in
a non-staging dir (your path suggests it's trying to delete the actual
edits folder), which could cause data loss if it managed to delete the
edits before the RSes actually replayed them. Would you know which
specific HBase version this is? Could your job restore the snapshot into
a temp table and then read from that temp table using TableInputFormat
instead?
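
If that's feasible, a rough sketch of that clone-and-read approach (the
snapshot/table names and the pass-through mapper are placeholders, not
your actual job):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class EtlFromClonedSnapshot {

      // Placeholder mapper: real ETL logic would go here instead of a pass-through
      static class PassThroughMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tempTable = TableName.valueOf("etl_temp");

        // 1) Clone the snapshot into a temporary table (references only, no data copy)
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          admin.cloneSnapshot("my_snapshot", tempTable);
        }

        // 2) Scan the temp table through TableInputFormat instead of the snapshot
        Job job = Job.getInstance(conf, "etl-from-cloned-snapshot");
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false); // avoid polluting the block cache with a full scan
        TableMapReduceUtil.initTableMapperJob(
            tempTable.getNameAsString(), scan, PassThroughMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class); // placeholder output
        job.waitForCompletion(true);

        // 3) Drop the temp table once the job is done
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          admin.disableTable(tempTable);
          admin.deleteTable(tempTable);
        }
      }
    }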

On Mon, Jun 17, 2019 at 17:22, Jacob LeBlanc <
jacob.leblanc@microfocus.com> wrote:

> Hi,
>
> We periodically execute Spark jobs to run ETL from some of our HBase
> tables to another data repository. The Spark jobs read data by taking a
> snapshot and then using the TableSnapshotInputFormat class. Lately we've
> been having some failures: when the jobs try to read the data, they try
> to delete files under the recovered.edits directory for some regions,
> and the user we run the jobs under doesn't have permission to do that.
> A pastebin of the error and stack trace from one of our job logs is
> here: https://pastebin.com/MAhVc9JB
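>
> The snapshot read itself is set up roughly like this (simplified sketch;
> the snapshot name and restore dir are placeholders):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>     import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
>     import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>     import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
>     import org.apache.hadoop.mapreduce.Job;
>     import org.apache.spark.SparkConf;
>     import org.apache.spark.api.java.JavaPairRDD;
>     import org.apache.spark.api.java.JavaSparkContext;
>
>     public class SnapshotRead {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         // Serialize the scan into the config so the input format picks it up
>         conf.set(TableInputFormat.SCAN,
>             TableMapReduceUtil.convertScanToString(new Scan()));
>         Job job = Job.getInstance(conf);
>         // Snapshot files get reference-linked into this restore dir
>         TableSnapshotInputFormat.setInput(job, "my_snapshot",
>             new Path("/user/etl/restore_tmp"));
>         JavaSparkContext sc = new JavaSparkContext(
>             new SparkConf().setAppName("hbase-snapshot-etl"));
>         JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
>             job.getConfiguration(), TableSnapshotInputFormat.class,
>             ImmutableBytesWritable.class, Result.class);
>         System.out.println("rows: " + rows.count());
>         sc.stop();
>       }
>     }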
>
> This started happening after upgrading to EMR 5.22, where the
> recovered.edits directory is colocated with the WALs in HDFS, whereas it
> used to be in S3-backed EMRFS.
>
> I have two questions regarding this:
>
>
> 1)      First of all, why are these files under the recovered.edits directory?
> The timestamp of the files coincides with a hiccup we had with our cluster
> where I had to use "hbase hbck -fixAssignments" to fix regions that were
> stuck in transition. But that command seemed to work just fine and all
> regions were assigned and there have since been no inconsistencies. Does
> this mean the WALs were not replayed correctly? Does "hbase hbck
> -fixAssignments" not recover regions properly?
>
> 2)      Why is our job trying to delete these files? I don't know enough
> to say for sure, but it seems like using TableSnapshotInputFormat to read
> snapshot data should not be trying to recover or delete edits.
>
> I've fixed the problems by running "assign '<region>'" in hbase shell for
> every region that had files under the recovered.edits directory and those
> files seemed to be cleaned up when the assignment completed. But I'd like
> to understand this better especially if something is interfering with
> replaying edits from WALs (also making sure our ETL jobs don't start
> failing would be nice).
>
> Thanks!
>
> --Jacob LeBlanc
>
>
