hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18771) Incorrect StoreFileRefresh leading to split and compaction failures
Date Thu, 07 Sep 2017 12:13:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156858#comment-16156858

ramkrishna.s.vasudevan commented on HBASE-18771:

Am just trying to understand the case better so that if there is any other route cause we
should also fix that. Nothing wrong in the patch or suggestion. 
Two things
-> Some times scanners get FNFE. So why is this? Is it due to incorrect store file accounting
? Then we should fix it.
-> Split case failures. When we actually close the store we split the files that are from
the present set of store files. The already compacted files are not included (in the normal
flow). So the references files are created for the present list of files only. I checked this
with 1.3 latest code. 
bq.(which also includes already compacted files due to incorrect refresh)
So who is doing this incorrect refresh? Again is that due to scan failure?  ( I agree the
refresh method has to checked but am more concerned about who is calling it and why?).

> Incorrect StoreFileRefresh leading to split and compaction failures
> -------------------------------------------------------------------
>                 Key: HBASE-18771
>                 URL: https://issues.apache.org/jira/browse/HBASE-18771
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.3.1
>            Reporter: Abhishek Singh Chouhan
>            Assignee: Abhishek Singh Chouhan
>            Priority: Blocker
>             Fix For: 3.0.0, 1.4.0, 1.3.2, 1.5.0, 2.0.0-alpha-3
>         Attachments: HBASE-18771.branch-1.3.001.patch, HBASE-18771.branch-1.3.002.patch
> We ran into issues of compaction and split failures with 1.3 similar to HBASE-18186 and
HBASE-17406. Here's what i believe is happening -
> Lets say we have 4 store files that are compacted to form a new one. At this point we
now have 5 store files, however only 1(the newly formed) is open now for the store and rest
are waiting to get archived by HFileArchiver
> Now before the files are archived we get a FNFE in a scanner. This results in HRegion.RegionScannerImpl.handleFileNotFound(FileNotFoundException
fnfe) being called which results in region.refreshStoreFiles(true) -> HStore.refreshStoreFiles()
> HStore.refreshStoreFiles now checks the hdfs dir and adds the previously compacted files
back to the store, however these files are also present in StoreFileManager's compactedFiles
list. Now at this point HFileArchiver runs, checks compactedFiles list and moves these files
into the archive directory. 
> Now when compaction runs it gets:
> 2017-09-04 12:30:13,899 ERROR [ctions-1504505399609] regionserver.CompactSplitThread
- Compaction selection failed regionName = xxxx, storeName = 0, priority = 26, time = 1504528213899
> java.io.FileNotFoundException: File does not exist: hdfs://xxxx
>         at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1337)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$23.doCall(DistributedFileSystem.java:1329)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1329)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422)
>         at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
>         at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
>         at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:325)
>         at org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
>         at org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:65)
>         at org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
>         at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
>         at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1679)
> Similarly if a split happens after archival we fail after PONR while opening daughter
regions due to FNFE. This results in parent offline and daughters also in a limbo since they're
unable to open. Since we get the error after PONR we also end up aborting the RS.

This message was sent by Atlassian JIRA

View raw message