hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zach York <zyork.contribut...@gmail.com>
Subject Re: Region split caching issue
Date Tue, 20 Feb 2018 23:31:41 GMT
Thanks for the response!


On Thu, Feb 15, 2018 at 11:52 PM, Stack <stack@duboce.net> wrote:

> On Wed, Feb 14, 2018 at 3:19 PM, Zach York <zyork.contribution@gmail.com>
> wrote:
>
> > Hello,
> >
> > I have been trying for some time to figure out an issue where splits fail
> > due to trying to cache the same key, but different value.
>
>
> Whats the exception look like Mighty Zach?
>

2018-01-12 16:45:17,263 INFO
[regionserver/ip-172-31-46-118.us-west-2.compute.internal/172.31.46.118:16020-splits-1515775482373]
regionserver.SplitRequest: Running rollback/cleanup of failed split of
test,9hp9mum6k2r9j19dbvax2s35f0eaoxzb8nqwz6*024566245231038769,1510040627287.6d1496d3c9ac6ebc09b02b7b959e5c5f.;
Failed <ip>,16020,1515773679237-daughterOpener=c5ff63db3b59a7c8ebe7491811a28660
java.io.IOException: Failed
<ip>,16020,1515773679237-daughterOpener=c5ff63db3b59a7c8ebe7491811a28660
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughters(SplitTransactionImpl.java:504)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.stepsAfterPONR(SplitTransactionImpl.java:597)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(SplitTransactionImpl.java:580)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.doSplitting(SplitRequest.java:82)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:154)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.io.IOException:
java.lang.RuntimeException: Cached block contents differ, which should
not have happened.cacheKey:test=c46e62f66866b1df8d7e18c8551dbec3-3908e0f20ad3410aadf4e4e8809f90f4.6d1496d3c9ac6ebc09b02b7b959e5c5f_131435429
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:954)
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:829)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:804)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6717)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughterRegion(SplitTransactionImpl.java:731)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl$DaughterOpener.run(SplitTransactionImpl.java:711)
	... 1 more
Caused by: java.io.IOException: java.lang.RuntimeException: Cached
block contents differ, which should not have
happened.cacheKey:test=c46e62f66866b1df8d7e18c8551dbec3-3908e0f20ad3410aadf4e4e8809f90f4.6d1496d3c9ac6ebc09b02b7b959e5c5f_131435429
	at org.apache.hadoop.hbase.regionserver.HStore.openStoreFiles(HStore.java:560)
	at org.apache.hadoop.hbase.regionserver.HStore.loadStoreFiles(HStore.java:514)
	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:277)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5194)
	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:928)
	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:925)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.RuntimeException: Cached block contents differ,
which should not have
happened.cacheKey:test=c46e62f66866b1df8d7e18c8551dbec3-3908e0f20ad3410aadf4e4e8809f90f4.6d1496d3c9ac6ebc09b02b7b959e5c5f_131435429
	at org.apache.hadoop.hbase.io.hfile.LruBlockCache.cacheBlock(LruBlockCache.java:368)
	at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:66)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:462)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:601)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:190)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:361)
	at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:511)
	at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:528)
	at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:518)
	at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:667)
	at org.apache.hadoop.hbase.regionserver.HStore.access$000(HStore.java:119)
	at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:534)
	at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:531)
	... 6 more





>
> ...
>
>
> > What I know/have observed:
> > This occurs when prefetch on open is true
> > Both daugher store openers will try to access and cache the first and
> last
> > values in the store (HalfStoreFile). [1] [2] [3]
> > Both daugher store openers will access the same key (the splitkey, I
> > believe).
> >
>
>
> The lastkey in one daughter will be the firstkey in the other?
>
>
>
I'm still trying to figure out the logic, but it looks like in getLastKey()
it seeks to the split key and then backs up if it needs to.

[1] passes the split key into seekBefore() which then goes into [2] where
it tries to seek to that block and then it has logic to get the previous
block if necessary.

[1]
https://github.com/apache/hbase/blob/rel/1.3.1/hbase-server/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java#L307
[2]
https://github.com/apache/hbase/blob/rel/1.3.1/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java#L668


>
>
> > * However they will access two different paths and retrieve two different
> > values.
> >
>
> Different values? Or is that the same path (the half store files resolve to
> same backing storefile) but different values somehow?
>
>
> It is the same path according to the fsBlockReader. I'm not sure if the
half files would have resolved at that time or not.

See these two log lines that I printed out (where thread is the thread ID):
Note the different paths in the fsBlockReader and different thread ids:

2018-02-20 22:41:27,741 WARN  [StoreFileOpenerThread-discrete-1]
hfile.HFileReaderV2: concurrent: 0, thread: 3598, dataBlockOffset:
488903844 onDiskBlockSize: 97961, pread: true, fsBlockReader:
hfs=org.apache.hadoop.hbase.fs.HFileSystem@d65b38e,
path=s3://<bucket>/hbase/data/default/test-1/feade760d02a57f5a99acfb6bf78e0ff/cf/2d0851b27c0143a1901e253c59d44129.b59b89676a973eede358ef5e7897b4d4,
fileContext=HFileContext [ usesHBaseChecksum=true checksumType=CRC32C
bytesPerChecksum=16384 blocksize=65536 encoding=NONE includesMvcc=true
includesTags=false compressAlgo=SNAPPY compressTags=false cryptoContext=[
cipher=NONE keyHash=NONE ] ]

2018-02-20 22:41:27,821 WARN  [StoreFileOpenerThread-discrete-1]
hfile.HFileReaderV2: concurrent: 0, thread: 3597, dataBlockOffset:
488903844 onDiskBlockSize: 97961, pread: true, fsBlockReader:
hfs=org.apache.hadoop.hbase.fs.HFileSystem@d65b38e,
path=s3://<bucket>/hbase/data/default/test-1/8a61a1839a150ee520644a054efc41ef/cf/2d0851b27c0143a1901e253c59d44129.b59b89676a973eede358ef5e7897b4d4,
fileContext=HFileContext [ usesHBaseChecksum=true checksumType=CRC32C
bytesPerChecksum=16384 blocksize=65536 encoding=NONE includesMvcc=true
includesTags=false compressAlgo=SNAPPY compressTags=false cryptoContext=[
cipher=NONE keyHash=NONE ] ]



>
> > Then when they try to cache these values, it fails because the other
> thread
> > cached it's value and the two values aren't equivalent. [4]
> > This is past the point of no return for the region split so it tries to
> > roll forward, but fails. These regions are then corrupted and cannot be
> > recovered.
> >
> > One thing I have notices is that these blocks are all leaf index blocks.
> I
> > guess these index blocks would be different if it was in a different
> file.
> > This is on HBase 1.3.1.
> >
> > It is definitely a concurrency bug as both daughter openers need to try
> to
> > get the block from disk before the other caches the block. Why would it
> be
> > trying to access two different paths for the same key? Is that normal in
> a
> > reference file?
> >
> >
> The two paths are probably the two reference file paths, one for each
> daughter. That the paths are different should be fine (even if same value);
> the key the block cache uses is the based off filename.
>
> But you seem to be saying the key is the same so ultimately the file and
> offset we are getting the value with -- used to make the block cache key --
> are the same but somehow the value gotten differs?
>
>
>
> > Does anyone have any idea as to what might be occurring?
> > If you need more info, I can show you the code path that it is running
> > through.
> >
> >
> Say more Zach. Perhaps we can help out.
>
> Thanks,
>
> S
>
>
>
I'm continuing to do more investigation and will update more as I go along.
Any insight is appreciated

Thanks,
Zach

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message