hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject RE: HFile.Reader scans return latest version?
Date Tue, 31 May 2011 21:27:57 GMT
Thanks for the pointers.

The damage manifested as scanners skipping over a range in our time series data.  We knew
from other systems that there should be some records in that region that weren't returned.
When we looked closely we saw an extremely improbable jump in rowkeys that should be evenly
distributed UUIDs beneath an hourly prefix.  We checked the region listing and start/end keys
in the regionserver UI, and found a region listed that wasn't being served.  We traced it
back to a couple of possible locations under /hbase, and got some odd results when we tried
to point the HFile main method at those files.
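
For anyone who wants to reproduce the check, a scan like this minimal sketch will surface the jump. It's illustrative only: the hourly prefix is the affected hour from the listings below, and the client calls are the 0.20/0.89-era API we're on.

// Minimal sketch (not our actual tooling): scan one hourly prefix and
// eyeball the rowkeys. UUIDs under a single hourly prefix should be
// evenly distributed, so a large lexicographic jump between consecutive
// rows marks the hole.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SpotCheckScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "ets.derived.events.pb");
    // Bound the scan to the affected hour.
    Scan scan = new Scan(Bytes.toBytes("2010-09-28-05:"),
                         Bytes.toBytes("2010-09-28-06:"));
    ResultScanner rs = table.getScanner(scan);
    for (Result r : rs) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    rs.close();
  }
}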

Here's the region we found missing along with the next one and the previous one:

Previous:
  Region: ets.derived.events.pb,2010-09-28-02:dcba1a8d00d945e6a90442c9561e8ac4,1285667269423
  Server: ets-lax-prod-hadoop-10.corp.adobe.com:60030
  Start:  2010-09-28-02:dcba1a8d00d945e6a90442c9561e8ac4
  End:    2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d

Affected region:
  Region: ets.derived.events.pb,2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d,1285684268773
  Server: ets-lax-prod-hadoop-04.corp.adobe.com:60030
  Start:  2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d
  End:    2010-09-28-11:29664000a226486e9ecb7547a738d101

Next:
  Region: ets.derived.events.pb,2010-09-28-11:29664000a226486e9ecb7547a738d101,1285687842817
  Server: ets-lax-prod-hadoop-07.corp.adobe.com:60030
  Start:  2010-09-28-11:29664000a226486e9ecb7547a738d101
  End:    2010-09-28-12:f8fa9dc21bfe4091a4864d0adc655b4d


The affected region on the RS UI:
  Region: ets.derived.events.pb,2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d,1285684268773.1836172434
  Start:  2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d
  End:    2010-09-28-11:29664000a226486e9ecb7547a738d101
  Stats:  stores=1, storefiles=1, storefileSizeMB=45, memstoreSizeMB=0, storefileIndexSizeMB=0


Directory for the region on HDFS (guessing based on the suffix from the RS UI):
/hbase/ets.derived.events.pb/1836172434


Here's what happened when we ran the HFile main method on those files:

Checked with HFile:

[hadoop@ets-lax-prod-hadoop-01 ~]$ hbase org.apache.hadoop.hbase.io.hfile.HFile -r 'ets.derived.events.pb,2010-09-28-05:5457075d4f9345908bdfd89b5b641d3d,1285684268773.1836172434' -v
cat: /opt/hadoop/hbase/target/cached_classpath.txt: No such file or directory
region dir -> hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684
Number of region files found -> 0

Note that it found a different directory on HDFS than I would have thought.  Looking at that
file with HFile, it doesn't like it:

[hadoop@ets-lax-prod-hadoop-01 ~]$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684 -v -k
cat: /opt/hadoop/hbase/target/cached_classpath.txt: No such file or directory
Scanning -> hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684
ERROR, file doesnt exist: hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/107531684

Pointed it at the file I thought it was, and although it's there on HDFS, HFile can't find it:
[hadoop@ets-lax-prod-hadoop-01 ~]$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/1836172434 -v -k
cat: /opt/hadoop/hbase/target/cached_classpath.txt: No such file or directory
Scanning -> hdfs://ets-lax-prod-hadoop-01.corp.adobe.com:54310/hbase/ets.derived.events.pb/1836172434
java.io.FileNotFoundException: File does not exist: /hbase/ets.derived.events.pb/1836172434
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1586)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1577)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:428)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:185)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:431)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.<init>(HFile.java:742)
        at org.apache.hadoop.hbase.io.hfile.HFile.main(HFile.java:1870)

[hadoop@ets-lax-prod-hadoop-01 ~]$ hadoop dfs -ls  /hbase/ets.derived.events.pb/1836172434
Found 2 items
-rw-r--r--   3 hadoop hadoop        862 2010-09-28 07:31 /hbase/ets.derived.events.pb/1836172434/.regioninfo
drwxr-xr-x   - hadoop hadoop          0 2011-05-06 16:36 /hbase/ets.derived.events.pb/1836172434/f1
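
So the directory is there, but the tool derives a different encoded name from the region name. If I understand the pre-0.90 code right (this is an assumption on my part; the encoding changed to an MD5 hex string in later versions), the directory suffix is an int hash of the full region name, which a tiny sketch like this would print:

// Hedged sketch, assuming the pre-0.90 HRegionInfo API where
// encodeRegionName(byte[]) returns an int hash of the full region name.
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.util.Bytes;

public class PrintEncodedName {
  public static void main(String[] args) {
    // args[0]: the full region name as shown in META / the master UI,
    // e.g. "ets.derived.events.pb,2010-09-28-05:...,1285684268773"
    System.out.println(HRegionInfo.encodeRegionName(Bytes.toBytes(args[0])));
  }
}

Comparing that output for the name with and without the ".1836172434" suffix might show where the 107531684 came from.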


Ran an hbase hbck, which came back clean.  Stopped and restarted HBase, after which hbck did
report errors (not sure why it was OK before and not after; maybe a split happened in the
interim or something, but we are running with durability now, so hopefully a change to META
would not get lost).  After that I made a backup and tried add_table.rb, which seemed to make
the problem worse.  We eventually concluded that we must have lost a write to META last year,
when we were running Hadoop 0.20.1 and HBase 0.20.3 without durability (we're currently on
CDH3b3).  This is supported by the fact that other environments running the same code are OK,
and hadoop fsck / is also healthy.

My solution is to create a broadly similar table and read the HFiles from the old one directly
into it.  This is an MR job with an HFileInputFormat I wrote using the HFile API, and a
TableOutputFormat into the new table (I didn't want to put writing directly to HFiles on my
plate at this time).  Once that's done and verified, I'll drop the older table and move on.
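
In case it's useful to anyone, here's a minimal sketch of the core loop the record reader wraps (not the actual job; the class name and the 0.20/0.89-era Reader/scanner signatures are assumptions, and those signatures change in later versions). Since HFileScanner hands back every entry, adding the raw KeyValue to the Put preserves the original timestamps, so a default Scan on the new table still returns the latest version:

// Minimal sketch (not the real MR job): read one store file via the
// HFile API and replay its KeyValues into the replacement table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class CopyHFileToTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path hfilePath = new Path(args[0]);        // one file under .../f1/
    HTable table = new HTable(conf, args[1]);  // the replacement table

    // 0.20/0.89-era signatures; later versions differ.
    HFile.Reader reader = new HFile.Reader(fs, hfilePath, null, false);
    reader.loadFileInfo();                     // required before scanning
    HFileScanner scanner = reader.getScanner();
    if (scanner.seekTo()) {                    // false if the file is empty
      do {
        KeyValue kv = scanner.getKeyValue();
        Put put = new Put(kv.getRow());
        put.add(kv);                           // keeps family/qualifier/timestamp
        table.put(put);
      } while (scanner.next());
    }
    table.flushCommits();
    reader.close();
  }
}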

Because of the version of HBase we're running, we don't have hbck -fix available, and I assume
it's been months since the damage happened, which might mean we have some overlapping regions.
 It might be hard to manually stitch them back together, so this holistic approach seemed
like the best bet.

One thing I can put in the win column for HBase is that the damaged table still functions fine
in the parts that don't have holes, which is most of the table.  So we can keep running
against most of our dataset (and workload) and take the time to fix the damage carefully.

Sandy

> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Tuesday, May 31, 2011 13:10
> To: user@hbase.apache.org
> Subject: Re: HFile.Reader scans return latest version?
> 
> On Tue, May 31, 2011 at 11:05 AM, Sandy Pratt <prattrs@adobe.com> wrote:
> > Hi all,
> >
> > I'm doing some work to read records directly from the HFiles of a damaged
> > table.  When I scan through the records in the HFile using
> > org.apache.hadoop.hbase.io.hfile.HFileScanner, will I get only the latest
> > version of the record as with a default HBase Scan?  Or do I need to do
> > some work to pull out the latest version from several?
> >
> 
> It looks like it just returns all entries in the hfile.  See tests -- e.g.
> TestHFile -- for how to make an HFile Reader instance and pull the values.
> The tail of HFile has some examples too?
> 
> Tell us about the 'damaged table'.
> 
> St.Ack.
