hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-5210) HFiles are missing from an incremental load
Date Sat, 18 Jul 2015 10:16:04 GMT

     [ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Lars Hofhansl resolved HBASE-5210.
    Resolution: Cannot Reproduce

Closing for now

> HFiles are missing from an incremental load
> -------------------------------------------
>                 Key: HBASE-5210
>                 URL: https://issues.apache.org/jira/browse/HBASE-5210
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.2
>         Environment: HBase 0.90.2 with Hadoop-0.20.2 (with durable sync).  RHEL 2.6.18-164.15.1.el5.
 4 node cluster (1 master, 3 slaves)
>            Reporter: Lawrence Simpson
>         Attachments: HBASE-5210-crazy-new-getRandomFilename.patch
> We run an overnight map/reduce job that loads data from an external source and adds that
data to an existing HBase table.  The input files have been loaded into hdfs.  The map/reduce
job uses the HFileOutputFormat (and the TotalOrderPartitioner) to create HFiles which are
subsequently added to the HBase table.  On at least two separate occasions (that we know of),
a range of output would be missing for a given day.  The range of keys for the missing values
corresponded to those of a particular region.  This implied that a complete HFile somehow
went missing from the job.  Further investigation revealed the following:
>  * Two different reducers (running in separate JVMs and thus separate class loaders)
>  * in the same server can end up using the same file names for their
>  * HFiles.  The scenario is as follows:
>  * 	1.	Both reducers start near the same time.
>  * 	2.	The first reducer reaches the point where it wants to write its first file.
>  * 	3.	It uses the StoreFile class which contains a static Random object 
>  * 		which is initialized by default using a timestamp.
>  * 	4.	The file name is generated using the random number generator.
>  * 	5.	The file name is checked against other existing files.
>  * 	6.	The file is written into temporary files in a directory named
>  * 		after the reducer attempt.
>  * 	7.	The second reduce task reaches the same point, but its StoreClass
>  * 		(which is now in the file system's cache) gets loaded within the
>  * 		time resolution of the OS and thus initializes its Random()
>  * 		object with the same seed as the first task.
>  * 	8.	The second task also checks for an existing file with the name
>  * 		generated by the random number generator and finds no conflict
>  * 		because each task is writing files in its own temporary folder.
>  * 	9.	The first task finishes and gets its temporary files committed
>  * 		to the "real" folder specified for output of the HFiles.
>  *     10.	The second task then reaches its own conclusion and commits its
>  * 		files (moveTaskOutputs).  The released Hadoop code just overwrites
>  * 		any files with the same name.  No warning messages or anything.
>  * 		The first task's HFiles just go missing.
>  * 
>  *  Note:  The reducers here are NOT different attempts at the same 
>  *  	reduce task.  They are different reduce tasks so data is
>  *  	really lost.
> I am currently testing a fix in which I have added code to the Hadoop 
> FileOutputCommitter.moveTaskOutputs method to check for a conflict with
> an existing file in the final output folder and to rename the HFile if
> needed.  This may not be appropriate for all uses of FileOutputFormat.
> So I have put this into a new class which is then used by a subclass of
> HFileOutputFormat.  Subclassing of FileOutputCommitter itself was a bit 
> more of a problem due to private declarations.
> I don't know if my approach is the best fix for the problem.  If someone
> more knowledgeable than myself deems that it is, I will be happy to share
> what I have done and by that time I may have some information on the
> results.

This message was sent by Atlassian JIRA

View raw message