hbase-issues mailing list archives

From "Laxman (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
Date Tue, 13 Mar 2012 09:37:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228297#comment-13228297
] 

Laxman commented on HBASE-5564:
-------------------------------

Scope of this issue.

1) Avoid the behavioral inconsistency with the timestamp parameter.

{noformat}
Currently in the code:
a) If the timestamp parameter is configured, duplicate records are overwritten.
b) If it is not configured, some duplicate records are kept as different versions.
{noformat}

This fix should be in line with the expectation Todd has mentioned.

bq. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load
exactly, and correlate it to the MR job that inserted it.
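
To illustrate the inconsistency in 1), here is a toy model of versioned cells. It is not HBase code: the class and method names are hypothetical, and it only assumes that a cell is identified by row, column and timestamp, so that two writes with identical coordinates collapse into one while writes with different timestamps survive as separate versions.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of versioned cells; NOT the HBase implementation.
class VersionedCellModel {
    // "row/column" -> (timestamp -> value), newest timestamp first.
    private final Map<String, TreeMap<Long, String>> cells = new HashMap<>();

    // A write with the same row, column and timestamp replaces the earlier value.
    void put(String row, String column, long ts, String value) {
        cells.computeIfAbsent(row + "/" + column,
                k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Number of distinct versions kept for one cell.
    int versionCount(String row, String column) {
        TreeMap<Long, String> versions = cells.get(row + "/" + column);
        return versions == null ? 0 : versions.size();
    }

    public static void main(String[] args) {
        // Case a): one configured timestamp shared by every record.
        VersionedCellModel configured = new VersionedCellModel();
        configured.put("row1", "emp:name", 1000L, "first");
        configured.put("row1", "emp:name", 1000L, "duplicate");
        System.out.println("configured ts: " + configured.versionCount("row1", "emp:name"));

        // Case b): assumed here that each split stamps its records with its own time.
        VersionedCellModel perSplit = new VersionedCellModel();
        perSplit.put("row1", "emp:name", 1000L, "from split A");
        perSplit.put("row1", "emp:name", 2000L, "from split B");
        System.out.println("per-split ts: " + perSplit.versionCount("row1", "emp:name"));
    }
}
```

In this model, duplicates that share a timestamp end up as a single version, while duplicates stamped differently are kept as two versions, which mirrors cases a) and b) above.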

2) Provide an option to look up the timestamp column value from the input data (like the ROWKEY column).
Example : importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code'
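
A rough sketch of how the lookup in 2) could work, assuming tab-separated input and a column spec like the example above. This is not the patch itself: the class and method names are hypothetical, and trimming whitespace around spec entries is an assumption made for this illustration.

```java
// Hypothetical sketch of a per-record timestamp lookup; not the actual patch.
class TsColumnSketch {
    static final String TS_KEY = "HBASE_TS_KEY";

    // Position of the marker in the comma-separated column spec, or -1 if absent.
    static int indexOf(String columnsSpec, String marker) {
        String[] cols = columnsSpec.split(",");
        for (int i = 0; i < cols.length; i++) {
            if (cols[i].trim().equals(marker)) { // trimming is an assumption
                return i;
            }
        }
        return -1;
    }

    // Timestamp for one input line: taken from the data if the spec names a
    // timestamp column, otherwise the supplied default.
    static long timestampFor(String columnsSpec, String tsvLine, long defaultTs) {
        int tsIdx = indexOf(columnsSpec, TS_KEY);
        if (tsIdx < 0) {
            return defaultTs;
        }
        String[] fields = tsvLine.split("\t", -1); // keep trailing empty fields
        if (tsIdx >= fields.length) {
            throw new IllegalArgumentException("line has no timestamp field: " + tsvLine);
        }
        return Long.parseLong(fields[tsIdx].trim());
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code";
        String line = "row1\t1331631459000\tLaxman\t100\tD1";
        System.out.println(timestampFor(spec, line, 0L));
    }
}
```

With this approach, records that carry distinct timestamps in the input would be stored as distinct versions regardless of which split they land in.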

I will submit the patch with the above mentioned approach.

Any other add-ons?
                
> Bulkload is discarding duplicate records
> ----------------------------------------
>
>                 Key: HBASE-5564
>                 URL: https://issues.apache.org/jira/browse/HBASE-5564
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
>         Environment: HBase 0.92
>            Reporter: Laxman
>            Assignee: Laxman
>              Labels: bulkloader
>
> Duplicate records are discarded when they exist in the same input file, and more specifically when they exist in the same split.
> Duplicate records are kept if the records are from different splits.
> Version under test: HBase 0.92

