hbase-issues mailing list archives

From "Bhupendra Kumar Jain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13702) ImportTsv: Add dry-run functionality and log bad rows
Date Wed, 03 Jun 2015 13:59:38 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570867#comment-14570867 ]

Bhupendra Kumar Jain commented on HBASE-13702:
----------------------------------------------

As per the current patch, dry-run executes only the Map task, so it's useful only when the
Map task carries a lot of extra logic (parsing, validation, transformation, etc.). Dry run can
execute that logic and output the errors.

But there may be a lot of logic in the Combiner and Reducer phases as well, which dry-run will
not check. So I think it is better to rename the dry-run function to *dry-run-map*. That would
be much clearer.
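
To make the scope of a map-only dry run concrete, here is a minimal, dependency-free sketch in plain Java. It is not the code from HBASE-13702.patch and does not use the Hadoop or HBase APIs; the class name `DryRunMapSketch`, the method `mapPhase`, and the `sink` list standing in for the job output are all hypothetical. It only illustrates the idea from the issue description: parse every line, log the bad ones, and use an if-else to skip the writes when dry-run is on. Anything that would normally happen after this step (Combiner, Reducer) is not exercised at all.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the HBASE-13702 patch and not ImportTsv itself. */
public class DryRunMapSketch {

    /** Outcome of one map-only pass over the input. */
    static final class Result {
        final List<String> badLines = new ArrayList<>();
        int written = 0;
    }

    /**
     * Parses each tab-separated line. Bad lines are logged and collected.
     * Good lines are written to the sink only when dryRun is false.
     */
    static Result mapPhase(List<String> lines, int expectedColumns,
                           boolean dryRun, List<String> sink) {
        Result result = new Result();
        for (String line : lines) {
            // A limit of -1 keeps trailing empty fields, so column counts stay honest.
            String[] fields = line.split("\t", -1);
            boolean bad = fields.length != expectedColumns || fields[0].isEmpty();
            if (bad) {
                // Log the whole offending row so every bad row shows up in one run.
                System.err.println("Bad line: " + line);
                result.badLines.add(line);
                continue;
            }
            if (!dryRun) {
                // Stand-in for emitting KVs; skipped entirely in dry-run mode.
                sink.add(line);
                result.written++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of("row1\ta\tb", "row2\tonly-one-value", "\tx\ty", "row3\tc\td");
        List<String> sink = new ArrayList<>();
        Result r = mapPhase(input, 3, true, sink);
        System.out.println("bad=" + r.badLines.size() + " written=" + r.written);
    }
}
```

In this sketch the validation path is identical in both modes, so a dry run reports exactly the rows a real run would reject in the map step, while leaving the sink untouched.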

> ImportTsv: Add dry-run functionality and log bad rows
> -----------------------------------------------------
>
>                 Key: HBASE-13702
>                 URL: https://issues.apache.org/jira/browse/HBASE-13702
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Apekshit Sharma
>            Assignee: Apekshit Sharma
>         Attachments: HBASE-13702.patch
>
>
> ImportTSV job skips bad records by default (keeps a count though). -Dimporttsv.skip.bad.lines=false
can be used to fail if a bad row is encountered. 
> Being able to easily determine which rows in an input are corrupted, rather than failing
on one row at a time, seems like a good feature to have.
> Moreover, there should be 'dry-run' functionality in such kinds of tools, which can essentially
do a quick run of the tool without making any changes but reporting any errors/warnings and
success/failure.
> To identify corrupted rows, simply logging them should be enough. In the worst case, all
rows will be logged and the size of the logs will be the same as the input size, which seems fine. However,
the user might have to do some work figuring out where the logs are. Is there some link we can show
to the user when the tool starts which can help them with that?
> For the dry run, we can simply use if-else to skip over writing out KVs, and any other
mutations, if present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
