hbase-issues mailing list archives

From "Istvan Vajnorak (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8593) Type support in ImportTSV tool
Date Thu, 10 Oct 2013 09:13:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791343#comment-13791343 ]

Istvan Vajnorak commented on HBASE-8593:
----------------------------------------

Dear All,

As part of a Hadoop POC, I also ran into this issue and decided to write my own loader
based on the ImportTSV tool.
Besides type safety, I also had key chunks in the file that had to be merged into one
"compound key", so I decided to extend the DSL of the input pattern with type and
key fragment awareness, similar to this:

$HBASE_HOME/bin/hbase com.msci.appdev.hbase.report.job.ReportImportJob \
  -Dcom.msci.reports.mappingRule=KEY_PART1[i],KEY_PART2[i],KEY_PART3[s],o:t[s],o:p[i],o:r[i],o:c[s] \
  -Dcom.msci.reports.tablename=swarm_of_reports_5_billion \
  -Dcom.msci.reports.inputPath=hdfs://ddc-rm-lapp0001.dev.msci.org:8020/opt/data/import/swarm_of_reports/ca737044-8b13-4fb1-b56f-e0ac66f13230.tsv \
  -Dcom.msci.reports.outputPath=hdfs://ddc-rm-lapp0001.dev.msci.org:8020/opt/data/import/swarm_reports_again \
  -Dcom.msci.reports.performBulkLoad=true
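
For illustration, here is a minimal sketch of how a mapping rule of this shape could be parsed. It is not the actual ReportImportJob code. The class and field names are hypothetical, and it assumes that key fragments are recognised by a KEY_PART name prefix and that an entry without a [x] suffix carries no type information.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parses a rule such as "KEY_PART1[i],o:t[s],o:c"
// into (name, type code, key-fragment flag) entries.
final class MappingRuleSketch {

    /** One parsed entry of the mapping rule; all names are illustrative. */
    static final class ColumnSpec {
        final String name;      // e.g. "KEY_PART1" or "o:t"
        final String typeCode;  // e.g. "i"; empty when no [x] suffix was given
        final boolean keyPart;  // assumption: key fragments start with "KEY_PART"

        ColumnSpec(String name, String typeCode, boolean keyPart) {
            this.name = name;
            this.typeCode = typeCode;
            this.keyPart = keyPart;
        }
    }

    // A name, optionally followed by a one-letter type code in square brackets.
    private static final Pattern ENTRY =
            Pattern.compile("([^\\[\\],]+)(?:\\[([a-z])\\])?");

    static List<ColumnSpec> parse(String rule) {
        List<ColumnSpec> specs = new ArrayList<>();
        for (String entry : rule.split(",")) {
            Matcher m = ENTRY.matcher(entry.trim());
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed mapping entry: " + entry);
            }
            String name = m.group(1);
            String typeCode = m.group(2) == null ? "" : m.group(2);
            specs.add(new ColumnSpec(name, typeCode, name.startsWith("KEY_PART")));
        }
        return specs;
    }
}
```

With the rule from the command above, parse(...) would yield seven entries, the first three flagged as key fragments. How the fragments are then merged into the compound key is a separate step that this sketch does not cover.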

The system accepts the type parameters i, s, l and d; if it finds no type information, it
treats the column value as a String.
Type conversion is then delegated to the Bytes class, for example:

public enum InputDataType {

    SHORT("s") {
        @Override
        public byte[] toBytes(String value) {
            return Bytes.toBytes(Short.parseShort(value));
        }
    },
...
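
To make the dispatch idea concrete without pulling in HBase, here is a self-contained sketch of what a complete enum of this kind might look like. It is an assumption, not the original code: the enum name, the fromCode lookup and the constants other than SHORT are made up for illustration, and java.nio.ByteBuffer (big-endian by default) stands in for the Bytes class, which as far as I know produces the same fixed-width big-endian layout.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: ByteBuffer replaces HBase's Bytes so the example runs standalone.
enum InputDataTypeSketch {

    INT("i") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Integer.BYTES).putInt(Integer.parseInt(value)).array();
        }
    },
    SHORT("s") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Short.BYTES).putShort(Short.parseShort(value)).array();
        }
    },
    LONG("l") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Long.BYTES).putLong(Long.parseLong(value)).array();
        }
    },
    DOUBLE("d") {
        @Override
        byte[] toBytes(String value) {
            return ByteBuffer.allocate(Double.BYTES).putDouble(Double.parseDouble(value)).array();
        }
    },
    // Assumed default when the mapping rule carries no type code.
    STRING("") {
        @Override
        byte[] toBytes(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }
    };

    private final String code;

    InputDataTypeSketch(String code) {
        this.code = code;
    }

    abstract byte[] toBytes(String value);

    /** Maps a type code to a constant; a missing code falls back to STRING. */
    static InputDataTypeSketch fromCode(String code) {
        if (code == null || code.isEmpty()) {
            return STRING;
        }
        for (InputDataTypeSketch type : values()) {
            if (type.code.equals(code)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown type code: " + code);
    }
}
```

A mapper could then call InputDataTypeSketch.fromCode(code).toBytes(field) for each column. Values that do not parse raise NumberFormatException, so a real loader would need to decide whether to skip or fail on bad lines.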

Should this be of any interest, I can share the code to some extent, which could help to
assess whether this approach is viable at large scale.

One thing I noticed is that type-safe conversion may add some CPU overhead, but this can
be offset on the I/O side, since in some cases much less data needs to be written out.

Example:
 The number 2147483647 can be encoded as an int in 4 bytes, while in String form it takes
10 bytes in UTF-8.
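
The arithmetic can be checked with a few lines of plain Java. This is only a sketch: the class and method names are made up, and ByteBuffer is used instead of the HBase Bytes class so that it runs standalone.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Compares the encoded size of an int in fixed-width binary form and as UTF-8 text.
final class EncodedSizeSketch {

    /** Size of the value as a fixed-width binary int (always 4 bytes). */
    static int binaryIntSize(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array().length;
    }

    /** Size of the value's decimal text form in UTF-8 (one byte per character). */
    static int textSize(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int value = Integer.MAX_VALUE; // 2147483647
        System.out.println(binaryIntSize(value) + " bytes as int, "
                + textSize(value) + " bytes as text");
        // prints "4 bytes as int, 10 bytes as text"
    }
}
```

The saving depends on the data: a value such as 7 takes 1 byte as text but still 4 bytes as an int, so the binary form only wins for numbers with more than four characters.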

Best regards, 
 Istvan



> Type support in ImportTSV tool
> ------------------------------
>
>                 Key: HBASE-8593
>                 URL: https://issues.apache.org/jira/browse/HBASE-8593
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Anoop Sam John
>            Assignee: rajeshbabu
>             Fix For: 0.96.0
>
>         Attachments: HBASE-8593.patch, HBASE-8593_v2.patch, HBASE-8593_v4.patch
>
>
> Currently the ImportTSV tool treats all table columns as type String and converts
> the input data into bytes accordingly. Sometimes a user will need a type such as
> int/float to be added to the table using this tool.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
