hbase-user mailing list archives

From Mahesha999 <abnav...@gmail.com>
Subject Escaping separator in data while bulk loading using importtsv tool and ingesting numeric values
Date Thu, 07 Jul 2016 12:31:45 GMT
I am using the importtsv tool to ingest data into HBase 1.1.5, and I have a
few questions.

First, can it ingest non-string (numeric) values? I was referring to this
article
<http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/>
which describes importtsv in the Cloudera distribution. It says: "it
interprets everything as strings". I am not sure what exactly that means.

I am using a simple word count example where the first column is a word and
the second column is its count.

When the file looks like this:

"access","1"
"about","1"

and I ingest it and then run scan in the hbase shell, I get the following
output:

 about                                 column=f:count, timestamp=1467716881104, value="1"
 access                                column=f:count, timestamp=1467716881104, value="1"

When I change the file as follows (double quotes around the count removed):

"access",1
"about",1

and I ingest it and then run scan in the hbase shell, I get the following
output (the double quotes around the count are gone):

 about                                 column=f:count, timestamp=1467716881104, value=1
 access                                column=f:count, timestamp=1467716881104, value=1
 
As you can see, there are no double quotes in the count's value. *Q1. Does
that mean it is stored as an integer and not as a string?* The Cloudera
article suggests that a custom MR job needs to be written to ingest
non-string values. If the above is already ingesting integer values, I do
not understand what the article means.
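
To make Q1 concrete, here is a small sketch in plain Python (not HBase code) of the two byte layouts I am trying to tell apart. It assumes that "stored as a string" means the UTF-8 bytes of the text in the file, and that "stored as an integer" means a 4-byte big-endian value, which is my assumption about how a Java client would usually serialize an int. The variable names are only for illustration.

```python
import struct

# The field as text, with and without the surrounding double quotes.
quoted_text = '"1"'.encode("utf-8")   # 3 bytes: quote, digit, quote
plain_text = "1".encode("utf-8")      # 1 byte: the ASCII digit

# The number 1 as a 4-byte big-endian integer (assumed binary layout).
binary_int = struct.pack(">i", 1)

print(quoted_text)   # b'"1"'
print(plain_text)    # b'1'
print(binary_int)    # b'\x00\x00\x00\x01'
```

If this assumption holds, a scan showing value=1 would correspond to the single text byte, while a binary integer would show up as four bytes, mostly non-printable.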

My other question is whether I can escape the column separator when it
appears inside a column value. For example, in importtsv we can specify the
separator as follows:

-Dimporttsv.separator=,

However, what if I have employee data where the first column is the employee
name and the second column is the address? My file will have rows like
this:

"mahesh","A6,Hyatt Appartment"

The second comma makes importtsv think that there are three columns, and it
throws BadTsvLineException("Excessive columns").
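
The sketch below (plain Python, not part of importtsv) illustrates the problem and one possible workaround. It assumes importtsv splits each line purely on the separator character, as the error suggests. The workaround is to pre-process the file with a quote-aware CSV reader and rewrite it with a separator that never occurs in the data. The function name rewrite_with_separator and the choice of "|" are hypothetical.

```python
import csv
import io

line = '"mahesh","A6,Hyatt Appartment"'

# A plain split on the separator sees three fields, not two.
naive_fields = line.split(",")

def rewrite_with_separator(text, new_sep="|"):
    """Parse quoted CSV text and re-join each row with new_sep.

    Raises ValueError if new_sep already occurs inside a field,
    because the rewritten line would be ambiguous again.
    """
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if any(new_sep in field for field in row):
            raise ValueError("separator %r occurs in data: %r" % (new_sep, row))
        rows.append(new_sep.join(row))
    return "\n".join(rows)

print(naive_fields)                  # three pieces
print(rewrite_with_separator(line))  # mahesh|A6,Hyatt Appartment
```

The rewritten file could then be loaded with the separator option set to the new character (quoted on the command line, since "|" is special to the shell). Note that the CSV reader also strips the surrounding double quotes.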

So I tried escaping the comma with a backslash ('\') and, just out of
curiosity, escaping a backslash with another backslash (that is, "\\"). My
file had the following lines:

"able","1\"
"z","1\"
"za","1\\1"

When I ran scan in the hbase shell, it gave the following output:

 able                                  column=f:count, timestamp=1467716881104, value="1\x5C"
 z                                     column=f:count, timestamp=1467716881104, value="1\x5C"
 za                                    column=f:count, timestamp=1467716881104, value="1\x5C\x5C1"

*Q2. So it seems that instead of escaping the character that follows the
backslash, it encodes the backslash itself as "\x5C". Is that right? Is
there no way to escape the column separator while bulk loading data using
importtsv?*
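
One possible reading of the output above is that the backslash is stored unchanged and only displayed in hex form by the shell. The sketch below is a hypothetical helper, shell_style, and not the shell's actual code. It assumes the shell prints printable ASCII as-is, except the backslash, and prints every other byte as \xNN. Under that assumption it reproduces the values I see in the scan.

```python
def shell_style(data: bytes) -> str:
    """Render bytes the way the scan output appears to (assumed rule)."""
    out = []
    for b in data:
        ch = chr(b)
        if 32 <= b <= 126 and ch != "\\":
            out.append(ch)               # printable ASCII shown literally
        else:
            out.append("\\x%02X" % b)    # everything else shown as hex
    return "".join(out)

# Field values exactly as they appear in my input file.
print(shell_style(b'"1\\"'))      # "1\x5C"
print(shell_style(b'"1\\\\1"'))   # "1\x5C\x5C1"
```

If that is what is happening, the "\x5C" would be a display artifact and the stored bytes would still contain the literal backslash.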





--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Escaping-separator-in-data-while-bulk-loading-using-importtsv-tool-and-ingesting-numeric-values-tp4081081.html
Sent from the HBase User mailing list archive at Nabble.com.
