hadoop-common-user mailing list archives

From macmarcin <macmar...@gmail.com>
Subject What is the best way to load data with control characters into HDFS?
Date Thu, 14 Apr 2011 00:39:50 GMT
I have a problem where my input data contains various control characters.  I
thought I could load the data (100+ GB of tab-delimited files) and then
run a Perl streaming script to clean it up (I wanted to take advantage of
the parallelism of the Hadoop framework).  However, since some of the data
contains "^M" and other special characters, the input records get broken up
into multiple records in HDFS.  I am trying to load this data as is (not as
a sequence file).

For example, this is some of the data in the input file:

1232141:32432   test.com/template
Next > ^M\

I have a Perl script that takes care of this issue when I run it against the
input file (outside HDFS), but unfortunately it does not work as a streaming
job; I think the special characters are getting translated into line breaks
somewhere in the HDFS loading and record-reading path.
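That guess matches how typical text line readers behave: a bare carriage return ("^M", i.e. \r) is treated as a line terminator on its own, so one logical record containing it comes back as two records. A quick Python illustration (Python here is only a stand-in for the record reader, not part of any Hadoop job):

```python
# str.splitlines(), like most text line readers, treats a bare carriage
# return (\r) as a line terminator, so this single logical record is
# split into two pieces.
record = "1232141:32432\ttest.com/template Next > \r\\"
pieces = record.splitlines()
print(len(pieces))  # the one record comes back as 2 records
```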

Is there any way I could still use Hadoop to clean up the data?  Or should I
just clean it up first and then load it into HDFS?
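For reference, one workaround (a sketch, not from the original thread, and assuming pre-load cleanup is acceptable) is to strip the control characters while streaming the file into HDFS, so the bytes are clean before any record reader sees them. The filter below works on raw bytes to avoid any newline translation of its own:

```python
import sys

# Delete every ASCII control byte except tab (0x09) and newline (0x0a),
# so stray ^M (0x0d) and friends can no longer act as record terminators.
KEEP = {0x09, 0x0a}
DELETE = bytes(b for b in range(0x20) if b not in KEEP)

def clean(data: bytes) -> bytes:
    """Return data with unwanted control bytes removed."""
    return data.translate(None, DELETE)

if __name__ == "__main__":
    # Binary I/O avoids Python's own universal-newline translation of \r.
    while chunk := sys.stdin.buffer.read(1 << 20):
        sys.stdout.buffer.write(clean(chunk))
```

It can then be piped straight into HDFS, e.g. `python clean.py < input.tsv | hadoop fs -put - /user/me/input.tsv` (the paths are placeholders), so the data is written once, already clean.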


View this message in context: http://hadoop-common.472056.n3.nabble.com/What-is-the-best-way-to-load-data-with-control-characters-into-HDFS-tp2818487p2818487.html
