From: "Saurabh Bhatnagar (Business Intelligence)" <saurabh.writes@gmail.com>
To: user@hive.apache.org
Subject: Converting from textfile to sequencefile using Hive
Date: Sun, 29 Sep 2013 13:35:23 -0400

Hi,

I have a lot of tweets saved as text. I created an external table on top of them to access them as a textfile. I need to convert these to sequencefiles, with each tweet as its own record. To do this, I created another table stored as a sequencefile, like so:

CREATE EXTERNAL TABLE tweetseq(
  tweet STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq';

Now when I insert into this table from my original tweets table (the statement is sketched below), each line gets its own record as expected. This is great. However, I don't have any record ids here. Short of writing my own UDF to make that happen, are there any obvious solutions I am missing?
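For concreteness, the insert step is just this (tweets_text stands in for the name of my actual text table):

INSERT OVERWRITE TABLE tweetseq
SELECT tweet
FROM tweets_text;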
PS, I need the ids to be there because mahout seq2sparse expects them. Without ids, it fails with:

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
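For illustration, the closest I have gotten in plain HiveQL is embedding an id into the value itself, e.g. a UUID generated with the built-in reflect() UDF (again, tweets_text stands in for my real table):

INSERT OVERWRITE TABLE tweetseq
SELECT concat(reflect('java.util.UUID', 'randomUUID'), ',', tweet)
FROM tweets_text;

But as far as I can tell, Hive still writes the SequenceFile key as a BytesWritable, which seems to be exactly what seq2sparse is tripping over above.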
Regards,
S