From: "Saurabh Bhatnagar (Business Intelligence)" <saurabh.writes@gmail.com>
To: user@hive.apache.org
Subject: Converting from textfile to sequencefile using Hive
Date: Sun, 29 Sep 2013 13:35:23 -0400

Hi,

I have a lot of tweets saved as text. I created an external table on top of them to access them as a textfile. I need to convert these to sequencefiles, with each tweet as its own record. To do this, I created another table stored as a sequencefile, like so:

CREATE EXTERNAL TABLE tweetseq(
  tweet STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq';

Now when I insert into this table from my original tweets table (the statement is sketched below), each line gets its own record as expected. This is great. However, I don't have any record ids here. Short of writing my own UDF to make that happen, are there any obvious solutions I am missing?
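For concreteness, the insert step is just this (tweets_text stands in for the name of my actual text table):

INSERT OVERWRITE TABLE tweetseq
SELECT tweet
FROM tweets_text;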
PS, I need the ids to be there because mahout seq2sparse expects them. Without ids, it fails with:

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
        at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
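For illustration, the closest I have gotten in plain HiveQL is embedding an id into the value itself, e.g. a UUID generated with the built-in reflect() UDF (again, tweets_text stands in for my real table):

INSERT OVERWRITE TABLE tweetseq
SELECT concat(reflect('java.util.UUID', 'randomUUID'), ',', tweet)
FROM tweets_text;

But as far as I can tell, Hive still writes the SequenceFile key as a BytesWritable, which seems to be exactly what seq2sparse is tripping over above.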
Regards,
S