Subject: Re: Converting from textfile to sequencefile using Hive
From: Saurabh B
To: user@hive.apache.org
Date: Mon, 30 Sep 2013 15:55:59 -0400

Thanks Sean, that is exactly what I want.

On Mon, Sep 30, 2013 at 3:09 PM, Sean Busbey wrote:

> S,
>
> Check out these presentations from Data Science Maryland back in May [1].
>
> 1. Working with tweets in Hive:
>
> http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978
>
> 2. Then pulling stuff out of Hive to use with Mahout:
>
> http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
>
> The Mahout talk didn't have a directly useful outcome (largely because it
> tried to work with the tweets as individual text documents), but it does
> go through all the mechanics of exactly what you say you want.
>
> The meetup page also has links to video, if the slides don't give enough
> context.
>
> HTH
>
> [1]: http://www.meetup.com/Data-Science-MD/events/111081282/
>
>
> On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B wrote:
>
>> Hi Nitin,
>>
>> No offense taken. Thank you for your response. Part of this is also
>> trying to find the right tool for the job.
>>
>> I am doing queries to determine the cuts of tweets that I want, then
>> doing some modest normalization (through a Python script), and then I
>> want to create SequenceFiles from that.
>>
>> So far Hive seems to be the most convenient way to do this. But I can
>> take a look at Pig too. It looks like "STORED AS SEQUENCEFILE" gets me
>> 99% of the way there. So I was wondering if there was a way to get
>> those ids in there as well. The last piece is always the stumbler :)
>>
>> Thanks again,
>>
>> S
>>
>>
>> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar wrote:
>>
>>> Are you using Hive just to convert your text files to sequence files?
>>> If that's the case, you may want to look at the purpose Hive was
>>> developed for. It may not be the best fit if you just want to modify
>>> or process data on a routine basis, without any kind of analytics
>>> functions involved.
>>>
>>> If you want to do data manipulation or enrichment and do not want to
>>> code a lot of MapReduce jobs, you can take a look at Pig scripts.
>>> Basically, what you want to do is generate a UUID for each of your
>>> tweets and then feed them to the Mahout algorithms.
>>>
>>> Sorry if I understood it wrong or it sounds rude.
>
>
> --
> Sean
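[Archive note: a sketch of the text-to-SequenceFile conversion being discussed. The table and column names below are made up for illustration; they are not from the thread.]

```sql
-- Hypothetical source table holding raw tweets as tab-delimited text
CREATE TABLE tweets_text (
  tweet_id STRING,
  body     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Target table with the same schema, stored as a SequenceFile
CREATE TABLE tweets_seq (
  tweet_id STRING,
  body     STRING
)
STORED AS SEQUENCEFILE;

-- Hive rewrites the rows into SequenceFile format on insert
INSERT OVERWRITE TABLE tweets_seq
SELECT tweet_id, body FROM tweets_text;
```

One caveat, which is likely the "last piece" Saurabh hit: as far as I can tell, Hive packs all selected columns into the SequenceFile *value* and leaves the key empty, so getting the tweet ids in as keys takes an extra step outside plain Hive DDL.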
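[Archive note: since the normalization already goes through a Python script, one low-tech way to attach the UUIDs Nitin suggests is to add them in that same streaming step. A minimal sketch, assuming tab-delimited records on stdin; the field layout is an assumption, not from the thread.]

```python
import sys
import uuid

def tag_with_uuid(lines):
    """Prepend a random UUID to each tab-delimited record."""
    for line in lines:
        record = line.rstrip("\n")
        yield "{0}\t{1}".format(uuid.uuid4(), record)

if __name__ == "__main__":
    # When used from Hive's TRANSFORM clause, records arrive on stdin
    # and the tagged records go back out on stdout.
    for tagged in tag_with_uuid(sys.stdin):
        sys.stdout.write(tagged + "\n")
```

This could then be wired into the existing Hive flow with something like `ADD FILE add_uuid.py;` followed by a `SELECT TRANSFORM (...) USING 'python add_uuid.py' ...` into the SequenceFile-backed table (script name hypothetical).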