Subject: Re: Converting from textfile to sequencefile using Hive
From: Saurabh B
To: user@hive.apache.org
Date: Mon, 30 Sep 2013 15:55:59 -0400

Thanks Sean, that is exactly what I want.

On Mon, Sep 30, 2013 at 3:09 PM, Sean Busbey wrote:

> S,
>
> Check out these presentations from Data Science Maryland back in May [1].
>
> 1. Working with tweets in Hive:
>
> http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978
>
> 2. Then pulling stuff out of Hive to use with Mahout:
>
> http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
>
> The Mahout talk didn't have a directly useful outcome (largely because it
> tried to work with the tweets as individual text documents), but it does
> go through all the mechanics of exactly what you say you want.
>
> The meetup page also has links to video, if the slides don't give enough
> context.
>
> HTH
>
> [1]: http://www.meetup.com/Data-Science-MD/events/111081282/
>
>
> On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B wrote:
>
>> Hi Nitin,
>>
>> No offense taken. Thank you for your response. Part of this is also
>> trying to find the right tool for the job.
>>
>> I am doing queries to determine the cuts of tweets that I want, then
>> doing some modest normalization (through a Python script), and then I
>> want to create SequenceFiles from that.
>>
>> So far Hive seems to be the most convenient way to do this. But I can
>> take a look at Pig too. It looks like "STORED AS SEQUENCEFILE" gets me
>> 99% of the way there. So I was wondering if there was a way to get
>> those ids in there as well. The last piece is always the stumbler :)
>>
>> Thanks again,
>>
>> S
>>
>>
>> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar wrote:
>>
>>> Are you using Hive just to convert your text files to sequence files?
>>> If that's the case, you may want to look at the purpose Hive was
>>> developed for. It may not be the best fit if you just want to modify
>>> or process data on a routine basis, without any kind of analytics
>>> functions involved.
>>>
>>> If you want to do data manipulation or enrichment and do not want to
>>> code a lot of MapReduce jobs, you can take a look at Pig scripts.
>>> Basically, what you want to do is generate a UUID for each of your
>>> tweets and then feed them to the Mahout algorithms.
>>>
>>> Sorry if I understood it wrong or it sounds rude.
>
>
> --
> Sean
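[Archive note: a sketch of the text-to-SequenceFile conversion being discussed. The table and column names below are made up for illustration; they are not from the thread.]

```sql
-- Hypothetical source table holding raw tweets as tab-delimited text
CREATE TABLE tweets_text (
  tweet_id STRING,
  body     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Target table with the same schema, stored as a SequenceFile
CREATE TABLE tweets_seq (
  tweet_id STRING,
  body     STRING
)
STORED AS SEQUENCEFILE;

-- Hive rewrites the rows into SequenceFile format on insert
INSERT OVERWRITE TABLE tweets_seq
SELECT tweet_id, body FROM tweets_text;
```

One caveat, which is likely the "last piece" Saurabh hit: as far as I can tell, Hive packs all selected columns into the SequenceFile *value* and leaves the key empty, so getting the tweet ids in as keys takes an extra step outside plain Hive DDL.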
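[Archive note: since the normalization already goes through a Python script, one low-tech way to attach the UUIDs Nitin suggests is to add them in that same streaming step. A minimal sketch, assuming tab-delimited records on stdin; the field layout is an assumption, not from the thread.]

```python
import sys
import uuid

def tag_with_uuid(lines):
    """Prepend a random UUID to each tab-delimited record."""
    for line in lines:
        record = line.rstrip("\n")
        yield "{0}\t{1}".format(uuid.uuid4(), record)

if __name__ == "__main__":
    # When used from Hive's TRANSFORM clause, records arrive on stdin
    # and the tagged records go back out on stdout.
    for tagged in tag_with_uuid(sys.stdin):
        sys.stdout.write(tagged + "\n")
```

This could then be wired into the existing Hive flow with something like `ADD FILE add_uuid.py;` followed by a `SELECT TRANSFORM (...) USING 'python add_uuid.py' ...` into the SequenceFile-backed table (script name hypothetical).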