hive-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: Wikipedia Dump Analysis..
Date Tue, 08 Oct 2013 08:26:30 GMT
Hi Ajeet,

Unfortunately, many of us are not familiar with the Wikipedia dump format,
so it is not clear where the contributor information would come from. If
you could point that out and let us know where you are stuck with Hive, we
could throw out some ideas.
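
For what it is worth, here is a minimal sketch of one possible starting
point. It assumes the contributor data is meant to come from the
<contributor> element inside each <revision> of the XML dump, which holds
either a <username> or, for anonymous edits, an <ip>. The function names
(iter_contributors, to_tsv), the sample usernames, and the namespace URI in
the sample are made up for illustration. Judging by the record you pasted,
the plain-text conversion keeps only the title and body, so this reads the
XML directly and writes tab-separated rows that an external Hive table
could sit on.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of a MediaWiki XML export.
# The namespace version and the contributor values are illustrative only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Davy Jones (musician)</title>
    <revision>
      <contributor><username>ExampleUser</username><id>1</id></contributor>
      <text>...</text>
    </revision>
  </page>
  <page>
    <title>Another page</title>
    <revision>
      <contributor><ip>192.0.2.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_contributors(source):
    """Yield (page_title, kind, value) for each revision's contributor.

    kind is 'user' for a registered username and 'ip' for an anonymous edit.
    Contributors with neither field (for example, hidden ones) are skipped.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "contributor":
            username = ip = None
            for child in elem:
                child_name = local(child.tag)
                if child_name == "username":
                    username = child.text
                elif child_name == "ip":
                    ip = child.text
            if username is not None:
                yield (title, "user", username)
            elif ip is not None:
                yield (title, "ip", ip)
        elif name == "page":
            # Drop the finished page's children to keep memory use down.
            elem.clear()
            title = None


def to_tsv(row):
    """Format one row as a tab-separated line for a Hive TEXTFILE table."""
    return "\t".join(row)


if __name__ == "__main__":
    for row in iter_contributors(io.BytesIO(SAMPLE)):
        print(to_tsv(row))
```

Two caveats. The pages-articles dump carries only the current revision of
each page, so this sees the last editor rather than the full history. Also,
the dump records usernames and IP addresses, not countries, so counting
contributors from India or China would still need a separate lookup (for
example, geolocating the anonymous IPs), and registered users could not be
located this way at all.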

Sonal

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>




On Tue, Oct 8, 2013 at 6:39 AM, Ajeet S Raina <ajeetraina@gmail.com> wrote:

> Any suggestion??
> On 7 Oct 2013 11:24, "Ajeet S Raina" <ajeetraina@gmail.com> wrote:
>
>> I was just trying to see whether some interesting analysis is possible
>> or not. One thing that came to mind was tracking contributors, so I
>> thought I would ask about that.
>>
>> Is it really possible?
>> On 7 Oct 2013 11:13, "Ajeet S Raina" <ajeetraina@gmail.com> wrote:
>>
>>> I can see that the revision history could be the place to look, but I
>>> have no idea how to go about it. Any suggestions?
>>> On 7 Oct 2013 10:34, "Sonal Goyal" <sonalgoyal4@gmail.com> wrote:
>>>
>>>> Sorry, where is the contributor information coming from?
>>>>
>>>> Best Regards,
>>>> Sonal
>>>> Nube Technologies <http://www.nubetech.co>
>>>>
>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Oct 3, 2013 at 11:57 AM, Ajeet S Raina <ajeetraina@gmail.com> wrote:
>>>>
>>>>> > Hello,
>>>>> >
>>>>> > I have Hadoop running on HDFS with Hive installed. I am able to
>>>>> > import the Wikipedia dump below into HDFS with the following command:
>>>>> >
>>>>> > http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>>>>> >
>>>>> > $ hadoop jar out.jar \
>>>>> >     edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
>>>>> >     -input /home/wikimedia/input/enwiki-latest-pages-articles.xml \
>>>>> >     -output /home/wikimedia/output/3
>>>>> >
>>>>> > I am able to run Hive over the converted dump. I have created one
>>>>> > sample Hive table based on a small amount of data I converted:
>>>>> >
>>>>> > CREATE EXTERNAL TABLE wiki_page(page_title string, page_body string)
>>>>> > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>>> > STORED AS TEXTFILE
>>>>> > LOCATION '/home/wikimedia/output/3';
>>>>> >
>>>>> > It gave me a record like the one shown below:
>>>>> >
>>>>> > Davy Jones (musician) Davy Jones (musician)           David Thomas
>>>>> > "Davy" Jones (30 December 1945 – 29 February 2012) was an English
>>>>> > recording artist and actor, best known as a member of The Monkees. Early
>>>>> > lifeDavy Jones was born at 20 Leamington Street, Openshaw, Manchester,
>>>>> > England, on 30 December 1945. At age 11, he began his acting career…
>>>>> >
>>>>> > My overall objective is to find out how many contributors are from
>>>>> > India and China.
>>>>> >
>>>>> > Any suggestions on how to achieve that?
>>>>> >
>>>>>
>>>>
>>>>
