From: Sonal Goyal <sonalgoyal4@gmail.com>
To: user@hive.apache.org
Date: Tue, 8 Oct 2013 13:56:30 +0530
Subject: Re: Wikipedia Dump Analysis..

Hi Ajeet,

Unfortunately, many of us are not familiar with the Wikipedia dump format, so it is not clear where the contributor information would come from. If you could point that out and let us know where exactly you are stuck with Hive, we could offer some ideas.

Best Regards,
Sonal
Nube Technologies

On Tue, Oct 8, 2013 at 6:39 AM, Ajeet S Raina <ajeetraina@gmail.com> wrote:
> Any suggestion??
>
> On 7 Oct 2013 11:24, "Ajeet S Raina" wrote:
>
>> I was just trying to see whether some interesting analysis is possible or not. One thing that came to mind was tracking contributors, so I thought about that.
>>
>> Is it really possible?
>>
>> On 7 Oct 2013 11:13, "Ajeet S Raina" wrote:
>>
>>> I could see that revision history could be the target factor, but I have no idea how to go about it. Any suggestions?
>>>
>>> On 7 Oct 2013 10:34, "Sonal Goyal" wrote:
>>>
>>>> Sorry, where is the contributor information coming from?
>>>>
>>>> On Thu, Oct 3, 2013 at 11:57 AM, Ajeet S Raina wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have Hadoop running on HDFS with Hive installed. I am able to import the Wikipedia dump into HDFS. The dump comes from:
>>>>>
>>>>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>>>>>
>>>>> and is converted with the command below:
>>>>>
>>>>> $ hadoop jar out.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText -input /home/wikimedia/input/enwiki-latest-pages-articles.xml -output /home/wikimedia/output/3
>>>>>
>>>>> I am able to run Hive over the converted dump. I have created one sample Hive table based on a small amount of data I converted:
>>>>>
>>>>> CREATE EXTERNAL TABLE wiki_page(page_title string, page_body string)
>>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>>> STORED AS TEXTFILE
>>>>> LOCATION '/home/wikimedia/output/3';
>>>>>
>>>>> It created a record for me as shown below:
>>>>>
>>>>> Davy Jones (musician)    David Thomas "Davy" Jones (30 December 1945 – 29 February 2012) was an English recording artist and actor, best known as a member of The Monkees. Early life: Davy Jones was born at 20 Leamington Street, Openshaw, Manchester, England, on 30 December 1945. At age 11, he began his acting career…
>>>>>
>>>>> My overall objective is to know how many contributors are from India and China.
>>>>>
>>>>> Any suggestion on how to achieve that?
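[Editorial note on the question raised in this thread: in the MediaWiki XML export format, each <revision> element carries a <contributor> child holding either a <username>/<id> pair for registered editors or an <ip> for anonymous edits. A minimal Python sketch of pulling those out with the standard library, using a made-up toy snippet in place of the real multi-gigabyte dump (the element names follow the export schema; the sample data is invented):]

```python
# Sketch: where contributor data lives in a Wikipedia XML dump.
# Each <revision> has a <contributor> with <username>/<id> (registered)
# or <ip> (anonymous). The toy snippet stands in for the real dump.
import xml.etree.ElementTree as ET

SAMPLE = """<mediawiki>
  <page>
    <title>Davy Jones (musician)</title>
    <revision>
      <contributor><username>ExampleUser</username><id>42</id></contributor>
    </revision>
  </page>
  <page>
    <title>Some other page</title>
    <revision>
      <contributor><ip>203.0.113.7</ip></contributor>
    </revision>
  </page>
</mediawiki>"""

def contributors(xml_text):
    """Yield (page_title, contributor) pairs from export-format XML."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            contrib = rev.find("contributor")
            if contrib is None:
                continue
            # Registered editors carry <username>; anonymous ones <ip>.
            name = contrib.findtext("username") or contrib.findtext("ip")
            yield title, name

print(list(contributors(SAMPLE)))
# → [('Davy Jones (musician)', 'ExampleUser'), ('Some other page', '203.0.113.7')]
```

Note that the real dump declares an XML namespace, which ElementTree prefixes onto tag names; the toy snippet omits it for brevity.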
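[Editorial note on the India/China question: once contributors are extracted into a tab-separated file like the one the external table above reads, the per-country tally is a plain aggregation (in Hive, roughly a COUNT(DISTINCT contributor) grouped by country). The catch is that the dump itself records contributors but not their countries; mapping a contributor to a country would need an outside source such as IP geolocation for anonymous edits. A minimal Python sketch over hypothetical, made-up rows:]

```python
# Hypothetical (contributor, country) rows — the country column does NOT
# exist in the dump and would have to come from an external source such
# as IP geolocation. All names and values below are invented.
from collections import Counter

rows = [
    ("ExampleUserA", "India"),
    ("ExampleUserB", "China"),
    ("ExampleUserC", "India"),
    ("203.0.113.7", "Other"),
]

# Count distinct contributors per country, mirroring a Hive GROUP BY.
counts = Counter(country for _, country in rows)
print(counts["India"], counts["China"])  # → 2 1
```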