Date: Thu, 24 Apr 2014 13:58:12 -0400
Subject: hadoop+python+text mining
From: qiaoresearcher
To: user@hadoop.apache.org

I have Hadoop and Python installed, along with NLTK. I have a large input file with three columns:

column 1 | column 2 | column 3
positive   id1        some tweet message
negative   id2        other tweet message
positive   id3        tweet message
negative   id4        tweet message
positive   id5        tweet message
....       ...        ....

I want to use text mining to construct TF-IDF vectors from the tweet messages (with stop-word removal, stemming, etc.) and then use a classifier to label each tweet message as positive or negative. I know how to do this using just Python and NLTK, but how do I do the same thing on Hadoop?

Thanks!
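
As an illustration of the TF-IDF step described above, here is a minimal sketch in plain Python. It is not a method taken from this thread. It assumes the three columns are separated by whitespace, uses a tiny hypothetical stop-word list in place of NLTK's, skips stemming, and uses the plain log(N/df) weighting, which is one common variant. All function names are made up for the example. The document-frequency count is written as a map step and a reduce step because Hadoop Streaming can run any program that reads lines on stdin and writes tab-separated key/value lines on stdout, so the same two functions could be wrapped in small scripts for that purpose.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch;
# a real job would more likely use a fuller list such as NLTK's).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}


def parse_line(line):
    """Split 'label id message...' into (label, id, message), or None if malformed."""
    parts = line.strip().split(None, 2)
    if len(parts) < 3:
        return None
    return parts[0], parts[1], parts[2]


def tokenize(message):
    """Lowercase the message, keep word-like tokens, drop stop words (no stemming)."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    return [w for w in words if w not in STOP_WORDS]


def map_doc_freq(lines):
    """Map step: emit (term, 1) once for every tweet that contains the term."""
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        for term in set(tokenize(record[2])):
            yield term, 1


def reduce_doc_freq(pairs):
    """Reduce step: sum the counts per term to get document frequencies."""
    doc_freq = Counter()
    for term, count in pairs:
        doc_freq[term] += count
    return doc_freq


def tfidf_vector(message, doc_freq, n_docs):
    """Sparse TF-IDF vector (term -> weight) using tf * log(n_docs / df)."""
    term_freq = Counter(tokenize(message))
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
        if doc_freq.get(term)
    }
```

Run locally, `reduce_doc_freq(map_doc_freq(open(path)))` gives the document frequencies for a file at a hypothetical `path`. In a streaming job the mapper would print each pair as `term<TAB>1` and the reducer would sum the values per key, with the sort between the two steps handled by the framework.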