From: Suneel Marthi <suneel_marthi@yahoo.com>
To: user@mahout.apache.org
Date: Thu, 30 May 2013 12:55:47 -0700 (PDT)
Subject: Re: Feature vector generation from Bag-of-Words

That's correct. Also, SnowballAnalyzer implicitly converts all text to lower case, so you can skip that step in your computation.

All of your keywords would have to be run through the SnowballAnalyzer first, and the same goes for your documents, before you make the call to Multiset.retainAll(keywords).

I am assuming that all of your documents are English text only. Lucene has language-specific analyzers (some of which implicitly invoke the SnowballFilter) if you have to deal with other languages.

While on the topic of Lucene, make sure you are using the Lucene 4.2.x libraries (that's the Lucene version in Mahout trunk).
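For reference, a minimal sketch of running text through the SnowballAnalyzer with the Lucene 4.2.x analysis API. The helper class name and the "body" field name are illustrative, not anything from Mahout; stemming both the keywords and the document terms this way is what makes the later retainAll(keywords) call line up.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public final class SnowballStemming {

      // Run text through the analyzer and collect the stemmed,
      // lower-cased tokens it emits.
      public static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<String>();
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          tokens.add(term.toString());
        }
        ts.end();
        ts.close();
        return tokens;
      }

      public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_42, "English");
        // "Recommend" and "Recommended" both come out as the stem "recommend",
        // already lower-cased, so exact matching against stemmed keywords works.
        System.out.println(analyze(analyzer, "Recommend Recommended Day day"));
      }
    }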
________________________________
From: Stuti Awasthi
To: "'user@mahout.apache.org'"
Sent: Thursday, May 30, 2013 8:34 AM
Subject: RE: Feature vector generation from Bag-of-Words

Hey Suneel,

I got stemming working with the SnowballAnalyzer. One more query: to use the Multiset.retainAll(keywords) functionality, all of my keywords must also be run through the Analyzer, or else they won't be retained.
Is my understanding correct?

Thanks
Stuti Awasthi

-----Original Message-----
From: Stuti Awasthi
Sent: Thursday, May 30, 2013 3:59 PM
To: user@mahout.apache.org
Subject: RE: Feature vector generation from Bag-of-Words

Hi Suneel,

Thanks. For point 2, I tried to find out how to achieve this using Lucene but was not able to gather much information.
It would be helpful if you could point me to the relevant links or samples through which I can achieve point 2.

Thanks
Stuti Awasthi

-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com]
Sent: Wednesday, May 22, 2013 6:13 PM
To: user@mahout.apache.org
Subject: Re: Feature vector generation from Bag-of-Words

See inline.

________________________________
From: Stuti Awasthi
To: "'user@mahout.apache.org'"
Sent: Wednesday, May 22, 2013 7:02 AM
Subject: RE: Feature vector generation from Bag-of-Words

Hi Suneel,

I implemented your suggested approach. It was simple to implement, and you made the steps pretty clear. Thank you :) . I have a few queries about creating features using the Multiset:

1. Can't we make the keywords case-insensitive in the multiset, i.e., my keyword may be "Day" while in the document it is "day"?

>> Yes, you can if that's a requirement for you. Convert all keywords to lowercase before storing them in the multiset.

2. Can the multiset contain words that match a keyword pattern rather than the exact keyword, i.e., if the keyword is "Recommend" and the document contains "Recommended", it should still be counted?

>> What you are describing is called 'stemming'. Lucene should be able to help you here.

Any pointers?

Thanks
Stuti Awasthi

-----Original Message-----
From: Stuti Awasthi
Sent: Wednesday, May 22, 2013 12:01 PM
To: user@mahout.apache.org
Subject: RE: Feature vector generation from Bag-of-Words

Thanks Suneel,

I will go through your approach and also learn more about the various APIs you have suggested. I am new to Mahout, so I will need to dig more. :)

In the meantime, I was thinking of an approach like this (a sketch follows below):
1. Create sequence files of the bag of words and the input data in separate documents.
2. For each individual document, loop through the 100 keywords and count the number of times each keyword occurs in the document.
3. Create a RandomAccessSparseVector to store each keyword and its frequency for each document.

This may not be a good approach because of step 2, but it can also be implemented using MapReduce. Please provide your thoughts on this.

Thanks
Stuti
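As a rough illustration of the loop described in steps 2 and 3 above (assuming the documents are already tokenized; the class and method names are made up for the sketch):

    import java.util.List;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public final class KeywordCounting {

      // For each keyword, count its occurrences in the document's terms
      // and store the count at the keyword's position in the vector.
      public static Vector countKeywords(List<String> keywords, List<String> docTerms) {
        Vector v = new RandomAccessSparseVector(keywords.size());
        for (int i = 0; i < keywords.size(); i++) {
          int count = 0;
          for (String term : docTerms) {
            if (term.equals(keywords.get(i))) {
              count++;
            }
          }
          v.setQuick(i, count);
        }
        return v;
      }
    }

This is O(keywords x terms) per document, which is why step 2 feels wasteful; building a Multiset of the document's terms in one pass and looking the keywords up, as in the approach below, avoids the repeated inner scan.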
-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com]
Sent: Tuesday, May 21, 2013 10:21 PM
To: user@mahout.apache.org
Subject: Re: Feature vector generation from Bag-of-Words

It should be easy to convert the pseudocode below into a MapReduce job to scale to a large collection of documents (see the mapper sketch following the pseudocode).

________________________________
From: Suneel Marthi
To: "user@mahout.apache.org"
Sent: Tuesday, May 21, 2013 12:20 PM
Subject: Re: Feature vector generation from Bag-of-Words

Stuti,

Here's how I would do it.

1. Create a collection of the 100 keywords that are of interest.

    Collection<String> keywords = new ArrayList<String>();
    keywords.addAll(/* the 100 keywords */);

2. For each text document, create a Multiset (which is a bag of words) of its terms, retain only the keywords of interest from (1), and encode the result with Mahout's StaticWordValueEncoder:

    // Iterate through all the documents (pseudocode types for document/terms)
    for (Document document : documents) {

      // Create a bag of words for the document
      // (HashMultiset has no public constructor; use the factory method)
      Multiset<String> multiset = HashMultiset.create();

      // Create a RandomAccessSparseVector: 100 features for the 100 keywords
      Vector v = new RandomAccessSparseVector(100);

      for (String term : document.terms) {
        multiset.add(term);
      }

      // Retain only those keywords that are of interest (from step 1)
      multiset.retainAll(keywords);

      // You now have a bag of words containing only the keywords with their
      // term frequencies. Use one of the feature encoders; see Section 14.3
      // of Mahout in Action for a more detailed description of this process.
      FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
      for (Multiset.Entry<String> entry : multiset.entrySet()) {
        encoder.addToVector(entry.getElement(), entry.getCount(), v);
      }
    }
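A sketch of what the map side of that conversion might look like (Hadoop's newer Mapper API; the tab-separated "docId<TAB>text" input layout and the "keywords" configuration property are assumptions made for the example, and the whitespace tokenization stands in for the SnowballAnalyzer step discussed earlier in the thread):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import com.google.common.collect.HashMultiset;
    import com.google.common.collect.Multiset;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // One document per input line: "<docId>\t<text>"; emits docId -> sparse feature vector.
    public class KeywordVectorMapper extends Mapper<LongWritable, Text, Text, VectorWritable> {

      private final Set<String> keywords = new HashSet<String>();

      @Override
      protected void setup(Context context) {
        // Assumed: keywords shipped in the job configuration under "keywords";
        // a DistributedCache file would work just as well.
        keywords.addAll(Arrays.asList(context.getConfiguration().getStrings("keywords")));
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);

        // Crude tokenization; in practice run the text through the same
        // analyzer that produced the keywords.
        Multiset<String> bag =
            HashMultiset.create(Arrays.asList(parts[1].toLowerCase().split("\\s+")));
        bag.retainAll(keywords);

        Vector v = new RandomAccessSparseVector(100); // 100 features for the 100 keywords
        FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
        for (Multiset.Entry<String> entry : bag.entrySet()) {
          encoder.addToVector(entry.getElement(), entry.getCount(), v);
        }
        context.write(new Text(parts[0]), new VectorWritable(v));
      }
    }

With the number of reducers set to zero, the mapper output can go straight to a SequenceFile of Text/VectorWritable pairs, which is the shape Mahout's downstream jobs generally expect.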
________________________________
From: Stuti Awasthi
To: "user@mahout.apache.org"
Sent: Tuesday, May 21, 2013 7:17 AM
Subject: Feature vector generation from Bag-of-Words

Hi all,

I have a query regarding feature vector generation for text documents.
I have read Mahout in Action and understood how to turn a text document into a feature vector weighted by TF or TF-IDF schemes. My use case is a little different from that.

I have a few keywords, say 100, and I want to create the feature vectors of the text documents from only these 100 keywords. So I would like to calculate the frequency of each keyword in each document and generate the feature vector with those frequencies as weights.

Is there an existing way to do this, or will I need to write custom code?

Thanks
Stuti Awasthi