Subject: Re: How VectorIndexer works in Spark ML pipelines
From: Jorge Sánchez
To: VISHNU SUBRAMANIAN
Cc: User <user@spark.apache.org>
Date: Sun, 18 Oct 2015 20:54:36 +0100

Vishnu,

VectorIndexer adds metadata recording which features are categorical and which are continuous, based on a threshold: any feature with more distinct values than the *maxCategories* parameter is treated as continuous, and the rest are indexed as categorical. That metadata helps the learning algorithms, since categorical and continuous features are handled differently.

From the data, I can see the vectors in your features column take many different values. Try using some vectors whose features have only two different values.

Regards.

2015-10-15 10:14 GMT+01:00 VISHNU SUBRAMANIAN <johnfedrickenator@gmail.com>:

> Hi All,
>
> I am trying to use the VectorIndexer (feature extraction) technique
> available in the Spark ML Pipelines.
>
> I ran the example in the documentation.
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(4)
>   .fit(data)
>
> And then I wanted to see what output it generates.
>
> After performing transform on the data set, the output looks like below:
>
> scala> predictions.select("indexedFeatures").take(1).foreach(println)
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
> scala> predictions.select("features").take(1).foreach(println)
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
> I can't understand what is happening. I tried with simple data sets too,
> but got a similar result.
>
> Please help.
>
> Thanks,
>
> Vishnu
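To make the maxCategories rule above concrete, here is a minimal pure-Python sketch of the idea (this is NOT Spark's actual implementation; `fit_vector_indexer` and `transform` are hypothetical helper names, and Spark's real category ordering and handling of 0 differ in detail):

```python
def fit_vector_indexer(rows, max_categories=4):
    """Decide per feature: categorical (<= max_categories distinct values)
    or continuous. Return an index map only for the categorical features.
    Categories are ordered here by sorted value, for illustration only."""
    n_features = len(rows[0])
    maps = {}
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {v: i for i, v in enumerate(distinct)}
    return maps

def transform(rows, maps):
    """Replace categorical values with their category index;
    continuous features pass through unchanged."""
    return [[maps[j][v] if j in maps else v for j, v in enumerate(row)]
            for row in rows]

# Feature 0 is binary (2 distinct values -> categorical);
# feature 1 takes 5 distinct values (> maxCategories -> continuous).
data = [[1.0, 145.0], [0.0, 255.0], [1.0, 211.0], [0.0, 31.0], [1.0, 32.0]]
maps = fit_vector_indexer(data, max_categories=4)
out = transform(data, maps)
# Feature 0 is remapped via {0.0: 0, 1.0: 1}; feature 1 is untouched.
```

Note how a feature that exceeds maxCategories comes out byte-for-byte identical to its input, which would explain why your indexedFeatures column looks the same as features when every feature in the row exceeds the threshold. (The `(692,[indices],[values])` printout in your output is just Spark's sparse-vector representation: size 692, the non-zero positions, and their values.)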