Subject: Re: How VectorIndexer works in Spark ML pipelines
From: Jorge Sánchez
To: VISHNU SUBRAMANIAN
Cc: User <user@spark.apache.org>
Date: Sun, 18 Oct 2015 20:54:36 +0100

Vishnu,

VectorIndexer adds metadata recording which features are categorical and which are continuous, based on a threshold: any feature with more distinct values than the *maxCategories* parameter is treated as continuous, and the rest are indexed as categorical. That metadata helps the learning algorithms, since categorical and continuous features are handled differently.

From the data, I can see the vectors in your features column take many different values. Try using some vectors whose features have only two different values.

Regards.

2015-10-15 10:14 GMT+01:00 VISHNU SUBRAMANIAN <johnfedrickenator@gmail.com>:

> Hi All,
>
> I am trying to use the VectorIndexer (feature extraction) technique
> available in the Spark ML Pipelines.
>
> I ran the example in the documentation.
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(4)
>   .fit(data)
>
> And then I wanted to see what output it generates.
>
> After performing transform on the data set, the output looks like below:
>
> scala> predictions.select("indexedFeatures").take(1).foreach(println)
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
> scala> predictions.select("features").take(1).foreach(println)
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
> I can't understand what is happening. I tried with simple data sets too,
> but got a similar result.
>
> Please help.
>
> Thanks,
>
> Vishnu
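To make the maxCategories rule above concrete, here is a minimal pure-Python sketch of the idea (this is NOT Spark's actual implementation; `fit_vector_indexer` and `transform` are hypothetical helper names, and Spark's real category ordering and handling of 0 differ in detail):

```python
def fit_vector_indexer(rows, max_categories=4):
    """Decide per feature: categorical (<= max_categories distinct values)
    or continuous. Return an index map only for the categorical features.
    Categories are ordered here by sorted value, for illustration only."""
    n_features = len(rows[0])
    maps = {}
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {v: i for i, v in enumerate(distinct)}
    return maps

def transform(rows, maps):
    """Replace categorical values with their category index;
    continuous features pass through unchanged."""
    return [[maps[j][v] if j in maps else v for j, v in enumerate(row)]
            for row in rows]

# Feature 0 is binary (2 distinct values -> categorical);
# feature 1 takes 5 distinct values (> maxCategories -> continuous).
data = [[1.0, 145.0], [0.0, 255.0], [1.0, 211.0], [0.0, 31.0], [1.0, 32.0]]
maps = fit_vector_indexer(data, max_categories=4)
out = transform(data, maps)
# Feature 0 is remapped via {0.0: 0, 1.0: 1}; feature 1 is untouched.
```

Note how a feature that exceeds maxCategories comes out byte-for-byte identical to its input, which would explain why your indexedFeatures column looks the same as features when every feature in the row exceeds the threshold. (The `(692,[indices],[values])` printout in your output is just Spark's sparse-vector representation: size 692, the non-zero positions, and their values.)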