spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashutosh <ashutosh.triv...@iiitb.org>
Subject Re: [MLlib] Contributing Algorithm for Outlier Detection
Date Thu, 13 Nov 2014 18:31:11 GMT
Hi Anant,

Please see the changes.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


I have changed the input format to Vector of String. I think we can also make it generic.


Line 59 & 72 : that counter will not affect in parallelism, Since it only work on one
datapoint. It  only                         does the Indexing of the column.


Rest all side effects have been removed.

​

Thanks,

Ashutosh




________________________________
From: slcclimber [via Apache Spark Developers List] <ml-node+s1001551n9287h73@n3.nabble.com>
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Mayur,
Libsvm format sounds good to me. I could work on writing the tests if that helps you?
Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]</user/SendEmail.jtp?type=node&node=9287&i=0>>
wrote:

Hi Mayur,

Vector data types are implemented using breeze library, it is presented at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found that a vector can only be of 'Double', so it actually restrict the
user.

What are you thoughts on LibSVM format?

Thanks for the comments, I was just trying to get away from those increment /decrement functions,
they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the
code.


Regards,

Ashutosh


________________________________
From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]<http://user/SendEmail.jtp?type=node&node=9286&i=0>>
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

>
> We should take a vector instead giving the user flexibility to decide
> data source/ type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: <a href="tel:%2B1%20%28760%29%20203%203257" value="+17602033257" target="_blank">+1
(760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]<http://user/SendEmail.jtp?type=node&node=9239&i=0>>
wrote:

> Ashutosh,
> I still see a few issues.
> 1. On line 112 you are counting using a counter. Since this will happen in
> a RDD the counter will cause issues. Also that is not good functional style
> to use a filter function with a side effect.
> You could use randomSplit instead. This does not the same thing without the
> side effect.
> 2. Similar shared usage of j in line 102 is going to be an issue as well.
> also hash seed does not need to be sequential it could be randomly
> generated or hashed on the values.
> 3. The compute function and trim scores still runs on a comma separeated
> RDD. We should take a vector instead giving the user flexibility to decide
> data source/ type. what if we want data from hive tables or parquet or JSON
> or avro formats. This is a very restrictive format. With vectors the user
> has the choice of taking in whatever data format and converting them to
> vectors insteda of reading json files creating a csv file and then workig
> on that.
> 4. Similar use of counters in 54 and 65 is an issue.
> Basically the shared state counters is a huge issue that does not scale.
> Since the processing of RDD's is distributed and the value j lives on the
> master.
>
> Anant
>
>
>
> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
> <[hidden email]<http://user/SendEmail.jtp?type=node&node=9239&i=1>>
wrote:
>
> >  Anant,
> >
> > I got rid of those increment/ decrements functions and now code is much
> > cleaner. Please check. All your comments have been looked after.
> >
> >
> >
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> >
> >
> >  _Ashu
> >
> > <
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> >
> >   Outlier-Detection-with-AVF-Spark/OutlierWithAVFModel.scala at master ·
> > codeAshu/Outlier-Detection-with-AVF-Spark · GitHub
> >  Contribute to Outlier-Detection-with-AVF-Spark development by creating
> an
> > account on GitHub.
> >  Read more...
> > <
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> >
> >
> >  ------------------------------
> > *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden
> > email] <http://user/SendEmail.jtp?type=node&node=9083&i=0>>
> > *Sent:* Friday, October 31, 2014 10:09 AM
> > *To:* Ashutosh Trivedi (MT2013030)
> > *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> >
> >
> > You should create a jira ticket to go with it as well.
> > Thanks
> > On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
> <[hidden
> > email] <http://user/SendEmail.jtp?type=node&node=9037&i=0>> wrote:
> >
> >>  ​Okay. I'll try it and post it soon with test case. After that I think
> >> we can go ahead with the PR.
> >>  ------------------------------
> >> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden
> >> email] <http://user/SendEmail.jtp?type=node&node=9036&i=0>>
> >> *Sent:* Friday, October 31, 2014 10:03 AM
> >> *To:* Ashutosh Trivedi (MT2013030)
> >> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> >>
> >>
> >> Ashutosh,
> >> A vector would be a good idea vectors are used very frequently.
> >> Test data is usually stored in the spark/data/mllib folder
> >>  On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
> >> <[hidden email] <http://user/SendEmail.jtp?type=node&node=9035&i=0>>
> >> wrote:
> >>
> >>> Hi Anant,
> >>> sorry for my late reply. Thank you for taking time and reviewing it.
> >>>
> >>> I have few comments on first issue.
> >>>
> >>> You are correct on the string (csv) part. But we can not take input of
> >>> type you mentioned. We calculate frequency in our function. Otherwise
> user
> >>> has to do all this computation. I realize that taking a RDD[Vector]
> would
> >>> be general enough for all. What do you say?
> >>>
> >>> I agree on rest all the issues. I will correct them soon and post it.
> >>> I have a doubt on test cases. Where should I put data while giving test
> >>> scripts? or should i generate synthetic data for testing with in the
> >>> scripts, how does this work?
> >>>
> >>> Regards,
> >>> Ashutosh
> >>>
> >>> ------------------------------
> >>>  If you reply to this email, your message will be added to the
> >>> discussion below:
> >>>
> >>>
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
> >>>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> >>> Detection, click here.
> >>> NAML
> >>> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >>>
> >>
> >>
> >> ------------------------------
> >>  If you reply to this email, your message will be added to the
> >> discussion below:
> >>
> >>
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9035.html
> >>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> >> Detection, click here.
> >> NAML
> >> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >>
> >>
> >> ------------------------------
> >>  If you reply to this email, your message will be added to the
> >> discussion below:
> >>
> >>
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9036.html
> >>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> >> Detection, click here.
> >> NAML
> >> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >>
> >
> >
> > ------------------------------
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9037.html
> >  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> Detection, click
> > here.
> > NAML
> > <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
> >
> > ------------------------------
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9083.html
> >  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> Detection, click
> > here
> > <
> >
> > .
> > NAML
> > <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9239.html
To unsubscribe from [MLlib] Contributing Algorithm for Outlier Detection, click here.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9286.html
To unsubscribe from [MLlib] Contributing Algorithm for Outlier Detection, click here.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>


________________________________
If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9287.html
To unsubscribe from [MLlib] Contributing Algorithm for Outlier Detection, click here<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=8880&code=YXNodXRvc2gudHJpdmVkaUBpaWl0Yi5vcmd8ODg4MHwtMzkzMzE5NzYx>.
NAML<http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9327.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message