spark-dev mailing list archives

From Meethu Mathew <meethu.mat...@flytxt.com>
Subject Re: [MLlib] Contributing Algorithm for Outlier Detection
Date Fri, 14 Nov 2014 04:32:55 GMT
Hi Ashutosh,

Please update the README file. I think the following function call has
changed.

|model = OutlierWithAVFModel.outliers(master:String, input dir:String, percentage:Double)|

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
> Hi Anant,
>
> Please see the changes.
>
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>
>
> I have changed the input format to Vector of String. I think we can also make it generic.
>
>
> Lines 59 and 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the columns.
>
>
> All the other side effects have been removed.
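
The point about the column-index counter can be illustrated with a small sketch. It is not taken from OutlierWithAVFModel.scala: the names `ColumnIndexing`, `indexWithCounter`, and `indexWithZip` are hypothetical, and a plain Scala `Seq` stands in for the single record handled inside an RDD operation. It contrasts a counter that is local to one record with `zipWithIndex`, which yields the same (column, value) pairs with no mutable state.

```scala
object ColumnIndexing {
  // Counter-based indexing: the counter is created per call, so it is
  // confined to one record, but it is still mutable state.
  def indexWithCounter(record: Seq[String]): Seq[(Int, String)] = {
    var j = -1
    record.map { value =>
      j += 1
      (j, value)
    }
  }

  // Side-effect-free equivalent: zipWithIndex pairs each value with its position.
  def indexWithZip(record: Seq[String]): Seq[(Int, String)] =
    record.zipWithIndex.map { case (value, col) => (col, value) }
}
```

Both functions return the same pairs for a strict sequence; the second form simply leaves no counter for a reviewer to worry about.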
>
> Thanks,
>
> Ashutosh
>
>
>
>
> ________________________________
> From: slcclimber [via Apache Spark Developers List] <ml-node+s1001551n9287h73@n3.nabble.com>
> Sent: Tuesday, November 11, 2014 11:46 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>
> Mayur,
> Libsvm format sounds good to me. I could work on writing the tests if that helps you?
> Anant
>
> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
>
> Hi Mayur,
>
> Vector data types are implemented using the Breeze library; they can be found at
>
> .../org/apache/spark/mllib/linalg
>
>
> Anant,
>
> One restriction I found is that a vector can only hold 'Double' values, which restricts the user.
>
> What are your thoughts on the LibSVM format?
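
For reference, a LibSVM line has the form `label index:value index:value ...`, with one-based feature indices in ascending order and zero entries usually omitted. The sketch below, in plain Scala with hypothetical names (`LibSvmSketch`, `encodeColumn`, `toLibSvmLine`), shows one way categorical strings could be mapped to Double codes and written out in that format. It is an illustration under those assumptions, not code from the repository.

```scala
object LibSvmSketch {
  // Map each distinct categorical value to a Double code (1.0, 2.0, ...),
  // in order of first appearance. Codes start at 1.0 so that no real value
  // is confused with an omitted (zero) entry.
  def encodeColumn(values: Seq[String]): Map[String, Double] =
    values.distinct.zipWithIndex.map { case (v, i) => v -> (i + 1).toDouble }.toMap

  // Format one labeled record as a LibSVM line: one-based indices, zeros omitted.
  def toLibSvmLine(label: Double, features: Seq[Double]): String = {
    val entries = features.zipWithIndex.collect {
      case (value, i) if value != 0.0 => s"${i + 1}:$value"
    }
    (label.toString +: entries).mkString(" ")
  }
}
```

With such an encoding the Double-only restriction becomes a preprocessing step for the user rather than a limit on the kind of data that can be scored.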
>
> Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.
>
>
> Regards,
>
> Ashutosh
>
>
> ________________________________
> From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
> Sent: Saturday, November 8, 2014 12:52 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>> We should take a vector instead giving the user flexibility to decide
>> data source/ type
> What do you mean by vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:
>
>> Ashutosh,
>> I still see a few issues.
>> 1. On line 112 you are counting using a counter. Since this will happen in
>> an RDD, the counter will cause issues. Also, it is not good functional style
>> to use a filter function with a side effect.
>> You could use randomSplit instead. This does the same thing without the
>> side effect.
>> 2. Similar shared usage of j on line 102 is going to be an issue as well.
>> Also, the hash seed does not need to be sequential; it could be randomly
>> generated or hashed on the values.
>> 3. The compute function and trim scores still run on a comma-separated
>> RDD. We should take a vector instead, giving the user flexibility to decide
>> the data source/type. What if we want data from Hive tables, or Parquet, JSON,
>> or Avro formats? This is a very restrictive format. With vectors the user
>> has the choice of taking in whatever data format and converting it to
>> vectors, instead of reading JSON files, creating a CSV file, and then working
>> on that.
>> 4. Similar use of counters on lines 54 and 65 is an issue.
>> Basically, the shared-state counters are a huge issue that does not scale,
>> since the processing of RDDs is distributed and the value j lives on the
>> master.
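
As a rough illustration of the points above, the sketch below computes attribute value frequency (AVF) scores with no shared counter. AVF is taken here in its usual formulation, which is an assumption about the repository's code: a record's score is the mean frequency of its attribute values within their columns, and the lowest-scoring records are reported as outliers. Plain Scala collections stand in for an RDD so the example is self-contained, and the names `AvfSketch`, `valueFrequencies`, `avfScore`, and `outliers` are hypothetical.

```scala
object AvfSketch {
  // Each record is a sequence of categorical attribute values.
  type Record = Seq[String]

  // Count how often each (columnIndex, value) pair occurs in the data set.
  // The column index comes from zipWithIndex, so nothing is mutated.
  def valueFrequencies(data: Seq[Record]): Map[(Int, String), Int] =
    data
      .flatMap(record => record.zipWithIndex.map { case (value, col) => (col, value) })
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size }

  // AVF score of one record: the mean frequency of its attribute values.
  def avfScore(record: Record, freq: Map[(Int, String), Int]): Double =
    if (record.isEmpty) 0.0
    else record.zipWithIndex.map { case (value, col) => freq((col, value)).toDouble }.sum / record.size

  // Return the `percentage` percent of records with the lowest AVF scores.
  def outliers(data: Seq[Record], percentage: Double): Seq[Record] = {
    val freq = valueFrequencies(data)
    val k = math.ceil(data.size * percentage / 100.0).toInt
    data.sortBy(record => avfScore(record, freq)).take(k)
  }
}
```

Because every step is a pure transformation of its input, the same shape (emit keyed pairs, count per key, score each record) carries over to distributed collections, where a counter living on the driver would not be updated consistently.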
>>
>> Anant
>>
>>
>>
>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>> <[hidden email]> wrote:
>>
>>>   Anant,
>>>
>>> I got rid of those increment/decrement functions and now the code is much
>>> cleaner. Please check. All your comments have been addressed.
>>>
>>>
>>>
>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>
>>>   _Ashu
>>>
>>>
>>>   ------------------------------
>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>> *Sent:* Friday, October 31, 2014 10:09 AM
>>> *To:* Ashutosh Trivedi (MT2013030)
>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>
>>>
>>> You should create a jira ticket to go with it as well.
>>> Thanks
>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>> <[hidden email]> wrote:
>>>
>>>> Okay. I'll try it and post it soon with a test case. After that I think
>>>> we can go ahead with the PR.
>>>>   ------------------------------
>>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>>> *Sent:* Friday, October 31, 2014 10:03 AM
>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>>
>>>> Ashutosh,
>>>> A vector would be a good idea; vectors are used very frequently.
>>>> Test data is usually stored in the spark/data/mllib folder.
>>>> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>> <[hidden email]> wrote:
>>>>
>>>>> Hi Anant,
>>>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>>>
>>>>> I have a few comments on the first issue.
>>>>>
>>>>> You are correct on the string (CSV) part, but we cannot take input of the
>>>>> type you mentioned. We calculate the frequencies in our function; otherwise
>>>>> the user would have to do all this computation. I realize that taking an
>>>>> RDD[Vector] would be general enough for all. What do you say?
>>>>>
>>>>> I agree on all the remaining issues. I will correct them soon and post it.
>>>>> I have a question about test cases. Where should I put the data when
>>>>> providing test scripts? Or should I generate synthetic data for testing
>>>>> within the scripts? How does this work?
>>>>>
>>>>> Regards,
>>>>> Ashutosh
>>>>>
>>>>> ------------------------------
>>>>>   If you reply to this email, your message will be added to the
>>>>> discussion below:
>>>>>
>>>>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
>>>>>   To unsubscribe from [MLlib] Contributing Algorithm for Outlier
>>>>> Detection, click here.
>>>>> NAML
>>>>> <
>> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
>>>>
>>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>