Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <546585F7.1010809@flytxt.com>
Date: Fri, 14 Nov 2014 10:02:55 +0530
From: Meethu Mathew <meethu.mathew@flytxt.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Ashutosh <ashutosh.trivedi@iiitb.org>,
 "dev@spark.incubator.apache.org" <dev@spark.incubator.apache.org>
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
References: <1414516353788-8992.post@n3.nabble.com>
 <1414729862397-9034.post@n3.nabble.com>
 <CANvqkYM=kC5sKgo4Te1omGmfdBUMtdkVW=MoXSdHNKrHP9cFzg@mail.gmail.com>
 <1414730269133.81923@iiitb.org>
 <CANvqkYNgfR_JdMB-mGNHunnNm_NKPu54Kqc-7tm6agc_m6ZS4g@mail.gmail.com>
 <1415114508803.23734@iiitb.org>
 <CANvqkYO40kJpfmepnQSk+Pm45z5kTRu0Vjng4Lfhm3hsoW+zAw@mail.gmail.com>
 <CAAqHKj606ZV=9_kCBQ44gAwtaKzUyBwkiw2PYeRtB02-qPB7JA@mail.gmail.com>
 <1415729133876.39885@iiitb.org>
 <CANvqkYO4eze1_CWG-bPyyj1KAsDR_dtmFLTPfDWJFOZW6jFZsw@mail.gmail.com>
 <1415903454345.71737@iiitb.org>
In-Reply-To: <1415903454345.71737@iiitb.org>
Content-Type: multipart/alternative;
	boundary="------------080004040801070902090304"

--------------080004040801070902090304
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 8bit

Hi Ashutosh,

Please edit the README file. I think the following function call has
changed:

model = OutlierWithAVFModel.outliers(master:String, input dir:String, percentage:Double)
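
For context, here is a minimal local sketch of what an AVF-style call with a percentage argument might compute. It is not the code in the repository. It assumes each record is a fixed-length vector of categorical strings, that the AVF score is the mean frequency of a record's attribute values, and that percentage means the share of lowest-scoring records to report. The names AvfSketch, scores and outliers below are hypothetical, and plain Scala collections stand in for an RDD.

```scala
object AvfSketch {
  // Assumption: a record is a fixed-length vector of categorical values.
  type Record = Vector[String]

  /** AVF score per record: mean frequency of its values, counted per column. */
  def scores(data: Seq[Record]): Seq[Double] =
    if (data.isEmpty) Seq.empty
    else {
      val numCols = data.head.length
      // One frequency table per column: value -> number of records holding it.
      val freq: Vector[Map[String, Int]] =
        Vector.tabulate(numCols) { c =>
          data.groupBy(rec => rec(c)).map { case (v, rows) => v -> rows.size }
        }
      data.map { rec =>
        rec.indices.map(c => freq(c)(rec(c))).sum.toDouble / numCols
      }
    }

  /** Indices of the `percentage` percent of records with the lowest score. */
  def outliers(data: Seq[Record], percentage: Double): Seq[Int] = {
    val k = math.ceil(data.size * percentage / 100.0).toInt
    scores(data).zipWithIndex.sortBy(_._1).take(k).map(_._2)
  }
}
```

With four records and percentage = 25.0, the sketch reports the single record whose attribute values are rarest.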

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*


On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
> Hi Anant,
>
> Please see the changes.
>
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>
>
> I have changed the input format to Vector of String. I think we can also make it generic.
>
>
> Lines 59 and 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the columns.
>
>
> All other side effects have been removed.
>
>
>
> Thanks,
>
> Ashutosh
>
>
>
>
> ________________________________
> From: slcclimber [via Apache Spark Developers List] <ml-node+s1001551n9287h73@n3.nabble.com>
> Sent: Tuesday, November 11, 2014 11:46 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>
> Mayur,
> Libsvm format sounds good to me. I could work on writing the tests if that helps you?
> Anant
>
> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
>
> Hi Mayur,
>
> Vector data types are implemented using the Breeze library; the code is at
>
> .../org/apache/spark/mllib/linalg
>
>
> Anant,
>
> One restriction I found is that a vector can only hold 'Double' values, so it actually restricts the user.
>
> What are your thoughts on the LibSVM format?
>
> Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.
>
>
> Regards,
>
> Ashutosh
>
>
> ________________________________
> From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
> Sent: Saturday, November 8, 2014 12:52 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>> We should take a vector instead giving the user flexibility to decide
>> data source/ type
> What do you mean by vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:
>
>> Ashutosh,
>> I still see a few issues.
>> 1. On line 112 you are counting using a counter. Since this will happen in
>> an RDD, the counter will cause issues. Also, it is not good functional style
>> to use a filter function with a side effect.
>> You could use randomSplit instead. This does the same thing without the
>> side effect.
>> 2. Similar shared usage of j in line 102 is going to be an issue as well.
>> Also, the hash seed does not need to be sequential; it could be randomly
>> generated or hashed on the values.
>> 3. The compute function and trim scores still run on a comma-separated
>> RDD. We should take a vector instead, giving the user flexibility to decide
>> data source/type. What if we want data from Hive tables or Parquet, JSON,
>> or Avro formats? This is a very restrictive format. With vectors the user
>> has the choice of taking in whatever data format and converting it to
>> vectors, instead of reading JSON files, creating a CSV file, and then working
>> on that.
>> 4. Similar use of counters in lines 54 and 65 is an issue.
>> Basically, the shared-state counters are a huge issue that does not scale,
>> since the processing of RDDs is distributed and the value j lives on the
>> master.
>>
>> Anant
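
The counter concern in the review above can be illustrated with a small sketch. It assumes the counter's only job is to attach a column index to each value of one record; IndexingSketch, indexWithCounter and indexColumns are hypothetical names, not taken from the repository. Plain Scala collections are used so the example runs without Spark. For the distributed cases, Spark's RDD API provides zipWithIndex and randomSplit.

```scala
object IndexingSketch {
  // Side-effecting style: a mutable counter captured by the closure.
  // This works on a local collection, but inside a Spark closure each task
  // would update its own copy of `j`, not the variable on the driver.
  def indexWithCounter(row: Seq[String]): Seq[(Int, String)] = {
    var j = 0
    row.map { v =>
      val pair = (j, v)
      j += 1
      pair
    }
  }

  // Side-effect-free style: the index comes from the collection itself.
  def indexColumns(row: Seq[String]): Seq[(Int, String)] =
    row.zipWithIndex.map { case (v, i) => (i, v) }
}
```

Both functions give the same result on a local record; only the second carries no mutable state that a distributed closure could copy.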
>>
>>
>>
>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>> <[hidden email]> wrote:
>>
>>>   Anant,
>>>
>>> I got rid of those increment/decrement functions and now the code is much
>>> cleaner. Please check. All your comments have been addressed.
>>>
>>>
>>>
>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>
>>>   _Ashu
>>>
>>>   ------------------------------
>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>> *Sent:* Friday, October 31, 2014 10:09 AM
>>> *To:* Ashutosh Trivedi (MT2013030)
>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>
>>>
>>> You should create a jira ticket to go with it as well.
>>> Thanks
>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>> <[hidden email]> wrote:
>>>
>>>>   Okay. I'll try it and post it soon with a test case. After that I think
>>>> we can go ahead with the PR.
>>>>   ------------------------------
>>>> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>>> *Sent:* Friday, October 31, 2014 10:03 AM
>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>>
>>>> Ashutosh,
>>>> A vector would be a good idea; vectors are used very frequently.
>>>> Test data is usually stored in the spark/data/mllib folder.
>>>>   On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>> <[hidden email]> wrote:
>>>>
>>>>> Hi Anant,
>>>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>>>
>>>>> I have a few comments on the first issue.
>>>>>
>>>>> You are correct on the string (CSV) part, but we cannot take input of
>>>>> the type you mentioned. We calculate frequency in our function; otherwise
>>>>> the user has to do all this computation. I realize that taking an
>>>>> RDD[Vector] would be general enough for all. What do you say?
>>>>>
>>>>> I agree on all the remaining issues. I will correct them soon and post it.
>>>>> I have a question about test cases. Where should I put data when providing
>>>>> test scripts? Or should I generate synthetic data for testing within the
>>>>> scripts? How does this work?
>>>>>
>>>>> Regards,
>>>>> Ashutosh
>>>>>
>>>>> ------------------------------
>>>>>   If you reply to this email, your message will be added to the
>>>>> discussion below:
>>>>>
>>>>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
>>>>>   To unsubscribe from [MLlib] Contributing Algorithm for Outlier
>>>>> Detection, click here.
>>>>> NAML
>>>>> <
>> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
>>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>
>
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9327.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


--------------080004040801070902090304--