Message-ID: <546585F7.1010809@flytxt.com>
Date: Fri, 14 Nov 2014 10:02:55 +0530
From: Meethu Mathew
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Ashutosh, "dev@spark.incubator.apache.org"
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
References: <1414516353788-8992.post@n3.nabble.com> <1414729862397-9034.post@n3.nabble.com> <1414730269133.81923@iiitb.org> <1415114508803.23734@iiitb.org> <1415729133876.39885@iiitb.org> <1415903454345.71737@iiitb.org>
In-Reply-To: <1415903454345.71737@iiitb.org>
Content-Type: multipart/alternative; boundary="------------080004040801070902090304"
X-Virus-Checked: Checked by ClamAV on apache.org

--------------080004040801070902090304
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 8bit

Hi Ashutosh,

Please edit the README file. I think the following function call has changed now:

    model = OutlierWithAVFModel.outliers(master:String, input dir:String, percentage:Double)

Regards,
*Meethu Mathew*
*Engineer*
*Flytxt*

On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
> Hi Anant,
>
> Please see the changes.
>
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>
> I have changed the input format to a Vector of String. I think we can also make it generic.
>
> Lines 59 & 72: that counter will not affect parallelism, since it only works on one data point. It only does the indexing of the column.
>
> All the remaining side effects have been removed.
>
> Thanks,
> Ashutosh
>
> ________________________________
> From: slcclimber [via Apache Spark Developers List]
> Sent: Tuesday, November 11, 2014 11:46 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
> Mayur,
> Libsvm format sounds good to me. I could work on writing the tests if that helps you?
> Anant
>
> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
>
> Hi Mayur,
>
> Vector data types are implemented using the Breeze library; they are found at
>
> .../org/apache/spark/mllib/linalg
>
> Anant,
>
> One restriction I found is that a vector can only be of 'Double', so it actually restricts the user.
>
> What are your thoughts on the LibSVM format?
>
> Thanks for the comments. I was just trying to get away from those increment/decrement functions; they look ugly. Points are noted. I'll try to fix them soon. Tests are also required for the code.
>
> Regards,
> Ashutosh
>
> ________________________________
> From: Mayur Rustagi [via Apache Spark Developers List]
> Sent: Saturday, November 8, 2014 12:52 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>> We should take a vector instead, giving the user flexibility to decide the
>> data source/type.
> What do you mean by vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:
>
>> Ashutosh,
>> I still see a few issues.
>> 1. On line 112 you are counting using a counter. Since this will happen in
>> an RDD, the counter will cause issues. Also, it is not good functional style
>> to use a filter function with a side effect.
>> You could use randomSplit instead. This does the same thing without the
>> side effect.
>> 2. Similar shared usage of j in line 102 is going to be an issue as well.
>> Also, the hash seed does not need to be sequential; it could be randomly
>> generated or hashed on the values.
>> 3. The compute function and trim scores still run on a comma-separated
>> RDD. We should take a vector instead, giving the user flexibility to decide the
>> data source/type. What if we want data from Hive tables or Parquet or JSON
>> or Avro formats? This is a very restrictive format. With vectors the user
>> has the choice of taking in whatever data format and converting it to
>> vectors, instead of reading JSON files, creating a CSV file, and then working
>> on that.
>> 4. Similar use of counters in 54 and 65 is an issue.
>> Basically, the shared-state counters are a huge issue that does not scale,
>> since the processing of RDDs is distributed and the value j lives on the
>> master.
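A note on points 1, 2 and 4 above, as a minimal sketch rather than the code in the repository: column indexing and frequency counting can be written without any shared counter. The example uses plain Scala collections so that it runs without Spark. The object name AvfSketch, the helper names, and the reading of percentage as a value between 0 and 100 are assumptions made for illustration. The score is assumed to be the usual AVF definition, the mean frequency of a row's attribute values, with the lowest-scoring rows reported as outliers.

```scala
object AvfSketch {
  // Hypothetical helper: counts how often each (columnIndex, value) pair occurs.
  // zipWithIndex supplies the column index, so no mutable counter is needed.
  def frequencies(rows: Seq[Vector[String]]): Map[(Int, String), Int] =
    rows
      .flatMap(row => row.zipWithIndex.map { case (value, col) => (col, value) })
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }

  // AVF score of one row: the mean frequency of its attribute values.
  def score(row: Vector[String], freq: Map[(Int, String), Int]): Double =
    row.zipWithIndex.map { case (value, col) => freq((col, value)) }.sum.toDouble / row.size

  // Returns the given percentage (assumed 0 to 100) of rows with the lowest scores.
  def outliers(rows: Seq[Vector[String]], percentage: Double): Seq[Vector[String]] = {
    val freq = frequencies(rows)
    val keep = math.ceil(rows.size * percentage / 100.0).toInt
    rows.sortBy(row => score(row, freq)).take(keep)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: the last row has two rare values.
    val data = Seq(
      Vector("red", "small"),
      Vector("red", "small"),
      Vector("red", "large"),
      Vector("blue", "huge")
    )
    println(outliers(data, 25.0))
  }
}
```

On an RDD the same shape should carry over: zipWithIndex on each row stays local to that row, and the counting step becomes a map to ((column, value), 1) followed by reduceByKey, so nothing mutable is captured from the driver.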
>>
>> Anant
>>
>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>> <[hidden email]> wrote:
>>
>>> Anant,
>>>
>>> I got rid of those increment/decrement functions and now the code is much
>>> cleaner. Please check. All your comments have been addressed.
>>>
>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>
>>> _Ashu
>>>
>>> ------------------------------
>>> *From:* slcclimber [via Apache Spark Developers List] <[hidden email]>
>>> *Sent:* Friday, October 31, 2014 10:09 AM
>>> *To:* Ashutosh Trivedi (MT2013030)
>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>
>>> You should create a jira ticket to go with it as well.
>>> Thanks
>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>> <[hidden email]> wrote:
>>>
>>>> Okay. I'll try it and post it soon with a test case. After that I think
>>>> we can go ahead with the PR.
>>>> ------------------------------
>>>> *From:* slcclimber [via Apache Spark Developers List] <[hidden email]>
>>>> *Sent:* Friday, October 31, 2014 10:03 AM
>>>> *To:* Ashutosh Trivedi (MT2013030)
>>>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>> Ashutosh,
>>>> A vector would be a good idea; vectors are used very frequently.
>>>> Test data is usually stored in the spark/data/mllib folder.
>>>> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>> <[hidden email]> wrote:
>>>>
>>>>> Hi Anant,
>>>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>>>
>>>>> I have a few comments on the first issue.
>>>>>
>>>>> You are correct on the string (CSV) part. But we cannot take input of
>>>>> the type you mentioned. We calculate the frequency in our function; otherwise
>>>>> the user has to do all this computation. I realize that taking an RDD[Vector]
>>>>> would be general enough for all. What do you say?
>>>>>
>>>>> I agree on all the remaining issues. I will correct them soon and post it.
>>>>> I have a question about test cases. Where should I put data when providing test
>>>>> scripts? Or should I generate synthetic data for testing within the
>>>>> scripts? How does this work?
>>>>>
>>>>> Regards,
>>>>> Ashutosh
>>>>>
>>>>> ------------------------------
>>>>> If you reply to this email, your message will be added to the
>>>>> discussion below:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
>>>>> To unsubscribe from [MLlib] Contributing Algorithm for Outlier
>>>>> Detection, click here.
>> --
>> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
>> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
--------------080004040801070902090304--