From: Deron Eriksson
Date: Fri, 21 Oct 2016 14:25:22 -0700
Subject: Re: use of
 systemml-0.10.0.incubating.jar
To: dev@systemml.incubator.apache.org

Hi James,

Thank you for the great questions! I think some of the issues you are experiencing stem from a failure on our part to convey this information clearly. The good news is that a tremendous amount of effort and focus is currently being directed at fixing our website and documentation. We also have very significant upcoming releases (we are just finishing our 0.11.0 voting).

1) Here is some background to help. The main jar ("systemml-0.10.0.incubating.jar") is typically used to perform scalable machine learning across a Spark or Hadoop cluster. Spark and Hadoop each ship with a large number of jars (from a Maven viewpoint, these are treated as provided dependencies). In addition, SystemML needs some libraries that Spark and Hadoop do not provide (Apache Wink, some ANTLR classes, etc.), so SystemML treats these as compile-scope dependencies and bundles them into the main jar. That way, if you would like to run SystemML on Spark or Hadoop, you only need the single SystemML jar, as in these examples:

  $SPARK_HOME/bin/spark-submit systemml-0.10.0.incubating.jar -s "print('hello world');" -exec hybrid_spark

  hadoop jar systemml-0.10.0.incubating.jar -s "print('hello world');"

So I think the compile-scope dependencies haven't been shaded because the main jar typically runs on Spark or Hadoop rather than being used as a library. Shading to relocate the namespaces and avoid collisions is a great idea for the case where the SystemML jar is used as a library.

2) One of the ideas behind SystemML is the ability to easily customize scalable machine learning algorithms.
We have .tar.gz and .zip artifacts that can be unpacked to obtain the scripts as text files, which can easily be modified. However, we also package the scripts into the jar files in case someone wants to run them without modifying them. The Connection class is part of the JMLC API (see http://apache.github.io/incubator-systemml/jmlc.html), one of several APIs that can be used to run SystemML. This API is fairly specialized; I believe that to access a script inside the jar with it, you need to call getResourceAsStream and read the script as an InputStream.

However, if you would like a programmatic API to SystemML, I would recommend the new MLContext API (0.10.0 contains an old MLContext API; the very-soon-to-be-released 0.11.0 contains the completely redesigned MLContext API). The new MLContext API features many conveniences, such as ScriptFactory.dmlFromResource(), which lets you easily read a DML file from the SystemML jar. For more information about this API, see http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html

3) As a Java developer with a lot of Maven experience, my first inclination when working with SystemML was to try to use the main jar as a library, and I believe you are having the same experience I did. Because of the way the project is structured, using SystemML as a library isn't perhaps as easy as it should be. Here are the steps I just tried out to use the latest SystemML project as a library (using the new MLContext API):

A) Check out the latest project and install the snapshot artifacts in the local Maven repo:

  mvn clean install -P distribution -DskipTests

B) Create a basic Java Maven example project with the SystemML snapshot dependency. Since SystemML treats most dependencies as provided scope, I'll re-specify the Spark dependencies with default (compile) scope in my example project's pom.xml:
  <dependency>
    <groupId>org.apache.systemml</groupId>
    <artifactId>systemml</artifactId>
    <version>0.12.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>

C) Create a Java class to run an algorithm on SystemML using the new MLContext API. This example reads the Univar-Stats.dml script from the jar file and runs the algorithm on the Haberman dataset, outputting the results to the console for viewing.

  package org.apache.systemml.example;

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.sysml.api.mlcontext.MLContext;
  import org.apache.sysml.api.mlcontext.Script;
  import org.apache.sysml.api.mlcontext.ScriptFactory;

  public class MLContextExample {

      public static void main(String[] args) throws Exception {
          SparkConf conf = new SparkConf().setAppName("MLContextExample").setMaster("local");
          JavaSparkContext sc = new JavaSparkContext(conf);
          MLContext ml = new MLContext(sc);

          Script uni = ScriptFactory.dmlFromResource("/scripts/algorithms/Univar-Stats.dml");
          String habermanUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data";
          uni.in("A", new java.net.URL(habermanUrl));
          List<String> list = new ArrayList<String>();
          list.add("1.0,1.0,1.0,2.0");
          JavaRDD<String> typesRDD = sc.parallelize(list);
          uni.in("K", typesRDD);
          uni.in("$CONSOLE_OUTPUT", true);
          ml.execute(uni);
      }
  }

I believe the JMLC API was originally designed to be a lightweight API; however, it currently requires at least the Hadoop dependencies. Since a primary focus of SystemML is to distribute machine learning across Spark and Hadoop clusters, it typically requires a significant number of transitive dependencies to accomplish this.

I hope that helps.

Deron

On Fri, Oct 21, 2016 at 9:24 AM, Dyer, James wrote:

> Taking a look at "systemml-0.10.0.incubating.jar" from maven-central...
>
> 1.
> Looks like we have code embedded here under other projects' namespaces:
> org.apache.wink, org.antlr, org.abego, com.google.common. Shouldn't we be
> using shade to re-namespace these so users do not have potential clashes?
>
> 2. I see the .dml files are included in the .jar under "scripts", but I
> am not sure how to load and use these with an oasaj.Connection. Is there
> something I am missing, or is this a to-do?
>
> 3. Including "org.apache.systemml:systemml:0.10.0-incubating" in my
> project's POM did not seem to pull in any transitive dependencies, but
> just to instantiate an oasaj.Connection, it needed hadoop-common and
> hadoop-mapreduce-client-common. Is this an oversight, or am I using the
> jar in the wrong way? Also, is there any plan to remove these
> dependencies? Ideally, using the Java connector wouldn't need to pull in
> a significant portion of Hadoop.
>
> James Dyer
> Ingram Content Group
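For question 2 above, the getResourceAsStream approach Deron describes can be sketched as follows. This is a minimal JDK-only sketch: the class name `DmlResourceReader` and its `readAll` helper are illustrative, and the SystemML-specific JMLC calls (`Connection`, `prepareScript`) are shown only as comments, since they require the SystemML jar on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class DmlResourceReader {

    // Drain an InputStream and decode the bytes as a single UTF-8 String.
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // With the SystemML jar on the classpath, a bundled script can be
        // read straight from the jar (path taken from Deron's example):
        //
        //   InputStream in = DmlResourceReader.class
        //       .getResourceAsStream("/scripts/algorithms/Univar-Stats.dml");
        //   String script = readAll(in);
        //   Connection conn = new Connection();  // JMLC API
        //   PreparedScript ps = conn.prepareScript(script, inputs, outputs);
        //
        // Stand-in stream so this sketch runs without the SystemML jar:
        InputStream demo = new ByteArrayInputStream(
                "print('hello world');".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(demo)); // prints: print('hello world');
    }
}
```

The point of reading the whole stream before decoding is that DML scripts are plain UTF-8 text, and decoding once avoids splitting a multi-byte character across buffer boundaries.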