systemml-dev mailing list archives

From Deron Eriksson <deroneriks...@gmail.com>
Subject Re: [DISCUSS] Project Roadmap
Date Fri, 01 Jan 2016 23:22:47 GMT
Hi Matthias,

I agree about the JIRA situation. I am writing notes in a notebook to keep
track of what I am working on. I was hopeful that Alan's comment from Dec
17th would help with https://issues.apache.org/jira/browse/INFRA-10714. If
the missing fields can't be imported automatically, perhaps we can manually
update the missing fields and move forward.

As for specific algorithms, I'm not well-versed enough in the ML field to
make worthwhile suggestions without significant research, but I can think
of some qualities that I would look for in candidate algorithms:

* Is there a strong demand/need for the algorithm? For example, in a very
general sense, if a Google search is done for the algorithm, do we get back
10 hits or 10 million hits? Of course, in the case of a novel algorithm, no
results would be returned because the world doesn't even know that it needs
this algorithm yet.

* How much effort is required for the algorithm implementation? 10 lines of
DML vs 1000 lines of DML makes a huge difference.

* Does a high-performance, scalable version already exist, such as in Spark
MLlib? If a great implementation already exists, perhaps another algorithm
should be chosen, unless a strong need is seen for customizability of that
algorithm.

* Is an algorithm so ubiquitous that a toolkit of DML algorithms would seem
incomplete without it?

* Does an open-source R implementation already exist? If so, it could serve
as a useful starting point to a DML implementation for distributed
computing.

* One that interests me personally: does the algorithm solve an interesting
problem whose results can be presented in a way that has sensory impact?
This is a "wow" factor. Imagine presenting results at a conference and
having the audience murmur because they're impressed. Pictures and graphs
make a compelling case.
 (1) For example, think about something like facial recognition. What if an
algorithm is used in applications that let you do the following queries:
     Find pictures of me.
     Find pictures of people that look like me.
     Where on the planet do people look most like me?
     If I exercise, what will I look like in 20 years?
     How old do I currently look?
     What historical figure do I look most like?
     What movie character do I look most like?
  To me, answers to questions like these have a certain "wow" about them
because of the visuals that can be tied to them.
 (2) As another example, I saw Fred do a presentation regarding Poisson
Nonnegative Matrix Factorization and thought the graphical presentation of
the results was amazing and compelling. His graphs conveyed both the
accuracy and scalability of the DML algorithm, in addition to SystemML's
customizability applied to a real-world business case. It also showcased
the power of DML in a very compact piece of code.
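
To give a rough sense of how compact such an algorithm can be, here is a
minimal NumPy sketch of Poisson NMF using the standard multiplicative
updates for the Poisson (KL-divergence) objective. It is not the DML script
from Fred's presentation, and it is not distributed. The names poisson_nmf
and poisson_loss, the parameters, and the defaults are all assumptions made
for illustration.

```python
import numpy as np


def poisson_nmf(V, rank, iters=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix V (m x n) as W @ H.

    Illustrative sketch only: uses multiplicative updates that reduce the
    Poisson negative log-likelihood. All names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps  # m x rank, strictly positive start
    H = rng.random((rank, n)) + eps  # rank x n, strictly positive start
    for _ in range(iters):
        # Update H: H * (W^T (V / WH)) / (column sums of W)
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        # Update W: W * ((V / WH) H^T) / (row sums of H)
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def poisson_loss(V, W, H, eps=1e-9):
    """Poisson negative log-likelihood of V given W @ H, up to a constant."""
    WH = W @ H + eps
    return float(np.sum(WH - V * np.log(WH)))
```

Calling poisson_nmf(V, rank=2) on a small matrix of counts returns the two
nonnegative factors, and poisson_loss can be compared before and after the
iterations to check that the fit improved.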

Deron



On Thu, Dec 31, 2015 at 9:36 AM, Matthias Boehm <mboehm@us.ibm.com> wrote:

> That's a good point Deron - we will incorporate these tasks into the road
> map. Additionally, we should also include a list of new algorithms. Any
> suggestions?
>
> Furthermore, I'd like to have all JIRAs created by mid January. If the
> infra ticket is not resolved by then, I would rather start with a clean
> JIRA than waiting for this any longer.
>
> Regards
> Matthias
>
>
> From: Deron Eriksson <deroneriksson@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/30/2015 05:58 PM
> Subject: Re: [DISCUSS] Project Roadmap
> ------------------------------
>
>
>
> Hi,
>
> I would like to suggest some documentation/usability/code tasks for the
> 2016 SystemML roadmap. The primary focus of these goals is to lower the
> barrier to entry to SystemML for these groups: (1) Users without a data
> science/ML background who want to try SystemML, (2) Data scientists who
> want to run, modify, and create DML/PyDML scripts, (3) Developers who want
> to contribute code to the project, and (4) Spark community who want to use
> the MLContext API or Spark Batch Mode.
>
> Tasks:
>
> * Non-mathematical practical description of the purpose of each algorithm
> and real-world examples of problems that each algorithm solves.
>
> * Examples showing the conversion of real-world data sets (Wikipedia
> database, images, log files, Twitter messages, etc) to matrix-based
> representations for use in SystemML.
>
> * Working one-line examples of invoking each algorithm on an existing small
> data set (The user can copy/paste this single line and it runs). This means
> creating working example data files so that the user doesn't need to. These
> data files can be in the SystemML project, in another project, or they can
> be deployed to a web server and SystemML can read the data sets from URLs.
>
> * DML Cookbook to give script writers the DML building blocks they need.
>
> * DML Language Reference completely up-to-date.
>
> * PyDML Language Reference converted to markdown, clean mirror of DML
> Language Reference, and up-to-date.
>
> * Document DML algorithm best practices into programming guide (especially,
> how to write algorithms that scale efficiently).
>
> * Structure documentation to more clearly indicate the ways to invoke
> SystemML.
>
> * Identify heavily used classes/methods (run test suite with a profiler)
> and ensure these classes/methods have Javadocs and are efficient.
>
> * Create printMatrix() function to allow a user doing prototyping to see a
> matrix or a subset of a matrix in the console rather than having to write
> to a file and open the file to see the result.
>
> * If a DML function doesn't return a value, don't require an lvalue when
> calling the function.
>
> * Spark Batch Mode clearly documented.
>
> * Very thoroughly Javadoc the MLContext API (MLContext and related
> classes/methods) since it is a programmatic interface with enormous
> potential for the Spark community.
>
> * Address differences in data representations between Spark (RDD/DataFrame)
> and SystemML (binary block). Determine solution to give best performance
> when working on a large distributed data set while optimizing the
> capabilities of Spark and SystemML. Is DataFrame-to-binary-block conversion
> needed or is it possible to use a single format and avoid the data
> conversion cost?
>
> * Enhanced Spark integration, for instance ML Pipeline integration via Java
> or Scala algorithm wrappers.
>
> * Ensure documentation allows a user to download SystemML and run a 'Hello
> World' DML example and an actual algorithm in 5 minutes or less.
>
> * IDE tools such as DML editor that allows code completion.
>
> * Promote SystemML in the user community:
>  (1) activity on mailing lists
>  (2) talks at conferences
>  (3) academic papers
>  (4) blog posts
>  (5) post information to forums such as Stack Overflow
>
> Deron
>
>
> On Mon, Dec 21, 2015 at 3:09 AM, Matthias Boehm <mboehm@us.ibm.com> wrote:
>
> > From my perspective, our roadmap for 2016 should cover the following
> > SystemML engine extensions with regard to runtime (R), optimizer (O), as
> > well as language and tools (L). Each sub-bullet in the following list will
> > be further broken down into multiple JIRAs.
> >
> > R1) Extended Scale-Up Backend
> > * Support for large dense matrix blocks >16GB
> > * Extended multi-threaded operations (e.g., ctable, aggregate)
> > * NUMA-awareness (partitioning and multi-threaded operations)
> > * Extended update-in-place support
> >
> > R2) Generalized Matrix Block Library
> > * Investigation interface design (abstraction)
> > * Boolean matrices and operations
> > * Different types of sparse matrix blocks
> > * Additional physical runtime operators
> >
> > R3) HW Accelerators / Low-Level Optimizations
> > * Exploit GPU BLAS libraries (integration)
> > * Custom GPU kernels for complex operator patterns
> > * Low-level optimizations (source code gen, compression)
> >
> > O1) Global Program Optimization
> > * Global data flow optimization (rewrites, holistic)
> > * Code motion (for cse, program block merge)
> > * Advanced loop vectorization (common patterns)
> > * Advanced function inlining (inlining multi-block functions)
> > * Extended inter-procedural analysis (independent constant propagation)
> >
> > O2) Cost Model
> > * Update memory budgets wrt Spark 1.6 dynamic memory management
> > * Extended runtime cost model for Spark (incl lazy evaluation)
> > * Extended execution type selection based on FLOPs
> >
> > O3) Dynamic Rewrites
> > * Extended matrix mult chain opt (sparsity, rewrites, ops)
> > * Rewrites exploiting additional statistics (e.g., min/max)
> >
> > O4) Optimizer Support R2/R3
> > * Extended memory estimates for R2/R3
> > * Type inference for matrix operations
> > * Extended cost model and operator selection
> >
> > L1) Extended Spark Interfaces
> > * Hardening MLContext (config, lazy eval, cleanup)
> > * Extended Spark ML wrappers for all algorithms
> > * Investigation of embedded DSL with sufficient optimization scope
> >
> > L2) New/Extended Builtin Functions
> > * Second order functions (apply), incl optimizer/runtime support
> > * Generalization of existing functions from vectors to matrices
> > * Additional builtin functions (e.g., var, sd, rev, rep, sign, etc)
> >
> > L3) Extended Dev Tools
> > * Extended statistics output (e.g., wrt Spark lazy evaluation)
> > * Extended benchmarking (data generators, test suites, etc)
> >
> > Once we create the individual JIRAs, we should also include a list of new
> > algorithms as well as additional documentation guides.
> >
> >
> > Regards,
> > Matthias
> >
> >
> >
> > From: Luciano Resende <luckbr1975@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 11/20/2015 01:50 AM
> > Subject: [DISCUSS] Project Roadmap
> > ------------------------------
> >
> >
> >
> > Now that we are done with our 0.8.0 (non-apache) Release, and have most of
> > our infrastructure in place at Apache, I would like to start some
> > discussion around what are some high level items you see we could be
> > working on the short/medium term, and start building a Roadmap, so new
> > contributors can easily find areas of their interest to start contributing.
> >
> > Let's have items listed on this thread, and once we have our JIRA
> > available, we start updating it there.
> >
> > Thanks
> >
> > --
> > Luciano Resende
> > http://people.apache.org/~lresende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
> >
> >
> >
>
>
>
