From: lewis john mcgibbney
Date: Wed, 11 Oct 2017 11:22:20 -0700
Subject: [DISCUSS] Accept Science Data Analytics Platform (SDAP) into Apache Incubator
To: general@incubator.apache.org
Cc: "Huang, Thomas (398J)"
Content-Type: multipart/alternative; boundary="f403043ed42c2a8133055b4981ed"
archived-at: Wed, 11 Oct 2017 18:22:26 -0000

--f403043ed42c2a8133055b4981ed
Content-Type: text/plain; charset="UTF-8"

Hi Folks,

I would like to open a DISCUSS thread on the topic of accepting the Science Data Analytics Platform (SDAP) <https://wiki.apache.org/incubator/SDAPProposal> project into the Incubator. I am CC'ing Thomas Huang from NASA JPL, with whom I have been working to build community around a kick-ass set of software projects under the SDAP umbrella.

At this stage we would very much appreciate critical feedback from the general@ community. We are also open to mentors who may have an interest in the project proposal.

The proposal is pasted below.

Thanks in advance,
Lewis

= Abstract =
The Science Data Analytics Platform (SDAP) establishes an integrated data analytic center for Big Science problems. It focuses on technology integration, advancement and maturity.

= Proposal =
SDAP currently represents a collaboration between NASA Jet Propulsion Laboratory (JPL), Florida State University (FSU), the National Center for Atmospheric Research (NCAR), and George Mason University (GMU). SDAP brings together a number of big data technologies, including the NASA-funded OceanXtremes (anomaly detection and ocean science), NEXUS (deep data analytic platform), DOMS (distributed in-situ to satellite matchup), MUDROD (search relevancy and discovery) and VQSS (Virtualized Quality Screening Service), under a single umbrella. VQSS will not be included in the original Incubator proposal; however, it is anticipated that a future source code donation will cover VQSS.

= Background and Rationale =
SDAP is a software solution currently geared to better enable scientists involved in advancing the study of the Earth's physical oceanography. With increasing global temperature, warming of the ocean, and melting ice sheets and glaciers, the impacts can be observed in changes in anomalous ocean temperature and circulation patterns, in increasing extreme weather events and stronger/more frequent hurricanes, and in sea level rise and storm surges affecting coastlines, and may involve drastic changes and shifts in marine ecosystems. Ocean science communities rely on data distributed through data centers such as JPL's Physical Oceanography Distributed Active Archive Center (PO.DAAC) to conduct their research. In typical investigations, oceanographers follow a traditional workflow for using datasets: search, evaluate, download, and apply tools and algorithms to look for trends. While this workflow has worked very well historically for the oceanographic community, it cannot scale if the research involves massive amounts of data. NASA's Surface Water and Ocean Topography (SWOT) mission, scheduled to launch in April of 2021, is expected to generate over 20 PB of data for a nominal 3-year mission. This will challenge all existing NASA Earth Science data archival/distribution paradigms. It will no longer be feasible for Earth scientists to download and analyze such volumes of data.

SDAP was therefore developed primarily as a Web-service platform for big ocean data science at the PO.DAAC, with open source solutions used to enable fast analysis of oceanographic data. SDAP has been developed collaboratively between JPL, FSU, NCAR, and GMU and is rapidly maturing to become the generic platform for the next generation of big science data solutions.
The platform is an orchestration of several previously funded NASA big ocean data solutions using cloud technology, which include data analysis (NEXUS), anomaly detection (OceanXtremes), matchup (DOMS), subsetting, discovery (MUDROD), and visualization (VQSS). SDAP will enable web-accessible, fast data analysis directly on huge scientific data archives to minimize data movement and provide access, including subsetting, only to the relevant data.

= Science Data Analytics Platform Project Overview =
SDAP consists of several loosely coupled, independently functioning sub-projects. The graphic below displays an overview of how these sub-projects fuse together. N.B. Although the graphic uses terminology relating to OceanWorks, the SDAP architecture is essentially identical.

{{attachment:sdap.png}}

== OceanXtremes ==
Oceanographic Data-Intensive Anomaly Detection and Analysis Portal. An application that allows you to view imagery and perform analysis on sea level rise data.

'''Objective'''

Develop an anomaly detection system which identifies items, events or observations which do not conform to an expected pattern.
 * Mature and test domain-specific, multi-scale anomaly and feature detection algorithms.
 * Identify unexpected correlations between key measured variables.

Demonstrate value of technologies in this service:
 * Adapted Map-Reduce data mining.
 * Algorithm profiling service.
 * Shared discovery and exploration search tools.
 * Automatic notification of events of interest.
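
As a rough illustration of the anomaly detection objective described above, the sketch below flags observations that deviate strongly from a baseline using a simple z-score test. This is not OceanXtremes code: the function name `find_anomalies`, the threshold, and the sea surface temperature values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline, threshold=3.0):
    """Return (index, value, z_score) for values far from the baseline mean.

    A value is flagged when it lies more than `threshold` sample standard
    deviations away from the mean of `baseline`.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        raise ValueError("baseline has no variance; z-scores are undefined")

    anomalies = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, v, z))
    return anomalies


# Hypothetical sea surface temperatures (degrees C), for illustration only.
baseline_sst = [18.1, 18.3, 17.9, 18.0, 18.2, 18.1, 17.8, 18.4]
observed_sst = [18.0, 18.2, 21.5, 18.1]

flagged = find_anomalies(observed_sst, baseline_sst)
for index, value, z in flagged:
    print(f"observation {index}: {value} C (z = {z:.1f})")
```

A real system would presumably replace the flat baseline with an expected pattern that varies by location and season; the z-score test is only the smallest example of detecting values that do not conform to an expected pattern.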

== NEXUS ==
NEXUS is an emerging technology developed at JPL.
 * A Cloud-based/Cluster-based data platform that performs scalable handling of observational parameter analysis and is designed to scale horizontally
 * Leverages a high-performance indexed, temporal, and geospatial search solution
 * Breaks data products into small chunks and stores them in a Cloud-based data store

''Data Volumes Exploding''
 * SWOT mission is coming
 * File I/O is slow

''Scalable Store & Compute is Available''
 * NoSQL cluster databases
 * Parallel compute, in-memory map-reduce
 * Bring Compute to Highly-Accessible Data (using Hybrid Cloud)

''Pre-Chunk and Summarize Key Variables''
 * Easy statistics instantly (milliseconds)
 * Harder statistics on-demand (in seconds)
 * Visualize original data (layers) on a map quickly

== DOMS ==
The Distributed Oceanographic Match-Up Service (DOMS) is designed to reconcile satellite and in situ datasets in support of NASA's Earth Science mission. The service will provide a mechanism for users to input a series of geospatial references for satellite observations and receive the in situ observations that are matched to the satellite data within a selectable temporal and spatial domain. DOMS includes several characteristic in situ and satellite observation datasets, with an initial focus on salinity, sea temperature, and winds. DOMS will be used by the marine and satellite research communities to support a range of activities, and several use cases will be described. The service is designed to provide a community-accessible tool that dynamically delivers matched data and allows the scientist to work only with the subset of data where the matches exist.

== MUDROD ==
Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery and Access

Data discovery accuracy is a challenging topic for both Earth science and other domains.
This is especially true for scientific data sets that are not as popular as Amazon or Google data. MUDROD is focused on mining oceanic knowledge from the PO.DAAC user log files to improve the end user data discovery experience at PO.DAAC. There are three steps in the research: a) extracting the oceanographic semantics from three resources: SWEET, the GCMD ontology, and the keywords used by end users for searching PO.DAAC datasets, b) mining the linkage among different vocabularies based on user data discovery sessions, and c) building the linkage among vocabularies using a comprehensive approach that considers domain de facto standards, e.g., SWEET and GCMD, and the knowledge mined from the log files. The semantics are used to improve data discovery by ranking results, navigating among vocabularies, and recommending data based on user searches.

= Current Status =
All components of SDAP were originally designed and developed under grants from the NASA-funded Advanced Information Systems and Technologies (AIST) program. The initiative to bring the components together under the SDAP umbrella was granted through an AIST-funded follow-on grant which will run for another ~18 months. Currently no projects have made official releases, so, outside of community building, making releases will be our primary Incubating goal. All SDAP source code is currently publicly available and licensed under the ALv2.0.

= Meritocracy =
The current developers are familiar with meritocratic open source development at Apache. The SDAP team consumes Apache products heavily, with members being part of several Apache user communities. SDAP itself has critical dependencies upon Apache products. Lewis McGibbney (JPL employee), a Member of the ASF and V.P. of Apache Any23, Gora PMC Nutch, Tika, OODT, OCW, etc., is championing the effort to bring SDAP into and through the Apache Incubator and has been evangelizing the Apache Way to the current SDAP contributors such that the meritocratic process is well understood and followed. Apache was chosen specifically because we want to encourage this style of community development for the project and for it to sustain SDAP as it becomes the generic platform for the next generation of big science data solutions.

= Community =
The SDAP project is a fairly new effort and our community is not yet firmly established. The initial committers on the SDAP roster have only recently come together as a unified team; however, there is a large degree of synergy between constituent members at JPL, FSU, NCAR, and GMU. Therefore, community building and publicity continue to be a major thrust. With the activity and exposure regularly attained by several community members, we hope to grow the SDAP presence in and across several (scientific) forums. The SDAP technology is generating interest within communities such as the Earth Science Information Partners (ESIP), the American Geophysical Union (AGU) and a plethora of science meetings around the globe. This, we hope, will further contribute towards the possibility of SDAP being used across Government Agencies such as NASA, NOAA, USGS, EPA, DOI, etc. as well as by researchers and students in academic institutions around the globe. During incubation, we will explicitly seek to increase our adoption, with SDAP already being featured on the agenda for several high-profile, globally significant scientific conferences and meetings.

= Core Developers =
The current set of core developers is relatively small, including full-time developers and students from across JPL, FSU, NCAR, and GMU.
Initial community management and participation will be distributed across the entire team, most of whom have been involved with the constituent projects for less than two years.

= Alignment =
All SDAP code is licensed under the Apache License v2.0.

= Known Risks =
== Orphaned products ==
There are currently no orphaned products. Each component of SDAP has dedicated personnel leading and participating in its ongoing development. Additionally, there is substantial collaboration between projects, facilitated by regular project meetings which are specific to the initial member entities and focused on advancing physical oceanographic science.

== Inexperience with Open Source ==
JPL (in particular Lewis McGibbney) has been part of several efforts to transition projects to Apache and grow their communities there, e.g. Apache OODT, Apache Open Climate Workbench, Apache Joshua (Incubating), Apache SensSoft (Incubating), Apache DRAT (Incubating). Most of the code developed under the SDAP umbrella was already open source prior to the Incubator effort, so we are familiar with the nuances of open source software.

= Relationships with Other Apache Products =
SDAP has a strong dependency upon a number of high-profile and lower-profile Apache products. Examples can be seen in the breakdown of External Dependencies. As we continue to grow SDAP within the Incubator, we will make efforts to share community stories, software advancements and possible improvements in our use of our Apache dependencies back to those project communities.

= Developers =
The SDAP project, and hence its developers, is currently funded through a NASA AIST follow-on grant with funding secured for the next ~18 months. There are currently no developers dedicated 100% of their time to the project; however, the same core team that does the work currently will continue to work on the project throughout the current funding period and after.
There is currently no business strategy aligned with SDAP; however, it is anticipated that future, as yet unsecured, funding may be directed to further feature advancement and project evangelism.

= Documentation =
Documentation is currently available in a number of locations, e.g. GitHub wiki, GitHub pages, etc., with each repository under the aist-oceanworks GitHub Org maintaining documentation through wikis attached to the repositories. Additionally, most of the SDAP sub-projects have been extensively documented within a plethora of formal academic publications across several academic communities. It is our intention, at the very least, to unify the GitHub wiki and GitHub pages documentation, which will most likely make up the sdap.apache.org website content.

= Initial Source =
Current source resides in several locations on GitHub and Bitbucket:
 * https://github.com/dataplumber/nexus (NEXUS, OceanXtremes, DOMS)
 * https://github.com/dataplumber/edge (EDGE)
 * https://github.com/aist-oceanworks/mudrod (MUDROD)
 * https://bitbucket.org/coaps_mdc/doms/src (DOMS)

= External Dependencies =
Each component of the Science Data Analytics Platform has its own dependencies. Documentation will be available for integrating them.

== MUDROD ==
'''Core'''
 * com.google.code.gson gson 2.5 compile jar false
 * org.jdom jdom 2.0.2 compile jar false
 * org.elasticsearch elasticsearch 5.2.0 compile jar false
 * org.elasticsearch elasticsearch-spark-20_2.11 5.2.0 compile jar false
 * joda-time joda-time 2.9.4 compile jar false
 * com.carrotsearch hppc 0.7.1 compile jar false
 * org.apache.spark spark-core_2.11 2.1.0 compile jar false
 * org.apache.spark spark-sql_2.11 2.1.0 compile jar false
 * org.apache.spark spark-mllib_2.11 2.1.0 compile jar false
 * org.scala-lang scala-library 2.11.8 compile jar false
 * org.codehaus.jettison jettison 1.3.8 compile jar false
 * commons-cli commons-cli 1.2 compile jar false
 * net.sf.opencsv opencsv 2.3 compile jar false
 * org.apache.jena jena-core 3.3.0 compile jar false
 * junit junit 4.12 test jar false

'''Service'''
 * gov.nasa.jpl.mudrod mudrod-core 0.0.1-SNAPSHOT compile jar false
 * javax.servlet javax.servlet-api 3.1.0 provided jar false
 * com.google.code.gson gson 2.5 compile jar false

'''Web'''
 * AngularJS - MIT License
 * BootstrapJS - MIT License
 * jQueryJS - MIT License
 * Underscore JS - MIT License

== DOMS ==
 * Apache Solr version 5.5.1 http://lucene.apache.org/solr/
 * EDGE https://github.com/dataplumber/edge
 * NetCDF4 http://unidata.github.io/netcdf4-python/
 * Python 3.5 (NOTE: only partial support for py2.7)

Non stdlib Python dependencies:
 * Jinja2==2.9.5
 * python-dateutil==2.6.0
 * cython==0.25.2
 * numpy==1.12.0
 * scipy==0.18.1
 * netCDF4==1.2.7
 * solrpy3
 * siphon==0.4.0
 * neo4j-driver==1.1.0
 * matplotlib==2.0.0
 * requests==2.13.0
 * shapely==1.5.17
 * flask==0.12
 * networkx==1.11
 * pyproj==1.9.5.1
 * blist==1.3.6

== NEXUS ==
'''Analysis'''
 * https://github.com/dataplumber/nexus/blob/master/analysis/package-list.txt
 * https://github.com/dataplumber/nexus/blob/master/analysis/requirements.txt

'''Client'''
 * https://github.com/dataplumber/nexus/blob/master/client/requirements.txt

'''Climatology'''
 * matplotlib
 * numpy
 * netCDF4
 * pathos (https://pypi.python.org/pypi/pathos)

'''Data-access'''
 * https://github.com/dataplumber/nexus/blob/master/data-access/requirements.txt

'''Nexus-ingest'''

''Dataset-tiler''
 * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/dataset-tiler/build/reports

''developer-box''
 * Just a collection of scripts/vagrant file used to stand up a developer instance of nexus ingestion. No dependencies to report

''Groovy-scripts''
 * Collection of Groovy scripts that can be used as part of data ingestion. They only rely on the standard Groovy library and the 'nexus-messages' project

''Nexus-messages''
 * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/nexus-messages/build/reports

''nexus-sink''
 * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/nexus-sink/build/reports

''nexus-xd-python-modules''
 * https://github.com/dataplumber/nexus/blob/master/nexus-ingest/nexus-xd-python-modules/package-list.txt
 * https://github.com/dataplumber/nexus/blob/master/nexus-ingest/nexus-xd-python-modules/requirements.txt

''spring-xd-python''
 * only python standard libraries are used

''tcp-shell''
 * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/tcp-shell/build/reports

'''tools/deletebyquery'''
 * https://github.com/dataplumber/nexus/blob/master/tools/deletebyquery/requirements.txt

= Required Resources =
Mailing Lists
 * private@sdap.incubator.apache.org
 * dev@sdap.incubator.apache.org
 * commits@sdap.incubator.apache.org

Git Repos
 * https://git-wip-us.apache.org/repos/asf/incubator-nexus.git
 * https://git-wip-us.apache.org/repos/asf/incubator-doms.git
 * https://git-wip-us.apache.org/repos/asf/incubator-mudrod.git

Issue Tracking
 * JIRA Science Data Analytics Platform (SDAP)

Continuous Integration
 * Jenkins builds on https://builds.apache.org/

Web
 * http://sdap.incubator.apache.org/
 * wiki at http://cwiki.apache.org

= Initial Committers =
The following is a list of the planned initial Apache committers (the active subset of the committers for the current repository on GitHub).
 * Lewis John McGibbney (lewismc@apache.org)
 * Vardis M. Tsontos (vardis.m.tsontos@jpl.nasa.gov)
 * Joseph C. Jacob (Joseph.C.Jacob@jpl.nasa.gov)
 * Ed Armstrong (edward.m.armstrong@jpl.nasa.gov)
 * Frank Greguska (greguska@jpl.nasa.gov)
 * Brian Wilson (brian.wilson@jpl.nasa.gov)
 * Chaowe Phil Yang (cyang3@gmu.edu)
 * Yongyao Jiang (yjiang8@gmu.edu)
 * Yun Li (yli38@gmu.edu)
 * Shawn R. Smith (smith@coaps.fsu.edu)
 * Jocelyn Elya (jelya@coaps.fsu.edu)
 * Mark Bourassa (bourassa@coaps.fsu.edu)
 * Thomas Cram (tcram@ucar.edu)
 * Thomas Huang (thomas.huang@jpl.nasa.gov)
 * Steven Worley (worley@ucar.edu)
 * Zaihua Ji (zji@ucar.edu)

= Affiliations =
NASA JPL
 * Lewis John McGibbney (lewismc@apache.org)
 * Vardis M. Tsontos (vardis.m.tsontos@jpl.nasa.gov)
 * Joseph C. Jacob (Joseph.C.Jacob@jpl.nasa.gov)
 * Ed Armstrong (edward.m.armstrong@jpl.nasa.gov)
 * Frank Greguska (greguska@jpl.nasa.gov)
 * Thomas Huang (thomas.huang@jpl.nasa.gov)
 * Brian Wilson (brian.wilson@jpl.nasa.gov)

George Mason University
 * Chaowe Phil Yang (cyang3@gmu.edu)
 * Yongyao Jiang (yjiang8@gmu.edu)
 * Yun Li (yli38@gmu.edu)

Center for Ocean-Atmospheric Prediction Studies, Florida State University
 * Shawn R. Smith (smith@coaps.fsu.edu)
 * Jocelyn Elya (jelya@coaps.fsu.edu)
 * Mark Bourassa (bourassa@coaps.fsu.edu)

Computational Information Systems Laboratory (CISL) / National Center for Atmospheric Research (NCAR)
 * Thomas Cram (tcram@ucar.edu)
 * Zaihua Ji (zji@ucar.edu)
 * Steven Worley (worley@ucar.edu)

= Sponsors =
= Champion =
 * Lewis McGibbney (NASA/JPL)

= Nominated Mentors =
 * TBD
 * TBD
 * TBD

= Sponsoring Entity =
The Apache Incubator

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney

--f403043ed42c2a8133055b4981ed--