incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "SensSoftProposal" by LewisJohnMcgibbney
Date Mon, 23 May 2016 17:32:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "SensSoftProposal" page has been changed by LewisJohnMcgibbney:

New page:
= SensSoft Proposal =

== Abstract ==
The Software as a Sensor™ (SensSoft) Project offers an open-source (ALv2.0) software tool
usability testing platform. It includes a number of components that work together to provide
a platform for collecting data about user interactions with software tools, as well as archiving,
analyzing and visualizing that data. Additional components allow for conducting web-based
experiments in order to capture this data within a larger experimental framework for formal
user testing. These components currently support JavaScript-based web applications, although the schema for “logging” user interactions can support mobile and desktop applications as well. Collectively, the Software as a Sensor Project provides an open source platform for assessing how users interact with technology, not just collecting what they interact with.

== Proposal ==
The Software as a Sensor™ Project is a next-generation platform for analyzing how individuals
and groups of people make use of software tools to perform tasks or interact with other systems.
It is composed of a number of integrated components:
 * User Analytic Logging Engine (User ALE) refers to a simple Application Programming Interface (API) and backend infrastructure. User ALE provides “instrumentation” for software tools, such that each user interaction within the application can be logged and sent as a JSON message to an Elasticsearch/Logstash/Kibana (Elastic Stack) backend.
   * The API provides a robust schema that makes user activities human readable, and provides
an interpretive context for understanding that activity’s functional relevance within the
application. The schema provides highly granular information best suited for advanced analytics.
This hierarchical schema is as follows:
     * Element Group: App features that share function (e.g., map group)
     * Element Sub: Specific App feature (e.g., map tiles)
     * Element Type: Category of feature (e.g., map)
     * Element ID: [attribute] id
     * Activity: Human imposed label (e.g., “search”) 
     * Action: Event class (e.g., zoom, hover, click)
   * The API can either be manually embedded in the app source code, or implemented automatically
by inserting a script tag in the source code.
   * Users can either set up their own Elastic Stack instance, or use Vagrant, a tool for provisioning virtual environments, to deploy a fully configured Elastic Stack instance to ship and ingest user activity logs and visualize their log data with Kibana.
   * RESTful APIs allow other services to access logs directly from Elasticsearch.
   * User ALE allows adopters to own the data they collect from users outright, and utilize
it as they see fit.
 * Distill is an analytics stack for processing user activity logs collected through User
ALE. Distill is fully implemented in Python, dependent on graph-tool to support graph analytics
and other external Python libraries to query Elasticsearch. The two principal functions of
Distill are segmentation and graph analytics:
   * Segmentation allows for partitioning of the available data along multiple axes. Subsets
of log data can be selected via their attributes in User ALE (e.g. Element Group or Activity),
and by users/sessions.  Distill also has the capability to ingest and segment data by additional
attributes collected through other channels (e.g. survey data, demographics). This allows adopters
to focus their analysis of log data on precisely the attributes of their app (or users) they
care most about.
   * Distill’s usage metrics are derived from a probabilistic representation of the time
series of users’ interactions with the elements of the application. A directed network is
constructed from the representation, and metrics from graph theory (e.g. betweenness centrality,
in/out-degree of nodes) are derived from the structure. These metrics provide adopters ways
of understanding how different facets of the app are used together, and they capture canonical
usage patterns of their application. This broad analytic framework provides adopters a way
to develop and utilize their own metrics.
 * The Test Application Portal (TAP) provides a single, user-friendly interface to Software
as a Sensor™ Project components, including visualization functionality for Distill Outputs
leveraging Django, React, and D3.js. It has two key functions:
   * It allows adopters to register apps, providing metadata regarding location, app name,
version, etc., as well as permissions regarding who can access user data. This information
is propagated to all other components of the larger system. 
   * The portal also stages visualization libraries that make calls to Distill. This allows
adopters to analyze their data as they wish to; its “dashboard” feel provides a way
to customize their views with adopter-generated widgets (e.g., D3 libraries) beyond what is
included in the initial open source offering.
 * The Subject Tracking and Online User Testing (STOUT) application is an optional component
that turns Software as a Sensor™ Technology into a research/experimentation enterprise.
Designed for psychologists and HCI/UX researchers, STOUT allows comprehensive human subjects
data protection, tracking, and tasking for formal research on software tools. STOUT is primarily Python, with a Django back end for authentication, permissions, and tracking; MongoDB for storage; and D3 for visualization. STOUT includes a number of key features:
   * Participants can register in studies of software tools using their own preferred credentials.
As part of registration, participants can be directed through human subjects review board
compliant consent forms before study enrollment.
   * STOUT stores URLs to web/network-accessible software tools as well as URLs to third-party survey services (e.g., SurveyMonkey). This allows adopters to pair software tools with tasks, and to collect survey data and comments from participants prior to, during, or following testing with software tools.
   * STOUT tracks participants’ progress internally, and by appending a unique identifier and task identifier to URLs. This information can be passed to other processes (e.g., User ALE), allowing for disambiguation between participants and tasks in experiments on the open web.
   * STOUT supports between- and within-subjects experimental designs, with random assignment to experimental conditions. This allows for testing across different versions of applications.
   * STOUT can also use Django output (e.g., task complete) to automate other processes, such as polling applications that retrieve third-party form data through APIs (e.g., SurveyMonkey), and Python or R scripts that provide automated post-processing of task or survey data.
   * STOUT provides adopters a comprehensive dashboard view of data collected and post-processed
through its extensions; in addition to user enrollment, task completion, and experiment progress
metrics, STOUT allows adopters to visualize distributions of scores collected from task and
survey data.
Each component is available through its own repository to support organic growth for each
component, as well as growth of the whole platform’s capabilities.
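The logging-and-analysis flow described above can be sketched in Python. This is a minimal illustration, not User ALE's actual wire format or Distill's implementation: the field names are stand-ins mirroring the hierarchical schema (Element Group, Element Sub, Element Type, Element ID, Activity, Action), and simple in/out-degree counts stand in for Distill's richer graph metrics.

```python
from collections import defaultdict

# Hypothetical User ALE messages; field names are illustrative,
# mirroring the schema described above, not the actual wire format.
logs = [
    {"elementGroup": "map group", "elementSub": "map tiles", "elementType": "map",
     "elementId": "tile-3", "activity": "search", "action": "zoom"},
    {"elementGroup": "map group", "elementSub": "map tiles", "elementType": "map",
     "elementId": "tile-3", "activity": "search", "action": "hover"},
    {"elementGroup": "search group", "elementSub": "search box", "elementType": "text",
     "elementId": "query", "activity": "search", "action": "click"},
]

def segment(logs, attribute, value):
    """Distill-style segmentation: keep only logs matching one attribute value."""
    return [log for log in logs if log.get(attribute) == value]

def transition_degrees(logs):
    """Build a directed graph over element IDs from the time-ordered log
    stream and return in/out-degree per node (a simple stand-in for the
    graph metrics, such as betweenness centrality, described above)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for prev, cur in zip(logs, logs[1:]):
        out_deg[prev["elementId"]] += 1
        in_deg[cur["elementId"]] += 1
    return dict(out_deg), dict(in_deg)

search_logs = segment(logs, "activity", "search")
out_deg, in_deg = transition_degrees(search_logs)
print(out_deg)  # {'tile-3': 2}
print(in_deg)   # {'tile-3': 1, 'query': 1}
```

In the real platform, segmentation runs as queries against the Elasticsearch backend and the metrics come from graph theory; the sketch only shows the shape of the pipeline from raw interaction logs to usage structure.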

== Background and Rationale ==
A number of factors make this a good time for an Apache project focused on machine translation
(MT): the quality of MT output (for many language pairs); the computing resources
available on average computers, relative to the needs of MT systems; and the availability of a number
of high-quality toolkits, together with a large base of researchers working on them.

Over the past decade, machine translation (MT; the automatic translation of one human language
to another) has become a reality. The research into statistical approaches to translation
that began in the early nineties, together with the availability of large amounts of training
data, and better computing infrastructure, have all come together to produce translation
results that are “good enough” for a large set of language pairs and use cases. Free services
like [[|Bing Translator]] and [[|Google Translate]] have made translation available to the average person through direct interfaces
and through tools like browser plugins, and sites across the world with higher translation
needs use them to translate their pages automatically.

MT does not require the infrastructure of large corporations in order to produce feasible
output. Machine translation can be resource-intensive, but need not be prohibitively so. Disk
and memory usage are mostly a matter of model size, which for most language pairs is a few
gigabytes at most, at which size models can provide coverage on the order of tens or even
hundreds of thousands of words in the input and output languages. The computational complexity
of the algorithms used to search for translations of new sentences is typically linear in
the number of words in the input sentence, making it possible to run a translation engine
on a personal computer.
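The linearity claim can be illustrated with a toy sketch: a naive word-by-word substitution "decoder" that does constant work per input token, hence linear time overall. Real decoders such as Joshua use phrase tables, language models, and beam search rather than a flat dictionary, but their search remains roughly linear in sentence length.

```python
# Toy word-by-word "decoder": constant work per input token, so translation
# time grows linearly with sentence length. Real decoders (e.g. Joshua) use
# far richer models and search, but scale similarly in input length.
toy_model = {"hallo": "hello", "welt": "world"}  # illustrative entries only

def toy_translate(sentence, model):
    # Unknown words are passed through unchanged, a common MT fallback.
    return " ".join(model.get(tok, tok) for tok in sentence.lower().split())

print(toy_translate("Hallo Welt", toy_model))  # hello world
```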

The research community has produced many different open source translation projects for a
range of programming languages and under a variety of licenses. These projects include the
core “decoder”, which takes a model and uses it to translate new sentences between the
language pair the model was defined for. They also typically include a large set of tools
that enable new models to be built from large sets of example translations (“parallel data”)
and monolingual texts. These toolkits are usually built to support the agendas of the (largely)
academic researchers that build them: the repeated cycle of building new models, tuning model
parameters against development data, and evaluating them against held-out test data, using
standard metrics for testing the quality of MT output.
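The evaluate-against-held-out-data step of that cycle can be sketched with a toy metric: clipped unigram precision, a much-simplified cousin of standard metrics such as BLEU. The sentences below are made-up illustrations, not data from any real test set or toolkit.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Toy MT quality metric: the fraction of hypothesis tokens that also
    appear in the reference, with counts clipped to the reference counts.
    A much-simplified cousin of standard metrics such as BLEU."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    matched = sum(min(count, ref[tok]) for tok, count in hyp.items())
    total = sum(hyp.values())
    return matched / total if total else 0.0

# Score a hypothetical system output against a held-out reference
# (illustrative sentences only):
score = unigram_precision("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 0.833
```

In practice, toolkits report corpus-level metrics over full held-out test sets, and the tuning step adjusts model parameters to maximize exactly such a metric on development data.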

Together, these three factors—the quality of machine translation output, the feasibility
of translating on standard computers, and the availability of tools to build models—make
it reasonable for end users to use MT as a black-box service, and to run it on their personal computers.

These factors make it a good time for an organization with the status of the Apache Foundation
to host a machine translation project.

== Current Status ==
Joshua was originally ported from David Chiang’s Python implementation of Hiero by Zhifei
Li, while he was a Ph.D. student at Johns Hopkins University. The current version is maintained
by Matt Post at Johns Hopkins’ Human Language Technology Center of Excellence. Joshua has
made many releases, with over 20 source code tags. The last release of Joshua was
6.0.5, on November 5th, 2015.

== Meritocracy ==
The current developers are familiar with meritocratic open source development at Apache. Apache
was chosen specifically because we want to encourage this style of development for the project.

== Community ==
Joshua is used widely across the world. Perhaps its biggest (known) research / industrial
user is the Amazon research group in Berlin. Another user is the US Army Research Lab. No
formal census has been undertaken, but posts to the Joshua technical support mailing list,
along with the occasional contributions, suggest small research and academic communities spread
across the world, many of them in India.

During incubation, we will explicitly seek to increase our usage across the board, including
academic research, industry, and other end users interested in statistical machine translation.

== Core Developers ==
The current set of core developers is fairly small, having shrunk as some core student
participants graduated from Johns Hopkins. However, Joshua is used fairly widely, as
mentioned above, and there remains a commitment from the principal researcher at Johns Hopkins
to continue to use and develop it. Joshua has seen a number of new community members become
interested recently due to a potential for its projected use in a number of ongoing DARPA
projects such as XDATA and Memex.

== Alignment ==
Joshua is currently copyright (c) 2015 Johns Hopkins University, all rights reserved, and licensed
under the BSD 2-clause license. It is our intention to relicense this code under
ALv2.0, which would permit expanded and increased use of the software within Apache projects.
There is currently an ongoing effort within the Apache Tika community to utilize Joshua within
Tika’s Translate API, see [[|TIKA-1343]].

== Known Risks ==

=== Orphaned products ===
At the moment, regular contributions are made by a single contributor, the lead maintainer.
He (Matt Post) plans to continue development for the next few years, but it is still a single
point of failure, since the graduate students who worked on the project have moved on to jobs,
mostly in industry. However, our goal is to mitigate this risk by growing the community
within Apache, starting with users and participants from NASA JPL.

=== Inexperience with Open Source ===
The teams at both Johns Hopkins and NASA JPL have experience with many OSS software projects
at Apache and elsewhere. We understand "how it works" here at the foundation.

== Relationships with Other Apache Products ==
Joshua depends on Hadoop, and is also included as a plugin in Apache Tika. We
are also interested in coordinating with Spark and with other projects
needing MT services for language translation.

== Developers ==
Joshua has only one regular developer, who is employed by Johns Hopkins University. NASA JPL
(Mattmann and McGibbney) has been contributing lately, including a Brew formula and other
contributions to the project through the DARPA XDATA and Memex programs.

== Documentation ==
Documentation and publications related to Joshua can be found at The source
for the Joshua documentation is currently hosted on Github at

== Initial Source ==
Current source resides at Github: (the main decoder and toolkit)
and (the grammar extraction tool).

== External Dependencies ==
Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0) and KenLM (LGPL
2.1) are run-time decoder dependencies (one of which is needed for translating sentences with
pre-built models). The rest are dependencies for the build system and pipeline, used for constructing
and training new models from parallel text.

Apache projects:
 * Ant
 * Hadoop
 * Commons
 * Maven
 * Ivy

There are also a number of other open-source projects with various licenses that the project
depends on both dynamically (runtime), and statically.

=== GNU GPL 2 ===
 * Berkeley Aligner:

=== LGPL 2.1 ===
 * KenLM:

=== Apache 2.0 ===
 * BerkeleyLM:

=== GNU GPL ===
 * GIZA++:

== Required Resources ==
 * Mailing Lists

 * Git Repos

 * Issue Tracking
   * JIRA Joshua (JOSHUA)

 * Continuous Integration
   * Jenkins builds on

 * Web
   * wiki at

== Initial Committers ==
The following is a list of the planned initial Apache committers (the active subset of the
committers for the current repository on Github).

 * Matt Post (
 * Lewis John McGibbney ( 
 * Chris Mattmann (
 * Henry Saputra (
 * Tommaso Teofili (
 * Tom Barber (

== Affiliations ==

 * Johns Hopkins University
   * Matt Post

   * Chris Mattmann
   * Lewis John McGibbney

== Sponsors ==

=== Champion ===
 * Chris Mattmann (NASA/JPL)

=== Nominated Mentors ===
 * Paul Ramirez
 * Lewis John McGibbney
 * Chris Mattmann
 * Tom Barber
 * Henri Yandell

== Sponsoring Entity ==
The Apache Incubator
