incubator-general mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: [Vote] accept UIMA as a podling - #2
Date Wed, 27 Sep 2006 09:07:07 GMT
+1


On Sep 26, 2006, at 7:17 PM, Ian Holsman wrote:

>
>
> issues addressed in this release:
> 1. updated proposal included
> 2. The first paragraph explains it to a layperson
> 3. OASIS issue addressed
>
>
> [ ] +1 Accept UIMA as an Incubator podling
> [ ]   0 Don't care
> [ ] -1 Reject this proposal for the following reason:
>
>
> ----8<-------Proposal------8<------
>
>
> Hello everyone -
>
> We are submitting this proposal to the community for a
> new project in the incubator, and look forward to starting to work  
> with
> this community.
>
> This is a slightly modified and extended version of the proposal that has
> already been posted to general@incubator.apache.org.  The whole mail thread
> can be found
> [http://www.nabble.com/Proposal-for-a-new-incubation-project%3A-Unstructured-Information-Management-Architecture---UIMA-tf2154324.html here].
>
> If you don't feel like reading the whole thread, the main question that
> came up was: this is all very well, but what does it really '''do'''?
> Attempts to answer that question were made
> [http://www.nabble.com/Re%3A-Proposal-for-a-new-incubation-project%3A-Unstructured-Information-Management-Architecture---UIMA-p5986403.html here]
> and
> [http://www.nabble.com/Re%3A-Proposal-for-a-new-incubation-project%3A-Unstructured-Information-Management-Architecture---UIMA-p5987788.html here].
> We have since worked some of these into the proposal itself.
>
> ----
>
> = Proposal for Incubation Project: Unstructured Information Management Architecture - UIMA =
>
> == Abstract ==
>
> UIMA is a component framework for the analysis of unstructured  
> content such as text, audio and video.  It comprises an SDK and  
> tooling for composing and running analytic components written in  
> Java and C++.
>
>
> == Proposal: Unstructured Information Management Architecture framework ==
>
> Unstructured Information Management applications are software  
> systems that analyze large volumes of unstructured information in  
> order to discover knowledge that is relevant to an end user.  We  
> propose UIMA, a framework and SDK for developing such  
> applications.  An example UIM application might ingest plain text  
> and identify entities, such as persons, places, organizations; or  
> relations, such as works-for or located-at.  UIMA enables such an  
> application to be decomposed into components, for example  
> ''"language identification"'' -> ''"language specific  
> segmentation"'' -> ''"sentence boundary detection"'' -> ''"entity  
> detection (person/place names etc.)"''.  Each component must  
> implement interfaces defined by the framework and must provide self- 
> describing metadata via XML descriptor files.  The framework  
> manages these components and the data flow between them.   
> Components are written in Java or C++; the data that flows between  
> components is designed for efficient mapping between these  
> languages.  UIMA additionally provides capabilities to wrap  
> components as network services, and can scale to very large volumes  
> by replicating processing pipelines over a cluster of networked nodes.
>
> This framework has already attracted a following among government,  
> commercial, and academic institutions who previously developed  
> analysis algorithms, but were unable to easily build on each  
> other's works, and who want to be able to evolve their applications  
> by independently upgrading parts, as better technology becomes  
> available.  Applications built with this framework are being used  
> with plain text, audio streams, and image/video streams,  
> identifying entities and relations, converting speech to text,  
> translating into different languages, and determining properties of  
> images.
>
> The UIMA framework runs components in a flow, passing a common data  
> object containing unstructured information (free text, audio,  
> video, etc.) through the components.  Each component examines the  
> unstructured information and data added by other components, and  
> adds data of its own.  The framework mandates a standardized form  
> of the data being passed, and a standardized form of the interfaces  
> to the components.
>
> We propose a project to develop, implement, support and enhance this
> framework and, over time, other implementations that comply with the
> UIMA standard, which has been submitted for standardization work within
> [http://www.oasis-open.org OASIS].  Members of this community are
> encouraged to participate in that effort as well; OASIS has an open
> approach to granting Technical Committee voting rights to members of
> OASIS, described here:
> http://www.oasis-open.org/committees/process.php#2.4
>
> The proposal includes both the framework and tools to develop, describe,
> compose and deploy UIMA-based components and applications. The initial
> work will be based on the UIMA Version 2 framework code developed by IBM;
> snapshots of each release of this code are currently made available on
> [http://sourceforge.net/projects/uima-framework SourceForge]. The
> Source``Forge versions would be stabilized in maintenance mode if we are
> successful in moving to Apache.
>
> The framework is not specific to any IDE or platform, and does not depend
> on other middleware.
>
> Background:
>
> Databases are core components of nearly all applications; they  
> store information in structured tables.  But more and more of the  
> available digital data is unstructured (e.g. email, web documents,  
> images, audio clips, video streams) with little information  
> (metadata) attached to explain its content or context.  Although  
> many applications have been built to process unstructured data,  
> they have either managed it as a BLOB or they have developed  
> isolated applications for analyzing the content.  In the absence of  
> a standardized means for analytical applications to share insights  
> extracted from the content, analytical applications cannot build  
> upon one another. As a result, the industry has barely begun to tap  
> the value locked in unstructured information.
>
> Standardization is key to achieving component interoperability,  
> with capabilities to mix components developed in different places  
> and in Java, C++ and other languages.  The Unstructured Information  
> Management Architecture defines standards for component  
> interoperability and application composition that will provide this  
> needed unifying standard, and allow a variety of framework  
> implementations to exist, while preserving the goal of unstructured  
> information analytic component reuse.
>
> UIMA was built to help developers create solutions that get more value
> from unstructured information more quickly and at lower cost by making
> it easy to reuse and combine analytic modules from different sources
> into new analytic applications. The architecture and the framework have
> been validated through work with the US agency DARPA, which is using it
> as a standard for key projects with several universities involved in
> advanced linguistics analysis, such as Carnegie Mellon, Columbia,
> Stanford and University of Massachusetts.  Other organizations, such as
> the Mayo Clinic and Sloan Kettering, are also building efforts around
> UIMA.  In addition, over 15 software vendors, including companies such
> as Inxight, Attensity, Clear``Forest, Temis, SPSS, SAS, Cognos, Endeca,
> Factiva and others, have announced plans to support UIMA.
>
> The UIMA framework (binary and/or source code) has been downloaded over
> 8000 times from IBM alphaWorks (http://www.alphaworks.ibm.com/tech/uima)
> or Source``Forge (http://uima-framework.sourceforge.net).
>
> == Rationale ==
>
> We believe that moving the UIMA framework development to the Apache  
> development community will lead to faster innovation, better  
> integration with other open source software, and broader adoption  
> of UIMA, accelerating the industry's ability to get the most value  
> from text, audio, and video content. The UIMA framework is becoming  
> attractive to developers who want to build components; we believe  
> that having UIMA on Apache will encourage the development of a  
> basic set of open source components that will jumpstart these  
> developers' efforts. One of the first components we see possible  
> synergy with is a search component based on Apache Lucene that  
> would enable semantic search.  We like the concept of the Lucene  
> Sandbox as a way to encourage innovation around UIMA, and would  
> envision something similar for this project.
>
>
> == Initial Goals ==
>
> Some initial work we see in the incubator includes the following:
>
> * redoing the parts of the tooling that were done as derivative  
> works of Eclipse source code, to
> enable everything to be licensable under the Apache license
> * extending the framework to better support "scale-out"
> * extending the framework to better align with the emerging UIMA  
> Standards work
> * extending the framework to support XMI-based SOAP and/or other  
> service interfaces
> * extending the framework to support OSGi-based approaches to  
> componentization and packaging
> * exploring embeddings of the framework within other interested  
> Apache projects, including synergies with Lucene
> * providing aids to the community to migrate from previous versions  
> of the framework to the Apache version
> * setting up community support: hosting a facility similar to the  
> Lucene Sandbox to encourage innovation and
> experimentation; establishing a wiki and some process to allow  
> better documentation to be developed by the community,
> and linking our existing XHTML documentation via an XSL transform  
> to Apache FOP
>
>
> == Current Status ==
> === Meritocracy ===
>
> Meritocracy seems to us an ideal way to grow the community of
> developers around UIMA: it is a controlled, rational way to give those
> who contribute positively more ability to contribute directly.  This
> approach also gives contributors one of the best reasons to join the
> community of volunteers - to be recognized for the merit of their
> contributions.
>
> === Community ===
> Currently, the UIMA Framework development is being done by IBM, with
> input from a group of early adopters in industry and government.  Going
> forward, we see IBM continuing to support several committers working on
> it.  We have already begun talking with people outside of IBM who have
> expressed interest in contributing towards the development.  This
> includes members of academic institutions, people working for some of
> the software vendors that have announced plans to support UIMA, and
> others from companies that have expressed interest since initial
> announcements about our open source plans.  Multiple non-IBM people
> have already expressed a desire to become committers.
>
> === Core Developers ===
> The previous core developers of UIMA are Adam Lally, Thilo Goetz,  
> Marshall Schor, Edward Epstein, Jaroslaw Cwiklik and Thomas  
> Hampp.   Many others have also contributed.  The developers come  
> from both the Research and Development parts of IBM.
>
> === Alignment ===
> UIMA has significant synergy with search applications, and we  
> expect to see integration with Lucene in the future. UIMA makes use  
> of the Apache Portable Runtime (APR) for C++ support.  It is  
> designed to be embeddable into other frameworks, such as web  
> application servers.  Part of UIMA is Eclipse-based tooling.  We  
> use ANT for build scripting.   UIMA has support for various  
> language bindings including C++ and Java; we also have more limited  
> bindings for Perl, Python, and TCL.  UIMA uses Web Services as part  
> of its approach to wiring up components in its domain.  It makes  
> use of XML services such as Xerces and Xalan.
>
> The development of UIMA has been based on merit with open  
> discussion among a distributed team of developers, from both  
> Research and Development organizations.
>
> === License ===
>
> The current license for the source code is CPL, with a small number  
> of files licensed under the EPL (Eclipse Public License), because  
> these were created as "derivative works" of existing Eclipse open  
> source code.  When the code base is moved to Apache, it will be  
> relicensed under the Apache license, except for the small number of  
> files licensed under the EPL as derivative works of Eclipse source  
> files.  We plan to work in the incubator to redo these parts, so  
> the entire offering can be licensed under the Apache license.
>
> The distribution for the C++ enablement layer includes the open source
> component ICU (a Unicode package), which has its own license.  We plan
> to work with the community to properly make use of this non-Apache
> licensed component. Our current vision for the future of UIMA has it
> aligning with and incorporating other standards-based open source
> components/protocols, some of which may have licensing other than the
> Apache license (for example, the XML Metadata Interchange (XMI), and
> the EMF ECore Model from Eclipse); we will work with the community in
> figuring out how to move forward on this.
>
> === Other IP ===
>
> When we asked OASIS to set up a Technical Committee chartered to
> develop a platform-independent specification for text and multi-modal
> analysis, we specified that it be set up under the "RF on Limited
> Terms" mode of the OASIS IP Policy.  "RF" means Royalty Free, and
> "Limited Terms" means that companies working with us on the Technical
> Committee are restricted in adding additional terms.
>
> These are the most liberal terms and make any Essential Claims  
> available to ALL and ROYALTY FREE.
> For the details please refer to:
>
> * http://www.oasis-open.org/who/ipr/ipr_faq.php
> * http://www.oasis-open.org/who/intellectualproperty.php
>
> Ultimately of course, there is always a risk that someone in the  
> world holds a patent that can be claimed as Essential. The most any  
> standards organization can do is govern the behavior of those who  
> participate in its work and publicly document the licensing  
> commitment of all participants.
>
> == Known Risks ==
>
> === Orphaned Software ===
>
> UIMA has been in active development for 5 years.  The community of  
> users has steadily grown, and there are now significant commercial  
> and research organizations actively using it.  UIMA is embedded in  
> IBM software products and is delivered through IBM services  
> engagements. IBM has developers assigned to it, and is continuing  
> to support its development.  In addition, several people outside of  
> IBM have already expressed interest in working on UIMA, and have  
> been providing IBM with initial feedback. One of the objectives of  
> starting this Apache project is to provide a meritocratic structure  
> for those people to begin more actively contributing to UIMA.
>
> === Inexperience with Open Source ===
>
> The individuals working on this software have backgrounds as IBM
> software developers.  While many of them have experience working with
> open source software, none of them has had extensive experience
> contributing to other open source software.  However, IBM as an
> organization has extensive experience contributing to open source
> projects and will make available resources to provide guidance to the
> developers working on this project.
>
> === Homogenous Developers (work for same company?) ===
>
> Currently all the developers work for IBM, although they come from  
> different geographically dispersed organizations within IBM.  We  
> will reach out during the incubation time to get others to  
> contribute; we have already received interest from several parties.
>
> === Reliance on salaried developers ===
>
> Currently the developers are paid employees of IBM.
>
> === Relationships with Other Apache Products ===
>
> We make use of several Apache components (SOAP / Web Services, XML
> (Xerces, Xalan), languages (Perl), scripting languages (ANT), Apache
> Portable Runtime).  In addition, UIMA has been embedded in other
> frameworks, such as web application servers, and integrated with search
> engines.  We are exploring Lucene extensions that could take advantage
> of UIMA processed data.  We are currently investigating and prototyping
> some software packaging concepts based on OSGi; the Apache Incubator
> project Felix may have relevance as we go forward.  The documentation
> is being moved to XHTML, and we plan to use Apache FOP for producing
> PDF reference materials.
>
> === An Excessive Fascination with the Apache Brand ===
> UIMA is already being adopted by a wide cross section of users,  
> both commercial and academic, world-wide. Our experience shows that  
> analytic modules can be reused and combined through UIMA making it  
> easier and faster for developers to build new analytic applications  
> for specific industries or domains. Given the diversity of content  
> and analytics that will be required to address the multitude of  
> opportunities - from military intelligence to quality assurance to  
> contact center analytics -- growing this infrastructure so that it  
> better aligns with other major Open Source communities should help  
> accelerate industry's ability to get value from content assets.
>
> We believe that the Apache community of developers has the  
> experience, background, visibility, and synergistic resources to  
> encourage and foster a vibrant developer community around this  
> project.
>
>
> == Documentation ==
>
> There is a combined Introduction, Conceptual Overview, Tutorial, Tools
> and Framework User's Guides and References, downloadable from
> http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference_2.0.pdf
>
>
> == Scope of the project ==
>
> The project will develop implementations of the UIMA architecture
> (which is concurrently being submitted to the OASIS standards process),
> supporting the breadth of platforms that developers working in this
> field are using, including Java, C++, Perl, Python and TCL; and
> utilities and tooling to support component and application developers
> and assemblers / packagers.  It will initially include the Java UIMA
> framework for UIMA Version 2 (you can see a snapshot of the Version 2
> release on Source``Forge; the delivered code would be this code base
> plus normal incremental bug fixes and improvements), plus additional
> components (mainly documentation and test cases, which are not
> currently on Source``Forge).  Over time, the project is expected to
> grow to include supporting various embeddings and integrations with
> other Apache components such as search engines and web application
> frameworks.
>
> Over time, we envision the project becoming an umbrella for related
> open-source work around UIMA, including things like open-source
> pre-annotated corpora, and hosting a facility similar to the Lucene
> Sandbox to encourage innovation and experimentation.
>
> The UIMA framework is primarily a set of libraries (in Java, C++,  
> Perl, etc.), test cases, and UIMA utilities and tools (scripts,  
> plugins, executables, etc.) used to build, test and debug UIMA  
> analytic components.  The tooling includes several Eclipse platform  
> plugins.
>
> == Initial source ==
>
> The source currently is maintained in IBM internal software control  
> systems, with a copy of each release placed on SourceForge.  At the  
> time of launch, we plan to contribute the latest version of the  
> code base (with some renaming of package prefixes to reflect  
> apache.org), test cases, build files, and documentation, under the  
> terms specified in the ASF Corporate Contributor License.  We plan  
> to donate the existing C++ enablement layer and the support for  
> Perl, Python, and TCL a few months later than the initial donation;  
> this delay is to give us time to finish preparing that code base  
> for Open Source.
>
> == ASF resources to be created ==
>
> Mailing lists:
> * uima-dev
> * uima-commits
> * uima-user (we already have a substantial user community and  
> expect them to turn up at Apache
> soon after we've hopefully been accepted into the incubator)
>
> For other resources such as Subversion repository, JIRA etc. we  
> hope for guidance from our mentors.
>
> == Initial Set of Committers ==
>
> * Michael Baessler (mba@michael-baessler.de)
> * Edward Epstein (eddie@aewatercolors.com)
> * Thilo Goetz  (twgoetz@gmx.de)
> * Adam Lally  (alally@alum.rpi.edu)
> * Marshall Schor (msa@schor.com)
>
> === Sponsor ===
>
> We are requesting the Incubator to sponsor this.  Our current  
> vision is that it will become a top level project (other projects  
> that develop UIMA components could become subprojects, for instance).
>
> === Mentors ===
>
> * Sam Ruby (ruby@apache.org)
> * Ken Coar (coar@apache.org)
> * Ian Holsman (lists@holsman.net)
>
> === Section 6: Open Issues for Discussion ===
>
>
> --
> Ian Holsman
> Ian@Holsman.net
> http://garden-gossip.com/ -- what's in your garden?
>
>



