incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Trivial Update of "Any23Proposal" by PaulRamirez
Date Thu, 22 Sep 2011 14:29:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "Any23Proposal" page has been changed by PaulRamirez:
http://wiki.apache.org/incubator/Any23Proposal?action=diff&rev1=45&rev2=46

Comment:
Minor grammar changes

  = Any23 =
- 
  == Abstract ==
+ This proposal is about ''Anything To Triples'' (Any23 for short), a Java library, a Web
service and a set of command line tools to extract and validate structured data in
[[http://www.w3.org/RDF/|RDF]] format from a variety of Web documents and markup formats.
Any23 is what is informally called an ''RDF Distiller''.
- The following proposal is about ''Anything To Triples'' (shortly Any23) defined as a Java
library, 
- a Web service and a set of command line tools to extract and validate structured data 
- in [[http://www.w3.org/RDF/|RDF]] format from a variety of Web documents and markup formats.

- Any23 is what it is informally named an ''RDF Distiller''.
  
  == Proposal ==
+ Any23 "Anything to Triples" is a library written in Java 6 and released under the Apache
2.0 License. It provides a set of extractors for scraping semantic markup (such as [[http://microformats.org/|Microformats]],
[[http://www.w3.org/TR/rdfa-syntax/|RDFa]] and [[http://www.w3.org/TR/microdata/|Microdata]])
from several sources (HTML4, XHTML5, CSV), a set of data validations, and a set of parsers and
writers to handle the main RDF transport formats (RDF/XML, N-Triples, N-Quads, Turtle). The
library provides a command line tool for data extraction, conversion and validation,
and a REST service implementation. The library is plugin based, allowing the hot loading of
new extractors and validators. Any23 enables third-party developers to access structured
data from Web pages without needing to implement ad-hoc scraping techniques. In this sense,
Any23 will relieve developers from building complex solutions when developing data acquisition
pipelines and processes targeted at semantically marked-up Web data.
- Any23 "Anything to Triples" is a library written in Java 6 and released under the Apache
2.0 License.
- It provides a set of extractors for scraping semantic markup (such as [[http://microformats.org/|Microformats]],
[[http://www.w3.org/TR/rdfa-syntax/|RDFa]] and [[http://www.w3.org/TR/microdata/|Microdata]])

- from several sources (HTML4, XHTML5, CSV), a set of data validations, a set of parsers and
writers to handle the
- main RDF transport formats (RDFXML, Ntriples, NQuads, Turtle). 
- The library provides a command line tool for dealing with data extraction, conversion and
validation,
- and a REST service implementation. The library is plugin based, allowing the hot loading
of new extractors and validators.
- Any23 enables third-parties developers to access structured data from Web pages without
the need of implementing ad-hoc scraping
- techniques. In this sense, Any23 will relieve developers from build complex solutions when
developing data acquisition
- pipelines and processes targeted to semantically marked-up Web data. 
  
  == Background ==
+ Any23 was initially developed at [[http://www.deri.ie/|DERI (Digital Enterprise Research
Institute)]] as the main component of the RDF extraction pipeline used in [[http://sindice.com/|Sindice
(the Semantic Web Index)]], and is now being developed jointly with [[http://www.fbk.eu/|FBK (Fondazione
Bruno Kessler)]]. At present, the official Any23 [[http://developers.any23.org|developers
page]] contains all the documentation, while the code is maintained on [[http://code.google.com/p/any23/|Google
Code]]. An official up-to-date showcase [[http://any23.org|demo]] is also available.
- Any23 has been initially developed at [[http://www.deri.ie/|DERI (Digital Enterprise Research
Institute)]], 
- as main component of the RDF extraction pipeline used in [[http://sindice.com/|Sindice (the
Semantic Web Index)]],
- now is evolved in joint effort with [[http://www.fbk.eu/|FBK (Fondazione Bruno Kessler)]].
At present time the Any23
- official [[http://developers.any23.org|developers page]] contains all the documentation,
while the code is maintained
- on [[http://code.google.com/p/any23/|Google Code]]. An official up-to-date showcase [[http://any23.org|demo]]
is also available.
  
  == Rationale ==
+ Providing and maintaining a robust, standard and up-to-date library for extracting and validating
semantic markup from heterogeneous sources would greatly benefit the entire Open
Source Community. Researchers and academic projects have been adopting RDF-related technologies
for years, while the industry is now moving toward Semantic Web technologies more
concretely. Several industry initiatives related to the [[http://en.wikipedia.org/wiki/Semantic_Web|Web
of Data]] are taking place in these months. [[http://schema.org|Schema.org]], for example,
is an initiative sponsored by [[http://www.google.com/about/corporate/company/|Google Inc]],
[[http://info.yahoo.com/center/us/yahoo/|Yahoo Inc]] and [[http://www.microsoft.com/about/companyinformation/en/us/default.aspx|Microsoft
Corporation]] to structure data in a harmonized way on [[http://dev.w3.org/html5/spec/Overview.html|HTML5]]
pages. [[http://schema.org|Schema.org]] builds on the native [[http://dev.w3.org/html5/md/|HTML5
Microdata]] specification. [[http://ogp.me/|OpenGraphProtocol]] is the open standard
sponsored by [[https://www.facebook.com/pages/Facebooking/114721225206500|Facebook Inc]]
to include metadata in HTML page headers. [[http://ogp.me/|OpenGraphProtocol]], initially
based on [[http://www.w3.org/TR/xhtml-rdfa-primer/|RDFa]], allows describing the content
of a Web page, and its underlying vocabulary can be directly represented using RDF.
- Provide and maintain a robust, standard and updated library for extracting and validating
- semantic markup from heterogeneous sources would provide large benefits to the entire Open
Source
- Community. Researchers and academic projects are adopting RDF related technologies from
years 
- while the industry is actually moving toward Semantic Web technologies with more concreteness.
- Several industry initiatives related to the [[http://en.wikipedia.org/wiki/Semantic_Web|Web
of Data]] 
- are taking place in the these months. [[http://schema.org|Schema.org]], for example, is
an initiative sponsored by 
- [[http://www.google.com/about/corporate/company/|Google Inc]], [[http://info.yahoo.com/center/us/yahoo/|Yahoo
Inc]] 
- and [[http://www.microsoft.com/about/companyinformation/en/us/default.aspx|Microsoft Corporation]]

- to structure the data in a harmonized way on [[http://dev.w3.org/html5/spec/Overview.html|HTML5]]
pages.
- [[http://schema.org|Schema.org]] leverages on the [[http://dev.w3.org/html5/md/|HTML5 Microdata]]
native
- specification. [[http://ogp.me/|OpenGraphProtocol]] is the open standard sponsored by 
- [[https://www.facebook.com/pages/Facebooking/114721225206500|Facebook Inc]] to include metadata
in
- HTML page headers. 
- [[http://ogp.me/|OpenGraphProtocol]], initially based on [[http://www.w3.org/TR/xhtml-rdfa-primer/|RDFa]],
allows to
- describe the content of a Web page and its underlying vocabulary could be directly represented
using RDF.
  
  = Current Status =
- 
  == Meritocracy ==
+ The historical Any23 team believes in meritocracy and has always acted as a community. A mailing
list, an open issue tracker and other communication channels have been in use since its
first release. Adoption by a larger community, such as Apache, is the natural evolution
for Any23. Moreover, the Apache standards will reinforce the existing Any23 community practices
and will be a foundation for the involvement of future committers.
- The historical Any23 team believes in meritocracy and always acted as a community.
- Mailing list, open issue tracker and other communication channels have always been
- adopted since its first release. The adoption in a larger community, such as Apache, 
- is the natural evolution for Any23. Moreover, the Apache standards will enforce the
- existing Any23 community practices and will be a foundation for future committers
- involvement.
  
  == Core Developers ==
  In alphabetical order:
@@ -66, +30 @@

   * Tommaso Teofili <tommaso at apache dot org>
  
  == Alignment ==
+ The main aim of the project is to develop and maintain a full-featured semantic markup distiller
that can be used by other Apache projects that need an RDF extraction tool. The Any23 library
core is written using the following Apache libraries.
- 
- Main aim of the project is to develop and maintain a fully flavored semantic 
- markup distiller that can be used by other Apache projects that need an RDF extraction
- tool. The Any23 library core is written using the following Apache libraries.
  
   * [[http://commons.apache.org/lang/|Apache Commons Lang]]
   * [[http://hc.apache.org/httpclient-3.x/|Apache Commons HTTP Client]]
@@ -78, +39 @@

   * [[http://commons.apache.org/cli/|Apache Commons CLI]]
   * [[http://poi.apache.org/|Apache POI]]
  
- The Any23 service is targeted to run within any compliant Servlet 
+ The Any23 service is designed to run within any compliant Servlet container, such as Tomcat.
- container like Tomcat.
  
  = Known Risks =
  == Orphaned Products ==
+ The increasing number of Any23 adopters and the rising interest in Semantic Web related
technologies lead us to believe that there is minimal risk of this work being abandoned
by the community. Moreover, Any23 has already been used in production by Sindice.com and
other DERI projects for years.
- The increasing number of Any23 adopters and the raising interest for Semantic Web related
- technologies let use believe that there is a minimal risk for this work to being abandoned

- from the community. Moreover Any23 has been already used in production by Sindice.com and

- other DERI projects since years.
  
  == Inexperience with Open Source ==
  All of the committers have experience working in one or more open source projects inside
and outside ASF.
  
  == Homogeneous Developers ==
+ The initial committers are geographically distributed across Europe, with no one
company being associated with a majority of the developers. Many of these initial developers
are already experienced Apache committers, and all are experienced with working in distributed
development communities.
- The list of initial committers are geographically distributed across 
- the Europe with no one company being associated with a majority of the developers. 
- Many of these initial developers are experienced Apache committers already 
- and all are experienced with working in distributed development communities.
  
  == Reliance on Salaried Developers ==
+ To the best of our knowledge, most of the initial committers are being paid to
develop code for this project because Any23 is used in their organizations' infrastructures.
In any case, some of the core historical developers (some of them no longer paid by
the original companies behind Any23) are still committing even though Any23 is not used in
their current organizations. Any23 has already proven its ability to attract external developers.
- To the best of our knowledge, the biggest part of the initial committers is being paid to
develop code for this project due to
- the adoption of Any23 in their organizations infrastructures.
- In any case, some of the core historical developers (some of them no longer getting paid
from the original companies behind Any23) 
- are still committing even if Any23 is not employed in their actual organizations. Any23
has already proven its capability to
- attract external developers.
  
  == Relationships with Other Apache Products ==
+ In recent years, other projects relying on the Semantic Web technology stack, such as
Apache Clerezza, Stanbol and Jena, have entered the ASF incubation process. This could be seen
as evidence of the consolidation and growing adoption of such technologies. Apart from
the specifics of those projects, which share the same underlying stack, Any23 could be employed
in any project needing a reliable framework to access structured semantic markup. The Any23
core could also easily be released as an [[http://wiki.apache.org/nutch/PluginCentral|Apache
Nutch Plugin]] and then used to conveniently populate [[http://www.openrdf.org/doc/sesame2/system/ch05.html|SAIL-compliant]]
triple stores.
- In the last years, other projects have been under ASF incubation process relying on the
Semantic Web technology stack, such as Apache Clerezza, Stanbol and Jena. This could be seen
as a proof of the consolidation and the adoption growing tendency of such technologies.
- Apart the specificity of those projects, sharing the same underlying stack, Any23 could
be employed in every projects needing a reliable
- framework to access structured semantic markup. Any23 core could be easily released also
as a 
- [[http://wiki.apache.org/nutch/PluginCentral|Apache Nutch Plugin]] and then, used to handy
fill [[http://www.openrdf.org/doc/sesame2/system/ch05.html|SAIL-compliant]] triple stores.

  
- == A Excessive Fascination with the Apache Brand ==
+ == An Excessive Fascination with the Apache Brand ==
+ Even though the Any23 community recognizes the power and attractiveness of the ASF brand,
we are well aware of our already established role in the wider Semantic Web developer
community. Any23 has already proved its reliability in closely supporting all the new specifications
coming from the Microformats communities, our major contributors in terms of issues opened
for new feature requests. Furthermore, we are convinced that we can bring new and fresh
energy into the ASF, improve our vision, insight and knowledge
of the other projects and, most importantly, have the opportunity to enlarge our small
community with talented and passionate developers.
- 
- Even if the Any23 community recognizes the power and the attractiveness 
- of the ASF brand, we are absolutely aware of our already established role
- in the wider Semantic Web developers community. Any23 already proved
- its reliability in closely support all the new specifications coming 
- from the Microformats communities, our major contributors in term of 
- opened issues about new feature requests. Furthermore, we are convinced
- that we can enthusiastically bring inside the ASF new and fresh energies
- in order to improve our visions, insights and knowledge about the other 
- projects and, most important, to have the possibility of enlarge our small 
- community with talented and passionate developers. 
  
  = Documentation =
  Any23 Documentation
+ 
   1. [[http://developers.any23.org/|Any23 Project Homepage]]
-  2. [[http://code.google.com/p/any23/|Any23 Developer Homepage]]
+  1. [[http://code.google.com/p/any23/|Any23 Developer Homepage]]
-  3. [[http://any23.org/|Any23 Live Demo]]
+  1. [[http://any23.org/|Any23 Live Demo]]
-  
+ 
  Any23 Related Specifications
+ 
   1. [[http://www.w3.org/RDF/|RDF]]
-  2. [[http://www.w3.org/TR/html5/|HTML5]]
+  1. [[http://www.w3.org/TR/html5/|HTML5]]
-  3. [[http://www.w3.org/TR/rdfa-syntax/|RDFa]]
+  1. [[http://www.w3.org/TR/rdfa-syntax/|RDFa]]
-  4. [[http://www.w3.org/TR/microdata/|Microdata]]
+  1. [[http://www.w3.org/TR/microdata/|Microdata]]
-  5. [[http://microformats.org/|Microformats]]
+  1. [[http://microformats.org/|Microformats]]
-  6. [[http://www.w3.org/TR/rdf-syntax-grammar/|RDF/XML]]
+  1. [[http://www.w3.org/TR/rdf-syntax-grammar/|RDF/XML]]
-  7. [[http://www.w3.org/TeamSubmission/turtle/|Turtle]]
+  1. [[http://www.w3.org/TeamSubmission/turtle/|Turtle]]
-  8. [[http://www.w3.org/TR/rdf-testcases/#ntriples|N-Triples]]
+  1. [[http://www.w3.org/TR/rdf-testcases/#ntriples|N-Triples]]
-  9. [[http://sw.deri.org/2008/07/n-quads/|N-Quads]]
+  1. [[http://sw.deri.org/2008/07/n-quads/|N-Quads]]
  
  Any23 Other documentation
+ 
   1. [[http://www.slideshare.net/dpalmisano/distilling-the-web-of-data-drop-by-drop-with-java|Any23
presentation on Slideshare]]
  
  = Initial Source =
@@ -151, +91 @@

  
  = External Dependencies =
  All the external dependencies used by Any23 (and their licenses) follow:
+ 
   * [[http://nekohtml.sourceforge.net/|Nekohtml]] (Apache 2.0)
   * [[http://www.openrdf.org|OpenRDF Sesame]] (BSD-style license)
   * [[http://jetty.codehaus.org/jetty/|Jetty]] (Apache License 2.0 and Eclipse Public License
1.0)
@@ -200, +141 @@

   * TBD - [[TBD, hopefully Tika]]
  
  = Other interested people (in alphabetical order) =
- 
   * Lewis John McGibbney <lewismc at apache dot org>
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

