Return-Path: Delivered-To: apmail-incubator-general-archive@www.apache.org Received: (qmail 70359 invoked from network); 6 Jan 2010 19:42:14 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 6 Jan 2010 19:42:14 -0000 Received: (qmail 82963 invoked by uid 500); 6 Jan 2010 19:42:13 -0000 Delivered-To: apmail-incubator-general-archive@incubator.apache.org Received: (qmail 82768 invoked by uid 500); 6 Jan 2010 19:42:13 -0000 Mailing-List: contact general-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: general@incubator.apache.org Delivered-To: mailing list general@incubator.apache.org Delivered-To: moderator for general@incubator.apache.org Received: (qmail 31699 invoked by uid 99); 6 Jan 2010 18:57:50 -0000 X-ASF-Spam-Status: No, hits=0.8 required=10.0 tests=FUZZY_MERIDIA,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of gsiasf@gmail.com designates 209.85.210.193 as permitted sender) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:from:content-type :content-transfer-encoding:subject:date:message-id:to:mime-version :x-mailer; bh=Io+hzhc+fUxN9BYe9zaH4nZbNRnuazHefG2JEJIkevI=; b=QDBWz0Ns9BSMiLr78XurptsVfHVlylqmefYrkSgRJkp/nj/0LcFK0n1nO4MD9Jhi4t 9ACsaS1QhKOXPg3RpRk917CyPXKzZSpc99kMaD5N5bO+2J2l7av/cuVGTckABZ+XZYxY IMuCAEwwiZViaiCJUWVxBeU2xrdGx6dbIvJoY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:from:content-type:content-transfer-encoding:subject:date :message-id:to:mime-version:x-mailer; b=PZxOAEBJhaSbUbh9f54HUmMYyyDvqawLVvy3El+k+Ptlk13zOzp9ttQACbseVRH+gB /HjOPQNmea+SDRTXMxMu/HvXL4oKD21kTCgLemezgZ/NOgRcrDMmiBbvbX5AQkJ0mxKO fNvI97AzQe+whXhpE4oXujdYcG7cczGF99pp8= Sender: Grant Ingersoll From: Grant Ingersoll Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Subject: [PROPOSAL] Lucene Connector Framework Date: Wed, 6 Jan 2010 13:57:14 -0500 Message-Id: <2E4F5A2E-B77A-4E60-A7E8-46BABA6D145C@apache.org> To: general@incubator.apache.org Mime-Version: 1.0 (Apple Message framework v1077) X-Mailer: Apple Mail (2.1077) X-Virus-Checked: Checked by ClamAV on apache.org Hi, On behalf of the Lucene PMC, I'd like to propose incubation for a new = Lucene subproject called the Lucene Connector Framework. I think we = have all the necessary bits in place for the proposal to go forward, but = would appreciate any discussion/questions. Thanks, Grant Ingersoll Proposal: = http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal ------ Wiki Text Copied Below ----- Lucene Connector Framework Abstract Many, many search engines, as well as other applications, have a need to = connect with content repositories (SharePoint, CMS, Documentum, etc.) in = a standard manner. The Lucene Connector Framework (LCF) is a project = aimed at building out these connectors in open source under the Apache = brand. Proposal The goal of LCF is to create a viable Lucene subproject aimed at = delivering a best of breed connector framework under the Apache Lucene = name. As a framework, the project will not only provide a way to connect = to individual repositories, but also a mechanism for plugging in new = connectors or custom connectors in a straightforward manner. A connector framework is vital for search engines and other tools that = need to access data located in corporate repositories. By abstracting = the problem into a framework, applications can code to a set of = well-defined interfaces instead of having to use a different interface = for each connector. Connector Framework is an extendible incremental crawler, which uses a = database to manage configuration and crawl history, and provides = reasonably high performance in accessing content in multiple = repositories for the main purpose of search engine indexing. Connector = Framework also establishes a repository-specific security model which = can be used to limit search user access to repository content based on a = user's identity. Connector Framework also includes existing connectors = and authorities for: =95 File system =95 Windows shares =95 JDBC-supported databases =95 RSS feeds =95 General websites =95 LiveLink [from OpenText] =95 Documentum [from EMC] =95 SharePoint [from Microsoft] =95 Meridio [from Meridio] =95 Memex [from Memex] =95 FileNet [from IBM] Key design points for Connector Framework are as follows: =95 Extendability - you can add new connectors for new = repositories, and new authorities for specific repository security = models =95 Incrementality - the ability to process only what changed = between crawls, in a repository-specific manner =95 Restartability - using a database with ACID properties to = insure that crawls are safe against process interruption or machine = shutdown =95 Security - establishing a model of security tokens that = allows a search engine to enforce a repository's security model =95 Limited footprint - ability to operate reliably within a = fixed amount of process memory, regardless of configuration =95 Performance - management of connector-specific resources to = maximize overall thoughput =95 Transparency - ability to generate reports on the activity = of all crawls and repository connections Background MetaCarta originally approached Grant Ingersoll from the Lucene PMC = about donating their existing connector framework to the Lucene PMC. = After some discussion about accepting it as a software grant, the PMC = decided it would be best to incubate the project first. Rationale The Connector Framework fills an often significant gap in the Lucene = experience, namely, how to get content locked away in a content = repository into Lucene/Solr/Nutch/Mahout/Tika. Naturally, many other = tools (search engines and others) will also have this same problem. A = Connector Framework would also be useful for someone wishing to migrate = between content repositories, too. Current Status Connector Framework has been under development and in use in the field = for close to five years, deployed on a MetaCarta search appliance. = Almost all development of the project has been done by Karl Wright ( = kwright@metacarta.com ). Some individual connectors were developed = initially by contractors hired by MetaCarta, Inc., but maintenance and = further development is currently handled by the MetaCarta team. Development of Connector Framework can therefore be viewed as core = framework development, plus development of individual connectors. Core = framework development is currently not a terribly collaborative process, = as there are no maintainers of the core functionality other than Mr. = Wright. Development of new connectors has been done in the past in a = much more collaborative way by supplying a developer with a "development = kit", and then integrating the resulting connector (with whatever = changes might have been necessary) into the source tree. Reasonable efforts have been made to maintain the generality of the code = base during the time that MetaCarta has owned it. Nevertheless, certain = MetaCarta-specific changes have been made which may require review and = modification. The following areas probably need to be addressed in the = code before graduation can occur: =95 Branding. The UI brands it as a MetaCarta project. =95 Package names. Package names would have to be changed. =95 How Connector Framework handles document delivery needs to = be generalized, at least for a single, configurable target output = connector, and perhaps for multiple, independently-configurable targets. = Simple example output connectors need to be written. Work in this = direction is currently underway at MetaCarta and may or may not be = complete at the time of the code handover. =95 Connector Framework-specific dependent package modifications = need to be addressed somehow. For instance, the following projects that = Connector Framework depends upon have been modified, but the = modifications have not been accepted upstream: commons-httpclient NTLMv2 = and NTLM2 support [RSS, Web, SharePoint, Meridio, and Livelink = connectors]; commons-httpclient custom HTTPS protocol factory support = [Web, SharePoint, Meridio, and Livelink connectors]; xerces ability to = handle non-legal RSS feeds [RSS and Web connectors] =95 MetaCarta-specific features, like document templates, are = explicitly handled by the UI and the infrastructure. These features = should be generalized so that they are controlled by the choice of = output connector. =95 Some specific hooks, namely support for configuration change = notification, and for database maintenance notification, may need to be = made more generic. =95 Share Connector has a "fingerprinting" feature, which = prefilters documents based on a document type it surmises using a = document inspection technique. This feature is only viable at the moment = for very basic document types. It should either be removed, or = generalized significantly to be much more flexible. =95 Documentation needs to be fleshed out, including javadoc and = overall usage documents. =95 Tests need to be written and/or ported from MetaCarta's test = suite. Longer term, the project will likely grow into a more distributed = crawler, where multiple machines might well be involved in coordinated = crawling activity. Meritocracy Building the community using a meritocratic approach is very important = to the success of LCF. We know many, many people in the search space = (and otherwise) have either written their own connectors or are in need = of connectors. Thus, we expect a meritocratic community will lead to = widespread participation. Community Our hope is that our existing code, features and capabilities will = attract a large community of both developers and users. We also believe = that other organizations will find this project interesting and = relevant, and contribute resources. The user community of LCF would be similar to that of the other Lucene = projects, and in many cases they would overlap. Core Developers See the initial committer list below. Alignment We expect LCF will align quite well with the existing Lucene community = and will also provide significant value to other ASF and non-ASF = projects as well as many companies and individuals looking to access = their content repositories in a programmatic fashion. Known Risks Orphaned Products The Connector Framework is an important piece of any search engine, = including MetaCarta's, as it provides the primary mechanism for getting = content out of a repository and into the search engine's index. Thus, we = don't expect it will be orphaned anytime soon. Once the project is = established and the code is available, we expect to attract not only = other search companies, but others with similar needs. Inexperience with Open Source Grant Ingersoll, Ryan McKinley and Simon Willnauer provide the majority = of the experience with Open Source at the ASF, but all of the initial = committers are familiar with Open Source and have contributed to other = open source projects. Homogeneous Developers The current list of committers are mostly members of either the = MetaCarta or Lucid Imagination developer team, but several are not. = Additionally, we are actively recruiting other developers. Reliance on Salaried Developers We have a variety of committers represented. Some are being paid to work = on the project and some are not. Cryptography Connector Framework itself has no real cryptography component, although = it does currently obfuscate passwords it saves to the database or to a = configuration file using a proprietary algorithm. The algorithm is = present simply to avoid using cleartext and is not secure in any sense = other than by obscurity. Various connectors, such as Share Connector, Web Connector, RSS = Connector, SharePoint Connector, LiveLink Connector, and Meridio = Connector make use of cryptographic principles via secondary libraries. = Specifically, these connectors support NTLM, NTLMv2, and NTLM2 Session = authentication via commons-httpclient and jCIFS. The changes to = commons-httpclient necessary to support these varieties of Windows = protocols have not yet been accepted upstream by the Apache httpclient = project. It is unknown at this time exactly to what degree the Oracle JDBC = driver, the jtds JDBC driver, or the Postgresql JDBC driver uses = cryptography. Also, the FileNet API class, the Memex API classes, the = OpenText LAPI api classes, and the Documentum DFC classes all may or may = not use cryptography. Legal Concerns Some of the connectors in the existing framework require paid licenses = to use. We will need to evaluate each connector to see what can be = appropriately included. For those connectors that require a paid = license, we will need to determine a plan for including the wrapper code = without the underlying bindings in a legal manner. We expect we can = provide the wrapper code without the binding and that the code will thus = only be compilable by someone who has access to the binding. (This is = what Google has done for their individual connectors). Longer term, we = expect to demonstrate to the companies with proprietary connectors why = it is more valuable for them to open up their specific connector pieces = to give broader access to people looking to leverage their content in = the repository. Trademark The project is being rebranded from a MetaCarta internal name to the = Lucene Connector Framework, which will be an ASF mark. Relationships with Other Apache Products We expect almost all of the Apache Lucene ecosystem will benefit from = having a standard way of connecting to content repositories. = Additionally, users of UIMA should also benefit. We also see an = especially tight connection with Tika, as much of the content in these = types of repositories are "rich" document types which will then need = their content extracted. An Excessive Fascination with the Apache Brand All of us are familiar with the value that Apache brings to a project in = building out a community. We also are all significant users of Apache = Lucene and related tools (Solr, Nutch, Mahout, Tika) and expect a close = relationship with those projects will help significantly grow the LCF = community. Documentation MetaCarta has end-user documentation for Lucene Connector Framework, = which might function as the core the open-source end-user documentation. = The documentation is in LaTeX form, and thus usable sources can readily = be extracted. Research as to any ownership issues for the documentation = as it stands still needs to be examined. The existing java doc of the code, while fairly extensive, needs review = and perhaps augmentation to insure it meets the needs of an ASF project. = Significant attention to maintaining its accuracy was made during = MetaCarta's ownership of the code base. Initial Source All initial sources will be coming from MetaCarta, Inc., with the goal = of folding in changes from others shortly thereafter. Source and Intellectual Property Submission Plan Code IP grants need to be made from MetaCarta, Inc. But, in addition, = several connectors (notably Documentum, LiveLink, Memex, and FileNet) = rely directly on client API's in order to be compiled. Another connector = (JDBC) relies on the existence of the Oracle JDBC Driver in the = classpath in order to enable crawls against Oracle databases. It is unlikely that EMC, OpenText, Memex, or IBM would grant = Apache-license-compatible use of these client libraries. Thus, the = expectation is that users of these connectors obtain the necessary = client libraries from the owners prior to building or using the = corresponding connector. An alternative would be to undertake a = clean-room implementation of the client API's, which may well yield = suitable results in some cases (LiveLink, Memex, FileNet), while being = out of reach in others (Documentum). Conditional compilation, for the = short term, is thus likely to be a necessity. Other external dependencies, such as jCIFS for the Share Connector, are = licensed with LGPL, and thus may need to be treated in a manner similar = to the closed API's even though they are open source. These include the = postgresql JDBC driver, and JTDS. The Lucene Connector Framework core and individual connectors are = completely separable, and many of the connectors require no third party = licenses. Therefore, there is significant utility for this project even = in the absence of any third-party software grants, or clean-room = engineering. The software grant will be faxed to the Apache Software Foundation if = and when the proposal herein described is accepted. MetaCarta patents = are not infringed by this grant. Also, MetaCarta trademarks are not = included in this grant. External Dependencies The project dependencies, other than on other Apache projects, are as = follows: The ConnectorFramework core currently uses the Bitmechanic JDBC pool = driver, which is BSD licensed, and the Postgresql JDBC driver, which is = also BSD licensed. The LiveLink Connector relies on LAPI, which is privately licensed by = OpenText. The Documentum Connector relies on DFC, which is privately = licensed by EMC. The Share Connector relies on jCIFS, which is LGPL. The = Memex Connector relies on privately licensed java libraries from Memex. = The FileNet Connector relies on privately licensed java libraries from = IBM. Required Resources =95 Mailing lists =95 connectors-private (with moderated subscriptions) =95 connectors-user@ =95 connectors-dev@ =95 connectors-commit@ =95 Subversion directory =95 = https://svn.apache.org/repos/asf/incubator/connectors =95 Website =95 Confluence (CONNECTORS) =95 Issue Tracking =95 JIRA (CONNECTORS) Initial Committers Names of initial committers with affiliation and current ASF status: =95 Karl Wright (kwright at metacarta) =95 Josiah Strandberg (jstrandberg at metacarta) =95 Ken Baker (bakerkj at metacarta) =95 Marc Meadows (mam at metacarta) =95 Grant Ingersoll ( gsingers@a.o Lucid Imagination, ASF = Member) =95 Brian Pinkerton (brian.pinkerton at Lucid Imagination) =95 Simon Willnauer (simonw at apache org, Committer on Lucene = Java and Lucene Open Relevance Project) =95 Ryan McKinley (ryan at apache org, Committer on Lucene and = Solr) =95 Robert Muir (rmuir at apache org, Committer on Lucene and = Open Relevance) =95 Sami Siren ( siren@a.o , Committer on Nutch and Tika) =95 Otis Gospodnetic ( otis@a.o , Committer on Lucene, Solr, = Nutch, Mahout, and Open Relevance Project) =95 Shalin Shekhar Mangar ( shalin@a.o , AOL, Committer on = Apache Solr) =95 Noble Paul ( noble@a.o , AOL, Committer on Apache Solr) =95 George Aroush (george at aroush.net, Committer on = Lucene.Net) Sponsors Champion =95 Grant Ingersoll Nominated Mentors =95 Grant Ingersoll =95 Jukka Zitting =95 Gianugo Rabellino Sponsoring Entity =95 Apache Lucene PMC: Message ID: = AF7E5AD4-E208-4866-8D29-6ADE19212F45@gmail.com in private@lucene.a.o --------------------------------------------------------------------- To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org For additional commands, e-mail: general-help@incubator.apache.org