Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Message-ID: <25575584.99451288187540577.JavaMail.jira@thor>
Date: Wed, 27 Oct 2010 09:52:20 -0400 (EDT)
From: "Tommaso Teofili (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Subject: [jira] Commented: (SOLR-2129) Provide a Solr module for dynamic
 metadata extraction/indexing with Apache UIMA
In-Reply-To: <13511697.335581285134934066.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925371#action_12925371 ] 

Tommaso Teofili commented on SOLR-2129:
---------------------------------------

bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

OK, I added the <lib> tag and will commit a new patch when I'm finished with these changes.

bq. I've been struggling with these kinds of questions a lot lately. That is, the marriage of two projects. Where should the code go? Setting up another ASF project is a pain in the amount of hoops to jump through. Apache Labs doesn't cut it for a number of reasons. Hosting on Github or Google Code is OK, but loses the ASF community aspect. Sigh.

I agree with your point; I don't think it's easy to come up with a good, general, final answer for such situations.

What comes to mind as a general solution is establishing a single wide-purpose ASF project that contains integrations between many different ASF projects. This could provide a base for two projects that want to "marry", but it could be too general and may not be easy to maintain from a community point of view (e.g., should all the Lucene committers also commit to the "integrations" project only because someone integrated it with UIMA?). Another option could be to require the two marrying projects to conform to a standard (e.g. CMIS) so that developing a specialized "connector" would no longer be needed, but I don't think that's always possible, since it could require a huge effort.

In this particular case, in my opinion, the code should go into whichever project owns the "pipeline" that is being changed/enhanced. In this Solr-UIMA integration we're adding a step to the Solr indexing process via an UpdateRequestProcessor, so I think it should be part of the Solr codebase. With the SolrCASConsumer, on the other hand, we'd be adding a (final) Consumer to the UIMA pipeline, so that should be part of the UIMA codebase.
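
To illustrate what "adding a step to the indexing process" means, here is a minimal, self-contained sketch of the chain-of-processors idea. It does not use the real Solr or UIMA APIs: Processor, Analyzer, EnrichingProcessor and CollectingProcessor are hypothetical stand-ins invented for this example, and the toy analyzer is only a placeholder for a UIMA analysis engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a processor chain; none of these types are Solr or UIMA classes. */
public class EnrichmentChainSketch {

    /** Hypothetical stand-in for one step in the indexing chain. */
    abstract static class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        /** Default behavior: pass the document on to the next step, if any. */
        void processAdd(Map<String, Object> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** Hypothetical stand-in for an analysis engine that returns extracted metadata. */
    interface Analyzer {
        Map<String, Object> analyze(String text);
    }

    /** The added step: enrich the document with metadata, then hand it on. */
    static class EnrichingProcessor extends Processor {
        private final Analyzer analyzer;
        private final String sourceField;

        EnrichingProcessor(Analyzer analyzer, String sourceField, Processor next) {
            super(next);
            this.analyzer = analyzer;
            this.sourceField = sourceField;
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            Object text = doc.get(sourceField);
            if (text instanceof String) {
                // Copy every extracted value into the document as a new field.
                doc.putAll(analyzer.analyze((String) text));
            }
            super.processAdd(doc);
        }
    }

    /** Final step: collects documents in place of writing them to an index. */
    static class CollectingProcessor extends Processor {
        final List<Map<String, Object>> indexed = new ArrayList<>();

        CollectingProcessor() {
            super(null);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            indexed.add(doc);
        }
    }

    public static void main(String[] args) {
        // Toy analyzer (an assumption for the demo): guesses the language from one word.
        Analyzer toy = text -> {
            Map<String, Object> metadata = new LinkedHashMap<>();
            metadata.put("language", text.contains(" the ") ? "en" : "unknown");
            return metadata;
        };

        CollectingProcessor sink = new CollectingProcessor();
        Processor chain = new EnrichingProcessor(toy, "text", sink);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("text", "Index the document with extra metadata");
        chain.processAdd(doc);

        System.out.println(sink.indexed.get(0));
    }
}
```

The point of the sketch is only that the enrichment step lives inside the indexing chain and is owned by it, which is why I'd place that code on the Solr side.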


> Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-2129
>                 URL: https://issues.apache.org/jira/browse/SOLR-2129
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Robert Muir
>         Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be exploited when indexing documents.
> The purpose of this is to get unstructured information "inside" a document and create structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer and a hidden Markov model tagger), named entities, language, suggested category, keywords and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation can be easily extended by adding or selecting different UIMA analysis engines, either taken from UIMA repositories on the web or created from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

