lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
Date Wed, 27 Oct 2010 13:52:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925371#action_12925371 ]

Tommaso Teofili commented on SOLR-2129:
---------------------------------------

bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

ok, I added the <lib> tag and will commit a new patch when I'm finished with these changes

bq. I've been struggling with these kinds of questions a lot lately. That is, the marriage
of two projects. Where should the code go? Setting up another ASF project is a pain in the
amount of hoops to jump through. Apache Labs doesn't cut it for a number of reasons. Hosting
on Github or Google Code is OK, but loses the ASF community aspect. Sigh.

I agree with your point; I don't think it's easy to come up with a good, general answer
for such situations.

One general solution that comes to mind is a single, wide-purpose ASF project that hosts
integrations between many different ASF projects. That could provide a base for two projects
that want to "marry", but it could be too general and hard to maintain from a community
point of view (e.g., should all Lucene committers also commit to the "integrations" project
just because someone integrated Lucene with UIMA?). Another option could be to require the
two marrying projects to adhere to a standard (e.g., CMIS), so that a specialized "connector"
would no longer be needed, but I don't think that's always possible since it could require
a huge effort.

In this particular case, in my opinion, the code should go into whichever project owns the
"pipeline" being changed or enhanced. In this Solr-UIMA integration we're adding a step to
the Solr indexing process via an UpdateRequestProcessor, so I think that should be part of
the Solr codebase. With the SolrCASConsumer, on the other hand, we'd be adding a (final)
Consumer to the UIMA pipeline, so that should be part of the UIMA codebase.
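
As a rough illustration of the "added step in the indexing process" idea above, here is a
minimal sketch in plain Java. All names in it (Processor, Analyzer, EnrichingProcessor,
sentence_count) are hypothetical stand-ins, not the actual Solr or UIMA classes and not the
code in the attached patch; the toy analyzer just counts sentence-ending punctuation where
a real one would run a UIMA analysis engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EnrichmentChainSketch {

    /** Hypothetical stand-in for one step in an indexing chain (NOT Solr's class). */
    static abstract class Processor {
        private final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        /** Default behaviour: hand the document to the next step, if any. */
        void processAdd(Map<String, Object> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** Hypothetical stand-in for an analysis engine that derives metadata from text. */
    interface Analyzer {
        Map<String, Object> analyze(String text);
    }

    /** The added step: enrich the document with extracted fields, then pass it on. */
    static class EnrichingProcessor extends Processor {
        private final Analyzer analyzer;
        private final String sourceField;

        EnrichingProcessor(Analyzer analyzer, String sourceField, Processor next) {
            super(next);
            this.analyzer = analyzer;
            this.sourceField = sourceField;
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            Object text = doc.get(sourceField);
            if (text != null) {
                doc.putAll(analyzer.analyze(text.toString()));
            }
            super.processAdd(doc);
        }
    }

    /** Final step: stands in for the actual index write by collecting documents. */
    static class CollectingProcessor extends Processor {
        final List<Map<String, Object>> indexed = new ArrayList<>();

        CollectingProcessor() {
            super(null);
        }

        @Override
        void processAdd(Map<String, Object> doc) {
            indexed.add(doc);
        }
    }

    /** Builds the two-step chain, sends one document through it, returns the result. */
    static Map<String, Object> enrich(String text) {
        // Toy analyzer (assumption): counts '.', '!' and '?' as sentence ends.
        Analyzer toy = input -> {
            int count = 0;
            for (char c : input.toCharArray()) {
                if (c == '.' || c == '!' || c == '?') {
                    count++;
                }
            }
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("sentence_count", count);
            return fields;
        };

        CollectingProcessor sink = new CollectingProcessor();
        Processor chain = new EnrichingProcessor(toy, "text", sink);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text", text);
        chain.processAdd(doc);
        return sink.indexed.get(0);
    }

    public static void main(String[] args) {
        System.out.println(enrich("Solr meets UIMA. It works."));
    }
}
```

The point of the sketch is only that the enrichment lives in the chain of the project whose
pipeline it extends: the enriching step knows nothing about what comes after it, so it can
be added or removed without touching the rest of the indexing process.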


> Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-2129
>                 URL: https://issues.apache.org/jira/browse/SOLR-2129
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Robert Muir
>         Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be exploited
> when indexing documents.
> The purpose of this is to get unstructured information "inside" a document and create
> structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while
> indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer
> and a hidden Markov model tagger), named entities, language, suggested category, keywords
> and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation
> can be easily extended by adding or selecting different UIMA analysis engines, either from
> UIMA repositories on the web or by creating new ones from scratch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

