lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
Date Tue, 26 Oct 2010 14:04:19 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924971#action_12924971 ]

Tommaso Teofili commented on SOLR-2129:
---------------------------------------

Hi Grant, I think it would be great to have Mahout classifiers inside Solr :)

I like your suggestion in point 1.
I can replace the current hardcoded mapping mechanism with a simple mapping, defined inside
solrconfig.xml, between the types/features extracted by UIMA and Solr field names.
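
As a rough illustration of the mapping idea, here is a minimal sketch in plain Java with no Solr or UIMA dependencies. The class name, the "type#feature" key format and the example type, feature and field names are all hypothetical; in the actual module the mapping would be read from solrconfig.xml rather than built in code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: this is not code from the SOLR-2129 patch.
class FieldMappingSketch {
    // Key: "<UIMA type name>#<feature name>", value: Solr field name.
    private final Map<String, String> mapping = new LinkedHashMap<>();

    // Registers one (type, feature) -> field entry, as a config loader might.
    void map(String typeName, String featureName, String fieldName) {
        mapping.put(typeName + "#" + featureName, fieldName);
    }

    // Returns the configured field name, or null when the pair is unmapped.
    String fieldFor(String typeName, String featureName) {
        return mapping.get(typeName + "#" + featureName);
    }

    public static void main(String[] args) {
        FieldMappingSketch m = new FieldMappingSketch();
        // Made-up type and field names, for illustration only.
        m.map("org.example.Entity", "name", "entity");
        m.map("org.example.Language", "code", "language");
        System.out.println(m.fieldFor("org.example.Entity", "name"));
    }
}
```

With something like this, changing which field receives a given annotation feature would be a configuration change rather than a code change.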

Another option would be to develop a SolrCASConsumer component in UIMA (similar to Lucas
[1], the Lucene CAS Consumer) that provides full control over how UIMA annotations and features
are mapped to Solr fields, but on the UIMA side ;)

Regarding point 2, the jars are already under contrib/uima/lib, so I can modify the sample
solrconfig.xml by adding the proper <lib> tag.
Thanks for your comments and suggestions.

[1] : https://svn.apache.org/repos/asf/uima/sandbox/trunk/Lucas

> Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-2129
>                 URL: https://issues.apache.org/jira/browse/SOLR-2129
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Robert Muir
>         Attachments: lib-jars.zip, SOLR-2129-asf-headers.patch, SOLR-2129.patch
>
>
> Provide components to enable Apache UIMA automatic metadata extraction to be exploited
> when indexing documents.
> The purpose of this is to get unstructured information "inside" a document and create
> structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while
> indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer
> and a hidden Markov model tagger), named entities, language, suggested category, keywords
> and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation
> can easily be extended by adding or selecting different UIMA analysis engines, either taken
> from UIMA repositories on the web or created from scratch.
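
The enrich-while-indexing pattern described in the issue can be sketched roughly as follows. This is a self-contained illustration in plain Java: the class below is hypothetical and does not use Solr's UpdateRequestProcessor API or any UIMA API, and the "analyzer" is a naive sentence splitter standing in for a real analysis engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the pattern; these are not Solr or UIMA types.
class EnrichingProcessorSketch {
    private final String textField;
    // Stand-in for an analysis engine: text in, extracted values per field out.
    private final Function<String, Map<String, List<String>>> analyzer;

    EnrichingProcessorSketch(String textField,
                             Function<String, Map<String, List<String>>> analyzer) {
        this.textField = textField;
        this.analyzer = analyzer;
    }

    // Returns a copy of the document with the extracted values added as fields.
    Map<String, List<String>> process(Map<String, List<String>> doc) {
        Map<String, List<String>> enriched = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            enriched.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        List<String> texts = doc.get(textField);
        if (texts == null) {
            return enriched; // nothing to analyze
        }
        for (String text : texts) {
            for (Map.Entry<String, List<String>> e : analyzer.apply(text).entrySet()) {
                enriched.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                        .addAll(e.getValue());
            }
        }
        return enriched;
    }

    // Toy analyzer: splits on whitespace that follows '.', '!' or '?'.
    static Map<String, List<String>> naiveSentences(String text) {
        List<String> sentences = Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
        return Collections.singletonMap("sentence", sentences);
    }
}
```

In a real chain, a step like this would run before the document is handed on to be indexed, so the extracted fields are stored alongside the original ones.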

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

