lucene-dev mailing list archives

From "Tommaso Teofili (JIRA)" <>
Subject [jira] Commented: (SOLR-2129) Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
Date Fri, 05 Nov 2010 07:53:42 GMT


Tommaso Teofili commented on SOLR-2129:

bq. Try to reuse the same syntax as the mapping in the ExtractingRequestHandler.

Inside <uimaConfig> the mapping can be defined in several ways.
Say we want to map the feature 'text' of type 'ConceptFS' to the field 'concept'. I
thought of three options, listed here:

1. Exactly the same syntax as ExtractingRequestHandler, though Solr-UIMA is not a RequestHandler
but an UpdateRequestProcessor; could this create confusion?
   <lst name="defaults">
      <str name="">concept</str>
   </lst>

2. Map a feature of a type to a field with a single tag:
    <map field="concept" feature="org.apache.uima.alchemy.ts.categorization.ConceptFS@text"/>

3. Use a more hierarchical and strict structure, which is less immediate to understand but
may be easier for UIMA experts:
    <type name="org.apache.uima.alchemy.ts.categorization.ConceptFS">
      <feature name="text">concept</feature>
    </type>
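
To compare options 2 and 3 concretely, here is a minimal sketch (in Python, for brevity only; the actual module would be Java) showing that both syntaxes can be read into the same (type, feature) -> field mapping. The function names parse_flat and parse_nested are hypothetical, and placing the tags directly under <uimaConfig> is an assumption; nothing here is existing Solr or UIMA API.

```python
import xml.etree.ElementTree as ET

# Assumption: the mapping tags sit directly inside <uimaConfig>.
OPTION_2 = """
<uimaConfig>
  <map field="concept" feature="org.apache.uima.alchemy.ts.categorization.ConceptFS@text"/>
</uimaConfig>
"""

OPTION_3 = """
<uimaConfig>
  <type name="org.apache.uima.alchemy.ts.categorization.ConceptFS">
    <feature name="text">concept</feature>
  </type>
</uimaConfig>
"""


def parse_flat(xml_text):
    """Option 2 (hypothetical reader): one <map> tag per 'Type@feature' -> field."""
    mapping = {}
    for node in ET.fromstring(xml_text).iter("map"):
        # Split on the last '@' so the type name keeps its package dots.
        type_name, _, feature = node.get("feature").rpartition("@")
        mapping[(type_name, feature)] = node.get("field")
    return mapping


def parse_nested(xml_text):
    """Option 3 (hypothetical reader): <feature> tags nested under a <type>."""
    mapping = {}
    for type_node in ET.fromstring(xml_text).iter("type"):
        for feature_node in type_node.iter("feature"):
            key = (type_node.get("name"), feature_node.get("name"))
            mapping[key] = feature_node.text.strip()
    return mapping


if __name__ == "__main__":
    print(parse_flat(OPTION_2))
    print(parse_nested(OPTION_3))
```

Both readers yield the same mapping for the example above, so the choice between the two is mostly about readability: option 2 is shorter for a single feature, while option 3 avoids repeating the type name when several features of one type are mapped.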

What do you think?
Thanks for any advice,

> Provide a Solr module for dynamic metadata extraction/indexing with Apache UIMA
> -------------------------------------------------------------------------------
>                 Key: SOLR-2129
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>            Assignee: Robert Muir
>         Attachments:, SOLR-2129-asf-headers.patch, SOLR-2129.patch
> Provide components to enable Apache UIMA automatic metadata extraction to be exploited
> when indexing documents.
> The purpose of this is to get unstructured information "inside" a document and create
> structured metadata (as fields) to enrich each document.
> Basically this can be done with a custom UpdateRequestProcessor which triggers UIMA while
> indexing documents.
> The basic UIMA implementation of UpdateRequestProcessor extracts sentences (with a tokenizer
> and a hidden Markov model tagger), named entities, language, suggested category, keywords
> and concepts (exploiting external services from OpenCalais and AlchemyAPI). Such an implementation
> can be easily extended by adding or selecting different UIMA analysis engines, either from UIMA
> repositories on the web or by creating new ones from scratch.

