uima-user mailing list archives

From Christof Mueller <muel...@tk.informatik.tu-darmstadt.de>
Subject Re: Lucene cas consumer
Date Sat, 06 Dec 2008 02:29:54 GMT
Jörn Kottmann wrote:
>> The "problem", which is also UIMA's power, is that everyone has their
>> own type system.
>> To produce a Lucene document, one extracts information from some
>> features and applies the right analyzer. In my case I use maybe only
>> 10% of the annotations produced by the analysis pipeline to produce a
>> single Lucene doc.
>> So we need a highly configurable component that can map only certain
>> declared features, apply the right analyzer, and so on.
>> Many ways are possible:
>> - completely programmatic: the indexer is abstract and should be
>> extended to implement the right mapping for a specialized type system
>> and pipeline
>> - configurable: mapping rules are defined in a descriptor file; the
>> JENA component followed this approach
> I prefer mapping rules in the descriptor. These rules have to be
> adjusted by many users to make them compatible with
> their type system. Hard coding the mapping rules makes
> this task more difficult.
> As far as I know, this approach was also chosen by the
> regex annotator in the sandbox.

Another approach would be to use an additional annotator for mapping
type systems. The annotator would take tokens, stems, named entities, or
whatever you want to index and map them onto annotations of a certain
type, e.g., IndexTerm, which would be indexed by the consumer. During
the mapping process, the annotator could also perform some kind of
filtering by taking part-of-speech or stop word annotations into account.
Keeping the mapping and filtering separate from the indexing process
would make it easier to switch to a different search engine framework.
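
To make the idea concrete, here is a rough sketch of the mapping and
filtering step in plain Java. It deliberately uses no UIMA or Lucene
classes: Token, IndexTerm, and IndexTermMapper are hypothetical
stand-ins for the real annotation types and the annotator, and the
choice to index the stem is only an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java stand-in for the mapping annotator; no UIMA classes are used. */
class IndexTermMapper {

    /** Hypothetical source annotation: a token with its stem and POS tag. */
    static final class Token {
        final String text;
        final String stem;
        final String pos;

        Token(String text, String stem, String pos) {
            this.text = text;
            this.stem = stem;
            this.pos = pos;
        }
    }

    /** Hypothetical target type, the only one the index consumer reads. */
    static final class IndexTerm {
        final String value;

        IndexTerm(String value) {
            this.value = value;
        }
    }

    private final Set<String> keptPosTags;
    private final Set<String> stopWords;

    IndexTermMapper(Set<String> keptPosTags, Set<String> stopWords) {
        this.keptPosTags = keptPosTags;
        this.stopWords = stopWords;
    }

    /** Maps tokens to IndexTerms, skipping stop words and unwanted POS tags. */
    List<IndexTerm> map(List<Token> tokens) {
        List<IndexTerm> terms = new ArrayList<>();
        for (Token token : tokens) {
            boolean isStopWord = stopWords.contains(token.text.toLowerCase(Locale.ROOT));
            if (isStopWord || !keptPosTags.contains(token.pos)) {
                continue; // filtered out before the indexer ever sees it
            }
            terms.add(new IndexTerm(token.stem)); // index the stem, not the surface form
        }
        return terms;
    }
}
```

The consumer would then only need to know about IndexTerm, so swapping
the search engine would leave the mapper untouched.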

The disadvantage is that users need to write their own annotator for
doing the mapping and filtering. So maybe this approach could be
combined with Jörn's suggestion of using mapping rules in the descriptor
in a similar way as the regex annotator does. I think if you make the
mapping and filtering process in the annotator configurable in a way
that the user does not have to write any code, you would get a component
that could be quite useful not only for creating search engine indexes,
but for other tasks as well.
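
A minimal sketch of what a rule-driven variant might look like, again
in plain Java without UIMA classes. The rule syntax
"Type:feature -> field" is invented here purely for illustration (it is
not the regex annotator's format), and RuleBasedMapper, Rule, and
Annotation are hypothetical names. In a real component the rule strings
would presumably come from a multi-valued parameter in the descriptor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Applies declarative mapping rules so that users write no mapping code. */
class RuleBasedMapper {

    /** One rule: copy a feature of a source type into a named index field. */
    static final class Rule {
        final String sourceType;
        final String feature;
        final String field;

        Rule(String sourceType, String feature, String field) {
            this.sourceType = sourceType;
            this.feature = feature;
            this.field = field;
        }

        /** Parses the hypothetical syntax "Type:feature -> field". */
        static Rule parse(String spec) {
            String[] sides = spec.split("->");
            int colon = sides[0].indexOf(':');
            if (sides.length != 2 || colon < 0) {
                throw new IllegalArgumentException("Malformed mapping rule: " + spec);
            }
            return new Rule(sides[0].substring(0, colon).trim(),
                            sides[0].substring(colon + 1).trim(),
                            sides[1].trim());
        }
    }

    /** Hypothetical annotation: a type name plus feature name/value pairs. */
    static final class Annotation {
        final String type;
        final Map<String, String> features;

        Annotation(String type, Map<String, String> features) {
            this.type = type;
            this.features = features;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RuleBasedMapper(List<String> ruleSpecs) {
        for (String spec : ruleSpecs) {
            rules.add(Rule.parse(spec));
        }
    }

    /** Returns index field -> values for every annotation a rule selects. */
    Map<String, List<String>> map(List<Annotation> annotations) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Annotation annotation : annotations) {
            for (Rule rule : rules) {
                String value = annotation.features.get(rule.feature);
                if (rule.sourceType.equals(annotation.type) && value != null) {
                    fields.computeIfAbsent(rule.field, k -> new ArrayList<>()).add(value);
                }
            }
        }
        return fields; // annotations matched by no rule are simply ignored
    }
}
```

Adapting the component to a different type system would then mean
editing the rule strings rather than the Java code.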


Christof Müller
Technische Universität Darmstadt
