Jörn Kottmann wrote:
>> The "problem", which is also UIMA's strength, is that everyone has
>> their own type system.
>> To produce a Lucene document, one extracts information from some
>> features, applying the right analyzer. In my case I use maybe only 10%
>> of the annotations produced by the analysis pipeline to produce a
>> single Lucene doc.
>> So we need a highly configurable component, able to map only
>> certain declared features, apply the right analyzer, and so on.
>> Many ways are possible:
>> - completely programmatic: the indexer is abstract and should be
>>   extended to implement the right mapping for a specialized type system
>>   and pipeline
>> - configurable: mapping rules are defined in a descriptor file; the
>>   JENA component followed this way
>
> I prefer mapping rules in the descriptor. These rules have to be
> adjusted by many users to make them compatible with
> their type system. Hard-coding the mapping rules makes
> this task more difficult.
>
> As far as I know, this approach was also chosen by the
> regex annotator in the sandbox.

Another approach would be to use an additional annotator for mapping
type systems. The annotator would take tokens, stems, named entities, or
whatever you want to index, and map them onto annotations of a certain
type, e.g., IndexTerm, which would then be indexed by the consumer. During
the mapping process, the annotator could also perform some filtering
by taking part-of-speech or stop word annotations into account.

Keeping the mapping and filtering separate from the indexing process
would make it easier to switch to a different search engine framework.
The disadvantage is that users need to write their own annotator for
the mapping and filtering. So maybe this approach could be combined
with Jörn's suggestion of putting the mapping rules in the descriptor,
similar to what the regex annotator does.

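To make the idea of a rule-driven mapping step a bit more concrete, here
is a rough sketch in plain Python. It does not use the UIMA or Lucene
APIs; Annotation, MappingRule, and map_to_index_terms are hypothetical
names made up for illustration, and only the IndexTerm type name comes
from the proposal above. The rules stand in for what a descriptor might
declare: which source type and feature to read, which index field to
target, and which feature values to filter out.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-in for a CAS annotation (not the UIMA API)."""
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class MappingRule:
    """Hypothetical rule, as it might be declared in a descriptor."""
    source_type: str   # annotation type to read, e.g. "Token"
    feature: str       # feature whose value becomes the index term
    field_name: str    # target field in the search index
    # feature name -> set of values that cause the annotation to be dropped
    exclude: dict = field(default_factory=dict)


def map_to_index_terms(annotations, rules):
    """Map source annotations onto IndexTerm annotations, applying filters."""
    terms = []
    for ann in annotations:
        for rule in rules:
            if ann.type != rule.source_type or rule.feature not in ann.features:
                continue
            # Filtering step: skip annotations with a rejected feature value.
            if any(ann.features.get(name) in rejected
                   for name, rejected in rule.exclude.items()):
                continue
            terms.append(Annotation(
                "IndexTerm", ann.begin, ann.end,
                {"field": rule.field_name, "value": ann.features[rule.feature]},
            ))
    return terms
```

With a rule such as MappingRule("Token", "stem", "body", {"pos": {"DT"}}),
determiners would be dropped and the stems of all other tokens would become
IndexTerm annotations for the "body" field. The consumer would then only
need to know about IndexTerm, not about the user's type system.
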
I think if you make the mapping and filtering process in the annotator
configurable so that the user does not have to write any code,
you would get a component that could be quite useful not only for
creating search engine indexes, but for other tasks as well.

Christof

--
Christof Müller
UKP Lab
Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de