Coming late to the conversation... just offering some Lucene perspective.
On Dec 4, 2008, at 1:36 PM, Niels Ott wrote:
> What Lucene cannot do - or at least not without a lot of hacking - is
> aggregating analyses as UIMA can using the CAS. Usually your knowledge
> grows during a UIMA-based NLP pipeline: you add a token annotation,
> a lemma annotation, a POS annotation and so on... In Lucene, you have
> the classical pipeline: the output replaces the input. (Yes, by
> subclassing Lucene's "Token" class, one can work around the issue,
> but it is not elegant at all.)
>
You might find the TeeTokenFilter and SinkTokenizer interesting for
mapping/aggregating tokens/extractions out to other fields in Lucene.
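For instance, something along these lines (an untested sketch against
the 2.4-era TeeTokenFilter/SinkTokenizer API; the field names, reader,
and tokenizer choice are just placeholders):

  import java.io.StringReader;
  import org.apache.lucene.analysis.SinkTokenizer;
  import org.apache.lucene.analysis.TeeTokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  // Tee a copy of every token from the main chain into a sink.
  SinkTokenizer sink = new SinkTokenizer();
  TokenStream main = new TeeTokenFilter(
      new WhitespaceTokenizer(new StringReader("some analyzed text")), sink);

  // Index the original stream and the teed copy as two fields. Note the
  // teed field has to be processed before the sink field so the sink has
  // already been filled by the time it is consumed.
  Document doc = new Document();
  doc.add(new Field("body", main));
  doc.add(new Field("bodyCopy", sink));

So one analysis pass over the text can feed several fields.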
Also, Lucene is getting more flexible in terms of indexing and
searching. You can attach payloads (i.e. byte arrays) to terms, which
provides some crude annotation storage, and https://issues.apache.org/jira/browse/LUCENE-1422
and a couple of other issues are the start of more flexibility for
adding attributes that can then be indexed. We're still working on the
search side of it, but I think you will see more in the way of flexible
indexing in the coming months, which should be a nice win for
UIMA + Lucene users.
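As a very rough illustration of the payload idea (a hypothetical filter
written against the 2.4-era Token/Payload API; the POS lookup is made
up and would really come from your UIMA annotations):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  // Hypothetical filter that stashes a one-byte POS code on each term
  // as a payload, so the "annotation" travels into the index with it.
  public class PosPayloadFilter extends TokenFilter {
    public PosPayloadFilter(TokenStream input) {
      super(input);
    }

    public Token next(Token reusableToken) throws IOException {
      Token t = input.next(reusableToken);
      if (t == null) return null;
      t.setPayload(new Payload(new byte[] { posCodeFor(t.term()) }));
      return t;
    }

    // Stand-in: a real implementation would pull the tag from the CAS.
    private byte posCodeFor(String term) {
      return 0;
    }
  }

At search time the bytes can be read back per position (e.g. via
TermPositions.getPayload()), which is what makes this usable as crude
annotation storage.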
> What makes Lucene + UIMA interesting for me is a simple fact: I can do
> all the NLP I want and be as flexible as I need in UIMA. Then I can
> feed the outcome (or rather: a small part of it) into a Lucene index.
>
> In my particular case, I'm not using a CAS Consumer, but I can imagine
> other people would appreciate one in their application scenarios.
>
> To conclude: Lucene and UIMA aren't competitors, but in some cases
> having one feeding the other is what you want.
Couldn't agree more!
Cheers,
Grant