lucene-dev mailing list archives

From "Benson Margulies (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module
Date Tue, 12 Nov 2013 11:35:18 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820031#comment-13820031 ]

Benson Margulies commented on LUCENE-2899:
------------------------------------------

I know of an NER model that looks at the entire text to bias towards consistent tagging of
entities in larger units. However, I agree that crocks are bad. Perhaps this is an opportunity
to think about how to expand the analysis protocol to support this sort of thing more smoothly?

It would be desirable if this integration were to start with a set of Token Attributes that
could be used in any number of analysis components, inside or outside of Lucene, that were
in a position to deliver similar items. I suppose I'm late to ask for this, as the UIMA component
must pose the same question.

In some languages, NER is very clumsy as a token filter, because entities don't obey token
boundaries very well. Also, in my experience, entities aren't useful as additional tokens
in the same field as their source text, but rather in their own field (where they can be faceted
upon, for example). Is there any appetite to look at Lucene support for a stream that delivers
to more than one field? Or is there such a thing and I've missed it?
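
To make the token-boundary point concrete, here is a minimal, hypothetical Java sketch. It is
not Lucene or OpenNLP API and not code from the attached patches. It assumes entity mentions
arrive as character-offset spans, and it routes them to a separate "entities" field while the
source tokens stay in "body". The names EntityFieldSketch, EntitySpan and route, the whitespace
tokenizer, the field names, and the sample sentence are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Lucene or OpenNLP API.
class EntityFieldSketch {

    // An entity mention addressed by character offsets (end is exclusive),
    // so it does not have to line up with token boundaries.
    static final class EntitySpan {
        final int start;
        final int end;
        final String type;

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Returns field name -> values: source tokens in "body", entities in "entities".
    static Map<String, List<String>> route(String text, List<EntitySpan> spans) {
        List<String> body = new ArrayList<>();
        // Whitespace splitting stands in for a real tokenizer (an assumption).
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                body.add(token);
            }
        }
        List<String> entities = new ArrayList<>();
        for (EntitySpan span : spans) {
            // Each entity becomes one value of its own field, prefixed with its type.
            entities.add(span.type + ":" + text.substring(span.start, span.end));
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("body", body);
        fields.put("entities", entities);
        return fields;
    }

    public static void main(String[] args) {
        String text = "Acme opened a New York-based office";
        List<EntitySpan> spans = Arrays.asList(
                new EntitySpan(0, 4, "ORG"),
                new EntitySpan(14, 22, "LOCATION"));
        System.out.println(route(text, spans));
    }
}
```

With whitespace tokenization, "New York" ends inside the token "York-based", so it cannot be
expressed as a run of whole tokens, yet it is a clean value in a field of its own.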

I agree with Rob about UIMA because I think that Lucene analysis attributes are a weak data
model for interconnecting NLP modules and flowing data through them -- and one frequently
needs to do that.



> Add OpenNLP Analysis capabilities as a module
> ---------------------------------------------
>
>                 Key: LUCENE-2899
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2899
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.6
>
>         Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a
> submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I
> have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have
>   to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads
> (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.1#6144)