manifoldcf-dev mailing list archives

From chalitha udara Perera <chalithaud...@gmail.com>
Subject Re: Contributing OpenNLP connector
Date Thu, 19 Nov 2015 05:45:42 GMT
Hi guys,

Thank you very much for the comments and suggestions!

As Alessandro said, I have assumed that the Tika connector runs before
the OpenNLP connector.
I think this is a valid assumption because Tika parses different sources
into a common format, so future
transformation connectors can benefit from having Tika earlier in the
connector chain.

Regarding the language issue, I have currently implemented it to work with
English-language content only.
But I agree with Alessandro that the connector can be made to support other
languages. OpenNLP currently
provides the following models [1]. NER models are available for en, es and nl,
and it is possible to train models
for other languages as well.

Tika can be used to detect the language of a document (as far as I know, Stanbol
does that). If we assume that the Tika connector runs before the OpenNLP
connector, we can use the detected language to select the correct model. In this
case we would have to download all the NER models and
reference them in code.
Please give your suggestions on how language support should be included in
the OpenNLP connector.
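
To make the idea concrete, here is a minimal sketch of how a detected language code could be mapped to a model file. It is only an illustration, not the connector's current code: the class name NerModelSelector, the "models" directory, the English fallback and the "<lang>-ner-<type>.bin" naming pattern are all assumptions (the pattern mirrors the file names listed at [1]). It does not call OpenNLP itself; the returned path would then be opened and handed to the OpenNLP model loader.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: picks an NER model file for a detected language.
public class NerModelSelector {

    // Languages for which pre-trained NER models are mentioned in this thread.
    private static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("en", "es", "nl"));

    // Assumption: fall back to English when the language is unknown or unsupported.
    private static final String FALLBACK = "en";

    private final Path modelDir;

    public NerModelSelector(Path modelDir) {
        this.modelDir = modelDir;
    }

    // entityType is e.g. "person", "location" or "organization".
    // Assumed naming pattern: <lang>-ner-<entityType>.bin
    public Path modelFor(String detectedLanguage, String entityType) {
        String lang = (detectedLanguage == null)
                ? FALLBACK
                : detectedLanguage.trim().toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(lang)) {
            lang = FALLBACK;
        }
        return modelDir.resolve(lang + "-ner-" + entityType + ".bin");
    }

    public static void main(String[] args) {
        NerModelSelector selector = new NerModelSelector(Paths.get("models"));
        // Supported language: the Spanish person model is selected.
        System.out.println(selector.modelFor("es", "person"));
        // Unsupported language: falls back to the English model.
        System.out.println(selector.modelFor("fr", "location"));
    }
}
```

Whether an unsupported language should fall back to English or skip extraction altogether is an open design choice; the fallback here is only for illustration.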

Thanks,
Chalitha

[1] http://opennlp.sourceforge.net/models-1.5/

On Wed, Nov 18, 2015 at 9:21 PM, Karl Wright <daddywri@gmail.com> wrote:

> There's another problem with:
>
> String textContent = new String(bytes);
>
> Specifically, (1) its operation will vary with the locale of the machine
> it's being run on, and (2) there's no limit to the amount of memory that
> this could conceivably require.  Both are problems.  If you could use a
> stream you would be much better off.
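
A minimal sketch of the stream-based alternative suggested above, using only the JDK. The class name BoundedTextReader, the UTF-8 charset and the character limit are assumptions for illustration; the connector would need to use whatever encoding the upstream (Tika) output actually has, and pick a limit that suits its memory budget.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a stream with an explicit charset and a size cap.
public class BoundedTextReader {

    // Reads at most maxChars characters from the stream, decoding as UTF-8
    // (assumed) instead of relying on the platform default charset.
    // Note: closing the reader also closes the underlying stream.
    public static String readText(InputStream in, int maxChars) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            while (sb.length() < maxChars) {
                int want = Math.min(buf.length, maxChars - sb.length());
                int n = reader.read(buf, 0, want);
                if (n < 0) {
                    break; // end of stream
                }
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Wrapped only to keep the sketch short; real code may prefer
            // to propagate the checked exception.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "Karl met Rafa in Madrid".getBytes(StandardCharsets.UTF_8);
        // Only the first 8 characters are kept, regardless of input size.
        System.out.println(readText(new ByteArrayInputStream(bytes), 8));
    }
}
```

This bounds memory by truncating the text; processing the document in chunks would be another option if the whole content has to be analysed.
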
>
> Karl
>
>
> On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti <abenedetti@apache.org> wrote:
>
> > Hey Chal,
> > First of all, thank you very much for the contribution!
> > I have some observations :
> >
> > *Model Downloading*
> >
> > Taking a look at the way you provide the user with the models, I can see
> > there is a shell script to download very specific English models.
> > It would be great having the possibility to configure the model to use in
> > the connector config UI .
> > In particular I see two possibilities :
> > 1) you provide a select list per model required and then automatically
> > download the model and install it
> > 2) you provide the user with the possibility of uploading the model he/she
> > wants to use (more flexible, but the user will need to download a model on
> > his own)
> > In my opinion it is really important to keep the transformation connector
> > flexible, able to work with different languages and models.
> >
> > *Text enrichment*
> > Taking a look at the code, I see a really strong assumption here:
> >
> > String textContent = new String(bytes);
> >
> > This means you assume the only input possible is plain text.
> > Actually, as we know, we have the binary there, not necessarily a plain string.
> > I think we need to specify the Tika Transformer to be a requirement for
> > this connector.
> > Furthermore I would suggest the possibility for the user to select the list
> > of input fields to be considered to be the source of the extraction.
> >
> > e.g.
> > I can configure my extraction to happen from title,text and description.
> >
> > Of course, a Transformer Connector is required to run before the
> > OpenNLP one, to provide those fields.
> > These are quick considerations after a first look at the code, happy to
> > discuss and help further :)
> >
> > Cheers
> >
> >
> >
> >
> > On 18 November 2015 at 13:47, Karl Wright <daddywri@gmail.com> wrote:
> >
> > > Thanks, Chalitha, for contributing this!
> > >
> > > I hope to have a look at the code also, but it won't happen until next week
> > > I'm afraid.
> > >
> > > Karl
> > >
> > >
> > > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <rharoapache@gmail.com> wrote:
> > >
> > > > Hi Chalitha!
> > > >
> > > >
> > > >
> > > >
> > > > Awesome!. I will take a look to this as soon as possible.
> > > >
> > > >
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Rafa
> > > >
> > > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > > <chalithaudara@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > > > I have worked on an OpenNLP-based transformation connector for some
> > > > > > requirement. Given a document, it extracts named entities such as people,
> > > > > > locations and organisations and adds those as metadata to the repository
> > > > > > document.
> > > > > > If you think this will be useful for the community, I would like to
> > > > > > contribute it to manifoldcf.
> > > > > Connector code is available here [1].
> > > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > > Thanks,
> > > > > Chalitha
> > > > > --
> > > > > J.M Chalitha Udara Perera
> > > > > *Department of Computer Science and Engineering,*
> > > > > *University of Moratuwa,*
> > > > > *Sri Lanka*
> > > >
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>



-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
