manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Contributing OpenNLP connector
Date Thu, 19 Nov 2015 07:01:19 GMT
Hi Chalitha,

My comment was about encoding, not about languages.  If you are assuming
that the binary document stream is utf-8 (which will be the output of the
Tika transformer), then you *must* specify utf-8 as the encoding when you
convert it back to a string.  Otherwise you will have data corruption.
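
A minimal sketch of the two points raised in this thread (an explicit charset, and a bounded streaming read instead of `new String(bytes)`), assuming the upstream Tika transformer really does emit UTF-8 text. The class name `Utf8TextReader`, the method names, and the `maxChars` cap are hypothetical illustrations, not taken from the connector code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of ManifoldCF or the contributed connector.
final class Utf8TextReader {

    // Decode with an explicit charset so the result does not depend on the
    // platform default charset of the machine running the code.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Read at most maxChars chars from a stream assumed to be UTF-8, so memory
    // use stays bounded. The caller owns (and closes) the stream. A hard cut at
    // maxChars may split a surrogate pair; a real connector would handle that.
    static String readBounded(InputStream in, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        while (sb.length() < maxChars) {
            int toRead = Math.min(buffer.length, maxChars - sb.length());
            int n = reader.read(buffer, 0, toRead);
            if (n < 0) {
                break; // end of stream
            }
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(bytes));
        System.out.println(readBounded(new ByteArrayInputStream(bytes), 3));
    }
}
```

With the one-argument `new String(bytes)`, the same bytes could decode differently on a machine whose default charset is not UTF-8; passing the charset removes that dependency.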

Thanks,
Karl




On Thu, Nov 19, 2015 at 12:45 AM, chalitha udara Perera <
chalithaudara@gmail.com> wrote:

> Hi guys,
>
> Thank you very much for comments and suggestions !
>
> As Alessandro said, I have assumed that the Tika connector runs before
> the OpenNLP connector.
> I think it is a valid assumption because Tika parses different sources
> into a common format, so future transformation connectors can largely
> benefit from having Tika earlier in the connector chain.
>
> Regarding the language issue, I have currently implemented it to work with
> English-language content only.
> But I agree with Alessandro that the connector can be made to support
> different languages. Currently OpenNLP provides the following models [1].
> NER models are available for en, es and nl, but it is possible to train
> models for other languages as well.
>
> Tika can be used to detect the language of a document (as far as I know,
> Stanbol does that). If we assume the Tika connector runs before the OpenNLP
> connector, we can use the detected language to select the correct model. In
> that case we have to download all the NER models and reference them in code.
> Please give your suggestions on how language support should be included in
> the OpenNLP connector.
>
> Thanks,
> Chalitha
>
> [1] http://opennlp.sourceforge.net/models-1.5/
>
> On Wed, Nov 18, 2015 at 9:21 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> > There's another problem with:
> >
> > String textContent = new String(bytes);
> >
> > Specifically, (1) its result will vary with the default charset (and so
> > the locale) of the machine it's being run on, and (2) there's no limit to
> > the amount of memory that this could conceivably require.  Both are
> > problems.  If you could use a stream you would be much better off.
> >
> > Karl
> >
> >
> > On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti <abenedetti@apache.org> wrote:
> >
> > > Hey Chal,
> > > First of all, thank you very much for the contribution!
> > > I have some observations:
> > >
> > > *Model Downloading*
> > >
> > > Looking at the way you provide the user with the models, I can see
> > > there is a shell script to download very specific English models.
> > > It would be great to be able to configure the model to use in
> > > the connector config UI.
> > > In particular I see two possibilities:
> > > 1) you provide a select list per required model, and then the connector
> > > automatically downloads and installs the selected model
> > > 2) you let the user upload the model he/she wants to use (more flexible,
> > > but the user will need to download a model on his/her own)
> > > In my opinion it is really important to keep the transformation connector
> > > flexible, able to work with different languages and models.
> > >
> > > *Text enrichment*
> > > Looking at the code, I see a really strong assumption here:
> > >
> > > String textContent = new String(bytes);
> > >
> > > This means you assume the only possible input is plain text.
> > > Actually, as we know, we have the binary there, not necessarily a plain
> > > string.
> > > I think we need to make the Tika Transformer a requirement for
> > > this connector.
> > > Furthermore, I would suggest letting the user select the list
> > > of input fields to be used as the source of the extraction.
> > >
> > > e.g.
> > > I can configure my extraction to run on title, text and description.
> > >
> > > Of course, a Transformer Connector must run before the
> > > OpenNLP one to provide those fields.
> > > These are quick considerations after a first look at the code; happy to
> > > discuss and help further :)
> > >
> > > Cheers
> > >
> > >
> > >
> > >
> > > On 18 November 2015 at 13:47, Karl Wright <daddywri@gmail.com> wrote:
> > >
> > > > Thanks, Chalitha, for contributing this!
> > > >
> > > > I hope to have a look at the code also, but it won't happen until next
> > > > week, I'm afraid.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <rharoapache@gmail.com> wrote:
> > > >
> > > > > Hi Chalitha!
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Awesome! I will take a look at this as soon as possible.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Rafa
> > > > >
> > > > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > > > <chalithaudara@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > > I have worked on an OpenNLP-based transformation connector for a
> > > > > > requirement I had. Given a document, it extracts named entities
> > > > > > such as people, locations and organisations and adds them as
> > > > > > metadata to the repository document.
> > > > > > If you think this will be useful for the community, I would like to
> > > > > > contribute it to ManifoldCF.
> > > > > > Connector code is available here [1].
> > > > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > > > Thanks,
> > > > > > Chalitha
> > > > > > --
> > > > > > J.M Chalitha Udara Perera
> > > > > > *Department of Computer Science and Engineering,*
> > > > > > *University of Moratuwa,*
> > > > > > *Sri Lanka*
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --------------------------
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> J.M Chalitha Udara Perera
>
> *Department of Computer Science and Engineering,*
> *University of Moratuwa,*
> *Sri Lanka*
>
