manifoldcf-dev mailing list archives

From chalitha udara Perera <chalithaud...@gmail.com>
Subject Re: Contributing OpenNLP connector
Date Thu, 19 Nov 2015 07:06:14 GMT
Hi Karl,

I will fix that encoding issue.
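
As a rough sketch of what the fix could look like (not the connector's
actual code): the stream is decoded with an explicit UTF-8 charset instead
of the platform default, and it is read through a Reader with a cap on how
much is buffered, which also touches the memory concern raised in the
quoted thread below. The class name, the method name readUtf8 and the
MAX_CHARS limit are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Utf8Decoding {
    // Hypothetical cap on buffered characters; the right value is an open question.
    static final int MAX_CHARS = 1_000_000;

    // Decodes the stream as UTF-8 and keeps at most maxChars characters.
    static String readUtf8(InputStream in, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        // Explicit charset: the result no longer depends on the machine's default.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while (sb.length() < maxChars
                    && (n = reader.read(buf, 0,
                            Math.min(buf.length, maxChars - sb.length()))) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Non-ASCII sample text, written with escapes so the source stays ASCII.
        byte[] bytes = "Z\u00fcrich caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(bytes), MAX_CHARS);
        System.out.println(text.equals("Z\u00fcrich caf\u00e9"));
    }
}
```

If the whole byte array really has to be converted in one go, the
one-line equivalent is new String(bytes, StandardCharsets.UTF_8). Note
that a hard cap like the one above can cut a surrogate pair in half at the
boundary, so a real implementation would need to handle that case.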

Thanks,
Chalitha

On Thu, Nov 19, 2015 at 12:31 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Chalitha,
>
> My comment was about encoding, not about languages.  If you are assuming
> that the binary document stream is utf-8 (which will be the output of the
> Tika transformer), then you *must* specify utf-8 as the encoding when you
> convert it back to a string.  Otherwise you will have data corruption.
>
> Thanks,
> Karl
>
>
>
>
> On Thu, Nov 19, 2015 at 12:45 AM, chalitha udara Perera <
> chalithaudara@gmail.com> wrote:
>
> > Hi guys,
> >
> > Thank you very much for the comments and suggestions!
> >
> > As Alessandro said, I have assumed that the Tika connector is used before
> > the OpenNLP connector. I think this is a valid assumption, because Tika
> > parses different sources into a common format, so future transformation
> > connectors can benefit from having Tika earlier in the connector chain.
> >
> > Regarding the language issue, I have currently implemented it to work with
> > English-language content. But I agree with Alessandro that the connector
> > can be made to support different languages. OpenNLP currently provides the
> > models listed at [1]. NER models are available for en, es and nl, but it
> > is possible to train models for other languages as well.
> >
> > Tika can be used to detect the language of a document (as far as I know,
> > Stanbol does that). If we assume the Tika connector runs before the
> > OpenNLP connector, we can use the detected language to select the correct
> > model. In that case we have to download all the NER models and reference
> > them in code.
> > Please give your suggestions on how language support should be included
> > in the OpenNLP connector.
> >
> > Thanks,
> > Chalitha
> >
> > [1] http://opennlp.sourceforge.net/models-1.5/
> >
> > On Wed, Nov 18, 2015 at 9:21 PM, Karl Wright <daddywri@gmail.com> wrote:
> >
> > > There's another problem with:
> > >
> > > String textContent = new String(bytes);
> > >
> > > Specifically, (1) its operation will vary with the locale of the machine
> > > it's being run on, and (2) there's no limit to the amount of memory that
> > > this could conceivably require.  Both are problems.  If you could use a
> > > stream you would be much better off.
> > >
> > > Karl
> > >
> > >
> > > On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti
> > > <abenedetti@apache.org> wrote:
> > >
> > > > Hey Chal,
> > > > First of all, thank you very much for the contribution!
> > > > I have some observations:
> > > >
> > > > *Model Downloading*
> > > >
> > > > Looking at the way you provide the user with the models, I can see
> > > > there is a shell script that downloads very specific English models.
> > > > It would be great to be able to configure the model to use in the
> > > > connector config UI.
> > > > In particular I see two possibilities:
> > > > 1) you provide a select list per required model, and the connector then
> > > > automatically downloads and installs the selected model
> > > > 2) you let the user upload the model he/she wants to use (more
> > > > flexible, but the user will need to download a model on his/her own)
> > > > In my opinion it is really important to keep the transformation
> > > > connector flexible, able to work with different languages and models.
> > > >
> > > > *Text enrichment*
> > > > Looking at the code, I see a really strong assumption here:
> > > >
> > > > String textContent = new String(bytes);
> > > >
> > > > This means you assume the only possible input is plain text.
> > > > Actually, as we know, we have the binary there, not necessarily a plain
> > > > string.
> > > > I think we need to specify the Tika Transformer as a requirement for
> > > > this connector.
> > > > Furthermore, I would suggest letting the user select the list of input
> > > > fields to be considered as the source of the extraction.
> > > >
> > > > e.g.
> > > > I can configure my extraction to happen from title, text and
> > > > description.
> > > >
> > > > Of course, a Transformer Connector is required before the OpenNLP
> > > > one, to provide those fields.
> > > > These are quick considerations after a first look at the code; happy to
> > > > discuss and help further :)
> > > >
> > > > Cheers
> > > >
> > > >
> > > >
> > > >
> > > > On 18 November 2015 at 13:47, Karl Wright <daddywri@gmail.com> wrote:
> > > >
> > > > > Thanks, Chalitha, for contributing this!
> > > > >
> > > > > I hope to have a look at the code also, but it won't happen until next
> > > > > week I'm afraid.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <rharoapache@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Chalitha!
> > > > > >
> > > > > > Awesome! I will take a look at this as soon as possible.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Rafa
> > > > > >
> > > > > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > > > > <chalithaudara@gmail.com> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > > I have worked on an OpenNLP-based transformation connector for
> > > > > > > some requirement. Given a document, it extracts named entities
> > > > > > > such as people, locations and organisations and adds them as
> > > > > > > metadata to the repository document.
> > > > > > > If you think this will be useful for the community, I would like
> > > > > > > to contribute it to ManifoldCF.
> > > > > > > Connector code is available here [1].
> > > > > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > > > > Thanks,
> > > > > > > Chalitha
> > > > > > > --
> > > > > > > J.M Chalitha Udara Perera
> > > > > > > *Department of Computer Science and Engineering,*
> > > > > > > *University of Moratuwa,*
> > > > > > > *Sri Lanka*
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > --------------------------
> > > >
> > > > Benedetti Alessandro
> > > > Visiting card : http://about.me/alessandro_benedetti
> > > >
> > > > "Tyger, tyger burning bright
> > > > In the forests of the night,
> > > > What immortal hand or eye
> > > > Could frame thy fearful symmetry?"
> > > >
> > > > William Blake - Songs of Experience -1794 England
> > > >
> > >
> >
> >
> >
> > --
> > J.M Chalitha Udara Perera
> >
> > *Department of Computer Science and Engineering,*
> > *University of Moratuwa,*
> > *Sri Lanka*
> >
>



-- 
J.M Chalitha Udara Perera

*Department of Computer Science and Engineering,*
*University of Moratuwa,*
*Sri Lanka*
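
On the language-support question raised earlier in the thread, one
possible shape for "use the detected language to select the model" is a
plain lookup from language code to model file, with an empty result when
no model is known. This is only a sketch: the class and method names are
hypothetical, the file names are assumed to follow the naming used on the
OpenNLP models page cited in the thread, and actually loading the model
and wiring in Tika's language detection are left out.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class NerModelSelector {
    // Assumed file names for the person-name NER models (en, es, nl).
    private static final Map<String, String> PERSON_MODELS = Map.of(
            "en", "en-ner-person.bin",
            "es", "es-ner-person.bin",
            "nl", "nl-ner-person.bin");

    // Returns the model file for a language code, or empty if none is known.
    static Optional<String> modelFor(String languageCode) {
        if (languageCode == null) {
            return Optional.empty();
        }
        String key = languageCode.toLowerCase(Locale.ROOT);
        int dash = key.indexOf('-');
        if (dash > 0) {
            key = key.substring(0, dash); // "en-US" is reduced to "en"
        }
        return Optional.ofNullable(PERSON_MODELS.get(key));
    }

    public static void main(String[] args) {
        System.out.println(modelFor("en-US").isPresent());
        System.out.println(modelFor("fr").isPresent());
    }
}
```

Returning an empty Optional leaves the policy for unsupported languages
(skip the document, fall back to English, or fail) to the caller, which
seems to be the part that still needs a decision.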
