manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Contributing OpenNLP connector
Date Wed, 18 Nov 2015 15:51:35 GMT
There's another problem with:

String textContent = new String(bytes);

Specifically, (1) its result depends on the default charset (and so the locale) of the machine
it's being run on, and (2) there's no limit to the amount of memory that
this could conceivably require.  Both are problems.  If you could use a
stream you would be much better off.

Karl
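
[A minimal sketch of the stream-based approach suggested above, not code from the connector. It decodes with an explicit charset instead of the platform default and caps how much text is held in memory. The class name `BoundedTextReader`, the method `readText`, and the `maxChars` limit are hypothetical, and UTF-8 is only an assumption: the real encoding would have to come from an upstream step such as Tika.]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of ManifoldCF or OpenNLP.
class BoundedTextReader {

    /**
     * Reads at most maxChars characters from the stream, decoding with the
     * given charset rather than the platform default.
     */
    static String readText(InputStream in, Charset charset, int maxChars)
            throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int remaining = maxChars;
        while (remaining > 0) {
            // Never ask for more than the remaining budget.
            int n = reader.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n == -1) {
                break; // end of stream
            }
            sb.append(buffer, 0, n);
            remaining -= n;
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "Karl Wright works on ManifoldCF."
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            // Only the first 11 characters are decoded and kept.
            System.out.println(readText(in, StandardCharsets.UTF_8, 11));
        }
    }
}
```

[Truncating at a fixed character count is the simplest way to bound memory, but it can cut a sentence (or a surrogate pair) in half. Feeding the text to the extractor chunk by chunk would avoid that at the cost of more code.]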


On Wed, Nov 18, 2015 at 9:20 AM, Alessandro Benedetti
<abenedetti@apache.org> wrote:

> Hey Chal,
> First of all, thank you very much for the contribution!
> I have some observations:
>
> *Model Downloading*
>
> Looking at the way you provide the user with the models, I can see
> there is a shell script that downloads very specific English models.
> It would be great to be able to configure which model to use in
> the connector config UI.
> In particular I see two possibilities:
> 1) you provide a select list for each required model, and the connector
> then downloads and installs the selected model automatically
> 2) you let the user upload the model he/she wants to use (more flexible,
> but the user will need to obtain a model on his/her own)
> In my opinion it is really important to keep the transformation connector
> flexible, able to work with different languages and models.
>
> *Text enrichment*
> Looking at the code, I see a really strong assumption here:
>
> String textContent = new String(bytes);
>
> This means you assume the only possible input is plain text.
> Actually, as we know, what we have there is binary content, not
> necessarily a plain string.
> I think we need to specify the Tika Transformer as a requirement for
> this connector.
> Furthermore, I would suggest letting the user select the list of input
> fields to be used as the source of the extraction.
>
> e.g.
> I can configure my extraction to use title, text and description.
>
> Of course, a Transformer Connector must run before the
> OpenNLP one to provide those fields.
> These are quick considerations after a first look at the code; happy to
> discuss and help further :)
>
> Cheers
>
>
>
>
> On 18 November 2015 at 13:47, Karl Wright <daddywri@gmail.com> wrote:
>
> > Thanks, Chalitha, for contributing this!
> >
> > I hope to have a look at the code also, but it won't happen until next
> > week, I'm afraid.
> >
> > Karl
> >
> >
> > On Wed, Nov 18, 2015 at 7:44 AM, Rafa Haro <rharoapache@gmail.com> wrote:
> >
> > > Hi Chalitha!
> > >
> > > Awesome! I will take a look at this as soon as possible.
> > >
> > > Cheers,
> > >
> > > Rafa
> > >
> > > On Wed, Nov 18, 2015 at 1:22 PM, chalitha udara Perera
> > > <chalithaudara@gmail.com> wrote:
> > >
> > > > Hi All,
> > > > I have worked on an OpenNLP-based transformation connector for a
> > > > requirement I had. Given a document, it extracts named entities such
> > > > as people, locations and organisations and adds them as metadata to
> > > > the repository document.
> > > > If you think this will be useful for the community, I would like to
> > > > contribute it to ManifoldCF.
> > > > The connector code is available here [1].
> > > > [1] https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
> > > > Thanks,
> > > > Chalitha
> > > > --
> > > > J.M Chalitha Udara Perera
> > > > *Department of Computer Science and Engineering,*
> > > > *University of Moratuwa,*
> > > > *Sri Lanka*
> > >
> >
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
