From user-return-5632-archive-asf-public=cust-asf.ponee.io@manifoldcf.apache.org Wed Dec 12 12:17:06 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id 33344180676 for ; Wed, 12 Dec 2018 12:17:05 +0100 (CET) Received: (qmail 37251 invoked by uid 500); 12 Dec 2018 11:17:04 -0000 Mailing-List: contact user-help@manifoldcf.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@manifoldcf.apache.org Delivered-To: mailing list user@manifoldcf.apache.org Received: (qmail 37131 invoked by uid 99); 12 Dec 2018 11:17:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Dec 2018 11:17:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 9A853C830D for ; Wed, 12 Dec 2018 11:17:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.808 X-Spam-Level: * X-Spam-Status: No, score=1.808 tagged_above=-999 required=6.31 tests=[DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, T_MIXED_ES=0.01] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id ezg1UJaiKp_V for ; Wed, 12 Dec 2018 11:17:00 +0000 (UTC) Received: from mail-wr1-f49.google.com (mail-wr1-f49.google.com [209.85.221.49]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 2A7115F56F for ; Wed, 12 
Dec 2018 11:17:00 +0000 (UTC) Received: by mail-wr1-f49.google.com with SMTP id c14so17316238wrr.0 for ; Wed, 12 Dec 2018 03:17:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=MC6lw+fwCCk4FMKPKTdo800DLOYBH9YBPhxtE+yT1CA=; b=cOJo8JOOzjwb2mHfBXjREIS2DczFHDRrdW9I3accBE5MkJjoQSJ0xZrynC4qRQSI3+ n919QqQJTBMWS/3+PveGbmhVmG41QMR4hXdXTdo2BF7EHVplpZFLgPrIms1JgYbYPxs+ Mj6W7LB3JBrRftcgPQJyEM8ywV6CpwzF4v5K80sIvbofkdDXRRpOflkpoyW4Q/TiwPRH BTzEZW4Sp3Q1p6akyGXS0zsXJ8vnxurjUWAt/JbT9DUi1st94OVhbEQrkYxjukMgBRm0 VL5cqqHzHdLPL+kTlNUH49iTHFtMIOFWypSn1yGaYYpNB2sL4jzh/jp7z4yhgpWSPqxY qI2g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=MC6lw+fwCCk4FMKPKTdo800DLOYBH9YBPhxtE+yT1CA=; b=AL9E97jLMwRFtQM/xWoWrvn+3j0BpyTWuMr+UcYZ4nJV68MmXv9jBD1tqDkMijm2sc jjTtDqOdp5NFikGI+p9qMYJI0KR/WMFL6j8LU/vFyHmZcisbpbdpeZ7NFjNiBkTwpfgI +tCDAbKvOmkw+VpNtsfBe3FIaVqi0c3NXUT2db5rB2lj+gf/GA986yHwHtv89LqN70I7 EMnto3NRvWvkdzsAhaJrUomBNcLYCGxcyHtO+mnMMJy+y0uBD66eVOhkfPUoQrGUtjCX EMkG919AjitDUEMrAOiLfIwG13yV8O9Xm7RGMM9dQ0UyxeJshqX6YC3r9pga3y1BrqR5 eo/A== X-Gm-Message-State: AA+aEWaVPArRDhtjFneAiCyHUzbQlBqeCUPh94xHdfCLlF6t7iGTdov6 Igj+Fz1UcNdXZkAgVmmEoOZ53XkaTLXb+Zq7ZdoUOA== X-Google-Smtp-Source: AFSGD/UIrk/UQqnmmbpOLzzEnlDaoRWMo2fUmCItHVeB/+gS6JpXqZzvKmYDycFMLLgJCj/Ws4SHI9ny25jBjJqBKTQ= X-Received: by 2002:a05:6000:100f:: with SMTP id a15mr17687035wrx.298.1544613414227; Wed, 12 Dec 2018 03:16:54 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Karl Wright Date: Wed, 12 Dec 2018 06:16:41 -0500 Message-ID: Subject: Re: Language Detection for the data To: user@manifoldcf.apache.org Content-Type: multipart/alternative; boundary="000000000000ded316057cd15529" --000000000000ded316057cd15529 Content-Type: text/plain; charset="UTF-8" Hi Nikita, This is occurring 
because en_GB does not have a translations file. It's a warning and the code
falls back to using en_US.

Karl

On Wed, Dec 12, 2018 at 4:39 AM Nikita Ahuja <nikita@smartshore.nl> wrote:

> Hi Karl,
>
> Thanks for the suggestion; language detection for the data and content now
> works. But there is one issue while ingesting the records into the
> Elasticsearch index. It is recorded in the log file as:
>
> ERROR 2018-12-11T19:19:37,637 (qtp348148678-561) - Missing resource bundle 'org.apache.manifoldcf.ui.i18n.common' for locale 'en_GB': Can't find bundle for base name org.apache.manifoldcf.ui.i18n.common, locale en_GB; trying en
> java.util.MissingResourceException: Can't find bundle for base name org.apache.manifoldcf.ui.i18n.common, locale en_GB
>     at java.base/java.util.ResourceBundle.throwMissingResourceException(Unknown Source) ~[?:?]
>     at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source) ~[?:?]
>     at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source) ~[?:?]
>     at java.base/java.util.ResourceBundle.getBundle(Unknown Source) ~[?:?]
>     at org.apache.manifoldcf.core.i18n.Messages.getResourceBundle(Messages.java:132) [mcf-core.jar:?]
>     at org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:178) [mcf-core.jar:?]
>     at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:216) [mcf-core.jar:?]
>     at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:343) [mcf-ui-core.jar:?]
>     at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:119) [mcf-ui-core.jar:?]
>     at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:67) [mcf-ui-core.jar:?]
>     at org.apache.jsp.index_jsp._jspService(index_jsp.java:212) [jsp/:?]
>
> Can this be resolved by adding resource files, or does another solution
> have to be chosen?
>
> On Wed, Nov 21, 2018 at 5:36 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Nikita,
>>
>> The Tika transformer may well generate a language attribute. You would
>> need to check with Tika, though, to know for sure, and under what
>> conditions it might generate this. It should not be confused with document
>> format detection, which Tika definitely does in order to extract content.
>>
>> It looks like language detection in Tika either comes from document
>> metadata already present, or via a Java interface that you need to
>> explicitly call to get it. If your documents need the latter, the Tika
>> connector does not currently do this:
>>
>> https://tika.apache.org/1.19.1/detection.html#Language_Detection
>>
>> and
>>
>> https://tika.apache.org/1.19.1/examples.html#Language_Identification
>>
>> The documentation does not clarify whether a language attribute is
>> actually generated; the architecture seems more suited to plug in machine
>> translators for your content. I suspect you would need to run the output
>> of the Tika transformer into the NullOutputConnector in order to see what
>> attributes are being generated to know for sure.
>>
>> Karl
>>
>>
>> On Wed, Nov 21, 2018 at 4:45 AM Nikita Ahuja <nikita@smartshore.nl>
>> wrote:
>>
>>> Hi All,
>>>
>>> Thanks for the timely replies. I am mainly concerned with language
>>> detection for the .doc, .pdf, or any other data present in the
>>> repository.
>>>
>>> As per my understanding, Tika Transformation provides functionality for
>>> the same. But there is no output for the language of the documents.
>>>
>>> The sequence used is:
>>> 1. Repository Connector (Web)
>>> 2. Tika Transformation
>>> 3. MetaData Adjuster
>>> 4. Output Connector (Elastic)
>>>
>>> Is anything being missed here for the language detection of the
>>> documents?
>>>
>>> On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI <furkankamaci@gmail.com>
>>> wrote:
>>>
>>>> Hi Nikita,
>>>>
>>>> First of all, OpenNLP is a transformation connector at ManifoldCF and
>>>> should be enabled by default. It extracts named entities (people,
>>>> locations and organizations) from documents.
>>>>
>>>> You need to download trained models to run the OpenNLP connector. You
>>>> can find them here: https://opennlp.apache.org/models.html
>>>>
>>>> Check here for a detailed explanation:
>>>> https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
>>>>
>>>> Feel free to ask any questions when you try to integrate it. Also,
>>>> please explain where you get stuck if you cannot get it to run.
>>>>
>>>> Kind Regards,
>>>> Furkan KAMACI
>>>>
>>>>
>>>> On Wed, Nov 21, 2018 at 11:54 AM Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Nikita,
>>>>>
>>>>> Can you be more specific when you say "OpenNLP is not working"? All
>>>>> that this connector does is integrate OpenNLP as a ManifoldCF
>>>>> transformer. It uses a specific directory to deliver the models that
>>>>> OpenNLP uses to match and extract content from documents. Thus, you can
>>>>> provide any models you want that are compatible with the OpenNLP
>>>>> version we're including.
>>>>>
>>>>> Can you describe the steps you are taking and what you are seeing?
>>>>>
>>>>> On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a query about detecting the language of the records/data that
>>>>>> are going to be ingested into the Output Connector.
>>>>>>
>>>>>> The OpenNLP connector, configured as per the user documentation, is
>>>>>> not working appropriately for the detection. Please suggest whether
>>>>>> NLP has to be used and, if so, how it should be used, or whether there
>>>>>> is any other solution for this.
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Nikita
>>>>>> Email: nikita@smartshore.nl
>>>>>> United Sources Service Pvt. Ltd.
>>>>>> a "Smartshore" Company
>>>>>> Mobile: +91 99 888 57720
>>>>>> http://www.smartshore.nl
>>>>>>
>>>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Nikita
>>> Email: nikita@smartshore.nl
>>> United Sources Service Pvt. Ltd.
>>> a "Smartshore" Company
>>> Mobile: +91 99 888 57720
>>> http://www.smartshore.nl
>>>
>>
>
> --
> Thanks and Regards,
> Nikita
> Email: nikita@smartshore.nl
> United Sources Service Pvt. Ltd.
> a "Smartshore" Company
> Mobile: +91 99 888 57720
> http://www.smartshore.nl
>

--000000000000ded316057cd15529
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--000000000000ded316057cd15529--
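
Archive note: the fallback Karl describes (a request for en_GB ends up served by en_US, with the log line reporting "trying en" on the way) can be pictured with a small sketch. This is written in Python for brevity and is not ManifoldCF's Java code; the bundle table, the function names, and the exact order (full locale, then language only, then en_US) are assumptions for illustration, inferred from the log message and from Karl's reply.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the translation files shipped with an application.
# Only en_US exists here, mirroring the situation described in the thread.
AVAILABLE_BUNDLES: Dict[str, Dict[str, str]] = {
    "en_US": {"greeting": "Hello"},
}

# Assumed final fallback locale (the thread says the code falls back to en_US).
DEFAULT_LOCALE = "en_US"


def candidate_locales(requested: str, default: str = DEFAULT_LOCALE) -> List[str]:
    """Return the locales to try, most specific first, without duplicates."""
    language = requested.split("_")[0]
    candidates: List[str] = []
    for locale in (requested, language, default):
        if locale not in candidates:
            candidates.append(locale)
    return candidates


def resolve_bundle(requested: str) -> Optional[str]:
    """Return the first candidate locale that has a bundle, reporting each miss."""
    for locale in candidate_locales(requested):
        if locale in AVAILABLE_BUNDLES:
            return locale
        # A miss is only reported; the lookup continues with the next candidate.
        print(f"Missing resource bundle for locale '{locale}'; trying next candidate")
    return None


if __name__ == "__main__":
    print(resolve_bundle("en_GB"))
```

Under these assumptions a missing en_GB bundle produces a logged message but no functional failure, which matches Karl's description of the stack trace as harmless. Supplying a bundle for en_GB (or for en) would stop the lookup earlier and silence the message.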