From user-return-5631-archive-asf-public=cust-asf.ponee.io@manifoldcf.apache.org Wed Dec 12 10:39:05 2018
Mailing-List: contact user-help@manifoldcf.apache.org; run by ezmlm
Reply-To: user@manifoldcf.apache.org
Delivered-To: mailing list user@manifoldcf.apache.org
MIME-Version: 1.0
From: Nikita Ahuja
Date: Wed, 12 Dec 2018 15:08:48 +0530
Subject: Re: Language Detection for the data
To: user@manifoldcf.apache.org
Content-Type: multipart/alternative; boundary="000000000000b23977057ccff7d0"

--000000000000b23977057ccff7d0
Content-Type: text/plain; charset="UTF-8"

Hi
Karl,

Thanks for the suggestion. Language detection for the data and content is
working now. However, there is one issue while ingesting the records into
the Elasticsearch index. It is recorded in the log file as:

ERROR 2018-12-11T19:19:37,637 (qtp348148678-561) - Missing resource bundle 'org.apache.manifoldcf.ui.i18n.common' for locale 'en_GB': Can't find bundle for base name org.apache.manifoldcf.ui.i18n.common, locale en_GB; trying en
java.util.MissingResourceException: Can't find bundle for base name org.apache.manifoldcf.ui.i18n.common, locale en_GB
    at java.base/java.util.ResourceBundle.throwMissingResourceException(Unknown Source) ~[?:?]
    at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source) ~[?:?]
    at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source) ~[?:?]
    at java.base/java.util.ResourceBundle.getBundle(Unknown Source) ~[?:?]
    at org.apache.manifoldcf.core.i18n.Messages.getResourceBundle(Messages.java:132) [mcf-core.jar:?]
    at org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:178) [mcf-core.jar:?]
    at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:216) [mcf-core.jar:?]
    at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:343) [mcf-ui-core.jar:?]
    at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:119) [mcf-ui-core.jar:?]
    at org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:67) [mcf-ui-core.jar:?]
    at org.apache.jsp.index_jsp._jspService(index_jsp.java:212) [jsp/:?]

Can this be resolved by adding resource files, or does some other solution
have to be adopted?

On Wed, Nov 21, 2018 at 5:36 PM Karl Wright wrote:

> Hi Nikita,
>
> The Tika transformer may well generate a language attribute.  You would
> need to check with Tika, though, to know for sure, and under what
> conditions it might generate this.  It should not be confused with document
> format detection, which Tika definitely does in order to extract content.
>
> It looks like language detection in Tika either comes from document
> metadata already present, or via a Java interface that you need to
> explicitly call to get it.  If your documents need the latter, the Tika
> connector does not currently do this:
>
> https://tika.apache.org/1.19.1/detection.html#Language_Detection
>
> and
>
> https://tika.apache.org/1.19.1/examples.html#Language_Identification
>
> The documentation does not clarify whether a language attribute is
> actually generated; the architecture seems more suited to plug in machine
> translators for your content.  I suspect you would need to run the output
> of the Tika transformer into the NullOutputConnector in order to see what
> attributes are being generated to know for sure.
>
> Karl
>
>
> On Wed, Nov 21, 2018 at 4:45 AM Nikita Ahuja wrote:
>
>> Hi All,
>>
>> Thanks for the timely replies. My main concern, though, is detecting the
>> language of the .doc, .pdf, or any other data present in the repository.
>>
>> As I understand it, the Tika transformation provides this functionality,
>> but there is no output for the language of the documents.
>>
>> The sequence used is:
>> 1. Repository Connector (Web)
>> 2. Tika Transformation
>> 3. Metadata Adjuster
>> 4. Output Connector (Elastic)
>>
>> Is anything missing here for language detection of the documents?
>>
>>
>> On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI wrote:
>>
>>> Hi Nikita,
>>>
>>> First of all, OpenNLP is a transformation connector in ManifoldCF and
>>> should be enabled by default. It extracts named entities (people, locations
>>> and organizations) from documents.
>>>
>>> You should download trained models to run the OpenNLP connector.
>>> You can check here for that purpose: https://opennlp.apache.org/models.html
>>>
>>> Check here for a detailed explanation:
>>> https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
>>>
>>> Feel free to ask any questions when you try to integrate it. Also, please
>>> describe what goes wrong if you cannot get it to run.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>>
>>> On Wed, Nov 21, 2018 at 11:54 AM Karl Wright wrote:
>>>
>>>> Hi Nikita,
>>>>
>>>> Can you be more specific when you say "OpenNLP is not working"?  All
>>>> that this connector does is integrate OpenNLP as a ManifoldCF transformer.
>>>> It uses a specific directory to deliver the models that OpenNLP uses to
>>>> match and extract content from documents.  Thus, you can provide any models
>>>> you want that are compatible with the OpenNLP version we're including.
>>>>
>>>> Can you describe the steps you are taking and what you are seeing?
>>>>
>>>> On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a query about detecting the language of the records/data that
>>>>> are going to be ingested into the Output Connector.
>>>>>
>>>>> The OpenNLP connector is not working for language detection as
>>>>> described in the user documentation. Please suggest whether NLP has to
>>>>> be used and, if so, how, or whether there is another solution for this.
>>>>>
>>>>> --
>>>>> Thanks and Regards,
>>>>> Nikita
>>>>> Email: nikita@smartshore.nl
>>>>> United Sources Service Pvt. Ltd.
>>>>> a "Smartshore" Company
>>>>> Mobile: +91 99 888 57720
>>>>> http://www.smartshore.nl
>>>>>
>>>>
>>
>> --
>> Thanks and Regards,
>> Nikita
>> Email: nikita@smartshore.nl
>> United Sources Service Pvt. Ltd.
>> a "Smartshore" Company
>> Mobile: +91 99 888 57720
>> http://www.smartshore.nl
>>
>

--
Thanks and Regards,
Nikita
Email: nikita@smartshore.nl
United Sources Service Pvt. Ltd.
a "Smartshore" Company
Mobile: +91 99 888 57720
http://www.smartshore.nl

--000000000000b23977057ccff7d0--
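
Note on the logged exception: the error text in the message ends with
"trying en", and the trace passes through index_jsp, which suggests a UI
message lookup that falls back from en_GB to a less specific locale. The
following is a minimal sketch of that kind of fallback using only the JDK.
It is not ManifoldCF's code: the class name LocaleFallbackDemo, the base
name "demo.common", and the file name in the map are hypothetical, and the
map merely stands in for bundle files that would normally sit on the
classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.ResourceBundle;

public class LocaleFallbackDemo {

    // Lists the locale names the JDK's default bundle lookup would consider,
    // most specific first. Locale.ROOT is rendered as the empty string.
    public static List<String> candidateNames(Locale requested) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<>();
        // "demo.common" is a hypothetical base name; it is only passed through.
        for (Locale candidate : control.getCandidateLocales("demo.common", requested)) {
            names.add(candidate.toString());
        }
        return names;
    }

    // Returns the first available bundle along the candidate chain, or null.
    // 'available' is an assumed stand-in for bundles present on the classpath,
    // keyed by locale name ("en_GB", "en", or "" for the root bundle).
    public static String resolve(Locale requested, Map<String, String> available) {
        for (String name : candidateNames(requested)) {
            if (available.containsKey(name)) {
                return available.get(name);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> available = new HashMap<>();
        available.put("en", "common_en.properties"); // only a plain "en" bundle exists

        System.out.println("Candidates for en_GB: " + candidateNames(Locale.UK));
        System.out.println("Resolved bundle: " + resolve(Locale.UK, available));
    }
}
```

If the deployment behaves like this sketch, a missing en_GB bundle would be
noisy in the log but harmless as long as an "en" bundle is found; whether
that holds for a given ManifoldCF install would need to be confirmed against
its own resource files.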