manifoldcf-user mailing list archives

From Nikita Ahuja <nik...@smartshore.nl>
Subject Re: Language Detection for the data
Date Tue, 18 Dec 2018 07:07:02 GMT
Thanks Karl,

But I would like to know how to add these files so that the warnings no
longer appear and the flow runs smoothly.

Is there any way to do that?

Thanks,
Nikita

On Wed, Dec 12, 2018 at 4:47 PM Karl Wright <daddywri@gmail.com> wrote:

> Hi Nikita,
>
> This is occurring because en_GB does not have a translations file.  It's a
> warning and the code falls back to using en_US.
>
> Karl
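
For readers wondering which file is missing: the sketch below uses only the standard java.util.ResourceBundle.Control API to list the properties resources that the default Java lookup would try for the base name and locale in the log. The class name BundleLookup and its method are hypothetical, made up for this illustration. Whether ManifoldCF resolves bundles in exactly this way, and where a new file would have to be placed in its build or jars, is an assumption that should be checked against the ManifoldCF sources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical helper for illustration; not part of ManifoldCF.
final class BundleLookup {

    // Returns the .properties resource paths that the default ResourceBundle
    // lookup would try for the given base name and locale, most specific first.
    static List<String> candidateResources(String baseName, Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
        List<String> resources = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            // e.g. base name + "_en_GB" for the en_GB candidate
            String bundleName = control.toBundleName(baseName, candidate);
            // turns the dotted bundle name into a classpath resource path
            resources.add(control.toResourceName(bundleName, "properties"));
        }
        return resources;
    }

    public static void main(String[] args) {
        // Base name and locale taken from the log message quoted in this thread.
        String baseName = "org.apache.manifoldcf.ui.i18n.common";
        for (String resource : candidateResources(baseName, Locale.forLanguageTag("en-GB"))) {
            System.out.println(resource);
        }
    }
}
```

Under the standard naming convention, the first path listed is the en_GB-specific properties file. A file with that name on the classpath is what a plain ResourceBundle lookup would pick up before it falls back to the less specific candidates.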
>
>
> On Wed, Dec 12, 2018 at 4:39 AM Nikita Ahuja <nikita@smartshore.nl> wrote:
>
>> Hi Karl,
>>
>> Thanks for the suggestion; the language of the data and content is now
>> detected. But there is one issue while ingesting the records into the
>> Elasticsearch index, which is recorded in the log file as:
>>
>> ERROR 2018-12-11T19:19:37,637 (qtp348148678-561) - Missing resource
>> bundle 'org.apache.manifoldcf.ui.i18n.common' for locale 'en_GB': Can't
>> find bundle for base name org.apache.manifoldcf.ui.i18n.common, locale
>> en_GB; trying en
>> java.util.MissingResourceException: Can't find bundle for base name
>> org.apache.manifoldcf.ui.i18n.common, locale en_GB
>>     at
>> java.base/java.util.ResourceBundle.throwMissingResourceException(Unknown
>> Source) ~[?:?]
>>     at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source)
>> ~[?:?]
>>     at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source)
>> ~[?:?]
>>     at java.base/java.util.ResourceBundle.getBundle(Unknown Source) ~[?:?]
>>     at
>> org.apache.manifoldcf.core.i18n.Messages.getResourceBundle(Messages.java:132)
>> [mcf-core.jar:?]
>>     at
>> org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:178)
>> [mcf-core.jar:?]
>>     at
>> org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:216)
>> [mcf-core.jar:?]
>>     at
>> org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:343)
>> [mcf-ui-core.jar:?]
>>     at
>> org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:119)
>> [mcf-ui-core.jar:?]
>>     at
>> org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:67)
>> [mcf-ui-core.jar:?]
>>     at org.apache.jsp.index_jsp._jspService(index_jsp.java:212) [jsp/:?]
>>
>>
>> Can this be resolved by adding resource files, or is another solution
>> required?
>>
>> On Wed, Nov 21, 2018 at 5:36 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Nikita,
>>>
>>> The Tika transformer may well generate a language attribute.  You would
>>> need to check with Tika, though, to know for sure, and under what
>>> conditions it might generate this.  It should not be confused with document
>>> format detection, which Tika definitely does in order to extract content.
>>>
>>> It looks like language detection in Tika either comes from document
>>> metadata already present, or via a Java interface that you need to
>>> explicitly call to get it.  If your documents need the latter, the Tika
>>> connector does not currently do this:
>>>
>>> https://tika.apache.org/1.19.1/detection.html#Language_Detection
>>>
>>> and
>>>
>>> https://tika.apache.org/1.19.1/examples.html#Language_Identification
>>>
>>> The documentation does not clarify whether a language attribute is
>>> actually generated; the architecture seems more suited to plug in machine
>>> translators for your content.  I suspect you would need to run the output
>>> of the Tika transformer into the NullOutputConnector in order to see what
>>> attributes are being generated to know for sure.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Nov 21, 2018 at 4:45 AM Nikita Ahuja <nikita@smartshore.nl>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Thanks for the timely replies. But I am basically concerned with
>>>> language detection for .doc, .pdf, or any other data present in the
>>>> repository.
>>>>
>>>> As I understand it, the Tika transformation provides this functionality,
>>>> but no language is output for the documents.
>>>>
>>>> The sequence used is:
>>>> 1. Repository Connector (Web)
>>>> 2. Tika Transformation
>>>> 3. Metadata Adjuster
>>>> 4. Output Connector (Elastic)
>>>>
>>>> Is anything missing here for language detection of the documents?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI <furkankamaci@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Nikita,
>>>>>
>>>>> First of all, OpenNLP is a transformation connector in ManifoldCF and
>>>>> should be enabled by default. It extracts named entities (people,
>>>>> locations and organizations) from documents.
>>>>>
>>>>> You need to download trained models to run the OpenNLP connector. You
>>>>> can find them here: https://opennlp.apache.org/models.html
>>>>>
>>>>> Check here for a detailed explanation:
>>>>> https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
>>>>>
>>>>> Feel free to ask any questions when you try to integrate it. Also,
>>>>> please describe where you get stuck if you cannot get it running.
>>>>>
>>>>> Kind Regards,
>>>>> Furkan KAMACI
>>>>>
>>>>>
>>>>> On Wed, Nov 21, 2018 at 11:54 AM Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Nikita,
>>>>>>
>>>>>> Can you be more specific when you say "OpenNLP is not working"?  All
>>>>>> that this connector does is integrate OpenNLP as a ManifoldCF transformer.
>>>>>> It uses a specific directory to deliver the models that OpenNLP uses to
>>>>>> match and extract content from documents.  Thus, you can provide any models
>>>>>> you want that are compatible with the OpenNLP version we're including.
>>>>>>
>>>>>> Can you describe the steps you are taking and what you are seeing?
>>>>>>
>>>>>> On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a query about detecting the language of the records/data
>>>>>>> that are going to be ingested into the Output Connector.
>>>>>>>
>>>>>>> I tried the OpenNLP connector for the detection as per the user
>>>>>>> documentation, but it is not working appropriately. Please suggest
>>>>>>> whether NLP has to be used and, if so, how it should be used, or
>>>>>>> whether there is any other solution for this.
>>>>>>>
>>>>>>> --
>>>>>>> Thanks and Regards,
>>>>>>> Nikita
>>>>>>> Email: nikita@smartshore.nl
>>>>>>> United Sources Service Pvt. Ltd.
>>>>>>> a "Smartshore" Company
>>>>>>> Mobile: +91 99 888 57720
>>>>>>> http://www.smartshore.nl
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>

-- 
Thanks and Regards,
Nikita
Email: nikita@smartshore.nl
United Sources Service Pvt. Ltd.
a "Smartshore" Company
Mobile: +91 99 888 57720
http://www.smartshore.nl
