manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Language Detection for the data
Date Wed, 12 Dec 2018 11:16:41 GMT
Hi Nikita,

This is occurring because en_GB does not have a translations file.  It's a
warning and the code falls back to using en_US.
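
As a rough illustration of the fallback idea (this is not ManifoldCF's own
code), the sketch below asks the JDK which locales its default ResourceBundle
lookup would try for en_GB. The class name LocaleFallbackSketch and the helper
candidateLocales are made up for this example; the base name is copied from
the log message quoted further down. How ManifoldCF's Messages class actually
performs its retry is not shown here.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocaleFallbackSketch {

    // Returns, in order, the locales the JDK's default bundle lookup would
    // consider for the given base name and requested locale.
    public static List<Locale> candidateLocales(String baseName, Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        return control.getCandidateLocales(baseName, locale);
    }

    public static void main(String[] args) {
        // Base name taken from the log message; no bundle is actually loaded here.
        String baseName = "org.apache.manifoldcf.ui.i18n.common";
        for (Locale candidate : candidateLocales(baseName, Locale.UK)) {
            // Locale.ROOT prints as an empty string, so label it explicitly.
            String label = candidate.equals(Locale.ROOT) ? "(root bundle)" : candidate.toString();
            System.out.println(label);
        }
    }
}
```

Running main lists en_GB first, then en, then the root bundle, which is
consistent with the "trying en" text in the log: a missing en_GB file only
means a less specific bundle is used instead.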

Karl


On Wed, Dec 12, 2018 at 4:39 AM Nikita Ahuja <nikita@smartshore.nl> wrote:

> Hi Karl,
>
> Thanks for the suggestion; language detection for the data and content now
> works. But there is one issue while ingesting the records into the
> Elasticsearch index, and it is recorded in the log file as:
>
> ERROR 2018-12-11T19:19:37,637 (qtp348148678-561) - Missing resource bundle
> 'org.apache.manifoldcf.ui.i18n.common' for locale 'en_GB': Can't find
> bundle for base name org.apache.manifoldcf.ui.i18n.common, locale en_GB;
> trying en
> java.util.MissingResourceException: Can't find bundle for base name
> org.apache.manifoldcf.ui.i18n.common, locale en_GB
>     at
> java.base/java.util.ResourceBundle.throwMissingResourceException(Unknown
> Source) ~[?:?]
>     at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source)
> ~[?:?]
>     at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source)
> ~[?:?]
>     at java.base/java.util.ResourceBundle.getBundle(Unknown Source) ~[?:?]
>     at
> org.apache.manifoldcf.core.i18n.Messages.getResourceBundle(Messages.java:132)
> [mcf-core.jar:?]
>     at
> org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:178)
> [mcf-core.jar:?]
>     at
> org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:216)
> [mcf-core.jar:?]
>     at
> org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:343)
> [mcf-ui-core.jar:?]
>     at
> org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:119)
> [mcf-ui-core.jar:?]
>     at
> org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:67)
> [mcf-ui-core.jar:?]
>     at org.apache.jsp.index_jsp._jspService(index_jsp.java:212) [jsp/:?]
>
>
> Can this be resolved by adding resource files, or does some other
> solution have to be adopted?
>
> On Wed, Nov 21, 2018 at 5:36 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Nikita,
>>
>> The Tika transformer may well generate a language attribute.  You would
>> need to check with Tika, though, to know for sure, and under what
>> conditions it might generate this.  It should not be confused with document
>> format detection, which Tika definitely does in order to extract content.
>>
>> It looks like language detection in Tika either comes from document
>> metadata already present, or via a Java interface that you need to
>> explicitly call to get it.  If your documents need the latter, the Tika
>> connector does not currently do this:
>>
>> https://tika.apache.org/1.19.1/detection.html#Language_Detection
>>
>> and
>>
>> https://tika.apache.org/1.19.1/examples.html#Language_Identification
>>
>> The documentation does not clarify whether a language attribute is
>> actually generated; the architecture seems more suited to plug in machine
>> translators for your content.  I suspect you would need to run the output
>> of the Tika transformer into the NullOutputConnector in order to see what
>> attributes are being generated to know for sure.
>>
>> Karl
>>
>>
>> On Wed, Nov 21, 2018 at 4:45 AM Nikita Ahuja <nikita@smartshore.nl>
>> wrote:
>>
>>> Hi All,
>>>
>>> Thanks for the timely replies. But I am mainly concerned with
>>> language detection for the .doc, .pdf, or any other data present in the
>>> repository.
>>>
>>> As per my understanding, Tika Transformation provides functionality for
>>> this, but there is no output for the language of the documents.
>>>
>>> The sequence used is:
>>> 1. Repository Connector (Web)
>>> 2. Tika Transformation
>>> 3. MetaData Adjuster
>>> 4. Output Connector (Elastic)
>>>
>>> Is there anything which is being missed here for the language detection
>>> of the documents?
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI <furkankamaci@gmail.com>
>>> wrote:
>>>
>>>> Hi Nikita,
>>>>
>>>> First of all, OpenNLP is a transformation connector in ManifoldCF and
>>>> should be enabled by default. It extracts named entities (people, locations
>>>> and organizations) from documents.
>>>>
>>>> You need to download trained models to run the OpenNLP connector. You can
>>>> find them here: https://opennlp.apache.org/models.html
>>>>
>>>> Check here for a detailed explanation:
>>>> https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector
>>>>
>>>> Feel free to ask any questions when you try to integrate it. Also,
>>>> please describe where it fails if you cannot get it to run.
>>>>
>>>> Kind Regards,
>>>> Furkan KAMACI
>>>>
>>>>
>>>> On Wed, Nov 21, 2018 at 11:54 AM Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Nikita,
>>>>>
>>>>> Can you be more specific when you say "OpenNLP is not working"?  All
>>>>> that this connector does is integrate OpenNLP as a ManifoldCF transformer.
>>>>> It uses a specific directory to deliver the models that OpenNLP uses to
>>>>> match and extract content from documents.  Thus, you can provide any
>>>>> models you want that are compatible with the OpenNLP version we're
>>>>> including.
>>>>>
>>>>> Can you describe the steps you are taking and what you are seeing?
>>>>>
>>>>> On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja <nikita@smartshore.nl>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a query about detecting the language of the records/data that
>>>>>> are going to be ingested into the Output Connector.
>>>>>>
>>>>>> I am using the OpenNLP connector for the detection as per the user
>>>>>> documentation, but it is not working appropriately. Please suggest
>>>>>> whether NLP has to be used and, if yes, how it should be used, or
>>>>>> whether there is any other solution for this.
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Nikita
>>>>>> Email: nikita@smartshore.nl
>>>>>> United Sources Service Pvt. Ltd.
>>>>>> a "Smartshore" Company
>>>>>> Mobile: +91 99 888 57720
>>>>>> http://www.smartshore.nl
>>>>>>
>>>>>
>>>
>>>
>>
>
>
