Hi Nikita,

The Tika transformer may well generate a language attribute. You would need to check with Tika, though, to know for sure, and under what conditions it might generate this. It should not be confused with document format detection, which Tika definitely does in order to extract content.
It looks like language detection in Tika either comes from document metadata already present, or via a Java interface that you need to explicitly call to get it. If your documents need the latter, the Tika connector does not currently do this:
The documentation does not clarify whether a language attribute is actually generated; the architecture seems more suited to plug in machine translators for your content. I suspect you would need to run the output of the Tika translator into the NullOutputConnector in order to see what attributes are being generated to know for sure.
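
As a rough illustration of the two routes mentioned above (a language already present in the document metadata versus an explicit detection call), here is a minimal Java sketch. The names `LanguageResolution`, `CANDIDATE_KEYS`, and `detector` are hypothetical and are not Tika or ManifoldCF APIs. The `detector` argument only stands in for whatever explicit language-detection call you would make, and the metadata key names are assumptions that need to be checked against what your parser actually emits.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Tika or ManifoldCF.
final class LanguageResolution {

    // Assumed metadata keys that might carry a declared language.
    // Verify against the attributes your pipeline really produces.
    static final String[] CANDIDATE_KEYS = {"language", "dc:language", "Content-Language"};

    // Route 1: use a language already present in the metadata.
    // Route 2: otherwise, explicitly call a detector on the extracted text.
    static Optional<String> resolve(Map<String, String> metadata,
                                    String text,
                                    Function<String, Optional<String>> detector) {
        for (String key : CANDIDATE_KEYS) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return Optional.of(value.trim());
            }
        }
        if (text == null || text.trim().isEmpty()) {
            return Optional.empty(); // nothing to run detection on
        }
        return detector.apply(text);
    }

    public static void main(String[] args) {
        // Stub detector that always answers "en"; a real one would analyse the text.
        Function<String, Optional<String>> stub = t -> Optional.of("en");

        // Metadata already declares a language, so the detector is never called.
        System.out.println(resolve(Map.of("dc:language", "fr"), "Bonjour", stub).get()); // prints "fr"

        // No language in the metadata, so the explicit detector call decides.
        System.out.println(resolve(Map.of(), "Hello", stub).get()); // prints "en"
    }
}
```

If your documents fall into the second case, the explicit call is the step that would have to be added somewhere in the pipeline.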
Karl

On Wed, Nov 21, 2018 at 4:45 AM Nikita Ahuja <firstname.lastname@example.org> wrote:

Hi All,

Thanks for the timely replies. My main concern, though, is language detection for the .doc, .pdf, and any other data present in the repository.
As per my understanding Tika Transformation provides functionality for the same.
But there is no output for the language of the documents.
The sequence used is:
1. Repository Connector (Web)
2. Tika Transformation
3. MetaData Adjuster
4. Output Connector (Elastic)
Is anything missing here for language detection of the documents?

On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI <email@example.com> wrote:

Hi Nikita,

First of all, OpenNLP is a transformation connector in ManifoldCF and should be enabled by default. It extracts named entities (people, locations and organizations) from documents.

You need to download trained models to run the OpenNLP connector. You can find them here: https://opennlp.apache.org/models.html

Check here for a detailed explanation: https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector

Feel free to ask any questions when you try to integrate it. Also, please describe where it fails if you cannot get it to run.

Kind Regards,
Furkan KAMACI

On Wed, Nov 21, 2018 at 11:54 AM Karl Wright <firstname.lastname@example.org> wrote:

Hi Nikita,

Can you be more specific when you say "OpenNLP is not working"? All that this connector does is integrate OpenNLP as a ManifoldCF transformer. It uses a specific directory to deliver the models that OpenNLP uses to match and extract content from documents. Thus, you can provide any models you want that are compatible with the OpenNLP version we're including.
Can you describe the steps you are taking and what you are seeing?

On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja <email@example.com> wrote:

Hi,

I have a query about detecting the language of the records/data that are going to be ingested into the Output Connector.
I set up the OpenNLP connector for this detection as per the user documentation, but it is not working appropriately. Please suggest whether NLP has to be used and, if yes, how it should be used, or whether there is any other solution for this.