manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Sending tika extracted metadata, text and original binary content from a document to elasticsearch
Date Sun, 09 Aug 2015 20:48:18 GMT
I'm sorry, the Tika extractor only sends on the extracted text for
indexing, not the binary document.

In order to do what you want, you will need to write your own transformation
connector.
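
As a rough illustration of what such a connector would need to produce, the
sketch below builds one JSON document that carries all three pieces: the
extracted metadata, the plain text, and the original bytes encoded as base64
(the encoding the elasticsearch Attachment plugin takes for binary content).
It is written in Python only to show the shape of the data; a real ManifoldCF
transformation connector would be written in Java against the ManifoldCF
connector API. The function name and the field names "content_text" and
"file" are hypothetical and would have to match your own index mapping.

```python
import base64
import json


def build_combined_document(binary_content, extracted_text, metadata):
    """Return a JSON string holding metadata, plain text, and the original bytes.

    Field names below are assumptions made for illustration, not names
    defined by ManifoldCF or elasticsearch.
    """
    doc = dict(metadata)                  # copy so the caller's dict is untouched
    doc["content_text"] = extracted_text  # hypothetical field for the plain text
    # Binary data cannot be embedded in JSON directly, so encode it as base64.
    doc["file"] = base64.b64encode(binary_content).decode("ascii")
    return json.dumps(doc)


if __name__ == "__main__":
    body = build_combined_document(
        b"\xd0\xcf\x11\xe0",              # stand-in bytes, not a real Word file
        "Plain text of the document",
        {"author": "Mike", "title": "Example"},
    )
    print(body)
```

The point of the design is that the connector keeps a copy of the incoming
binary stream before or alongside extraction, so the binary is still
available when the document is sent on to the output connector.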

Thanks,
Karl


On Sun, Aug 9, 2015 at 3:06 PM, Mike Caceres <miguel151@hotmail.com> wrote:

> Hello,
>
> Let's say I have a single MS-Word document and would like to use ManifoldCF
> to crawl it and to send it to elasticsearch using the Attachment plugin.
>
> Currently I am able to successfully crawl the MS-Word document and to send
> to elasticsearch two things: a) the metadata extracted from the MS-Word
> document by the Tika Content Extractor as well as b) the plain text
> detected by the Boilerplate "Extract Everything".
>
> If I remove the Tika Content Extractor from the pipeline, I am able to
> send the actual binary data from the MS-Word document and the elasticsearch
> Attachment plugin is able to index it, but I do not have as rich metadata
> associated with the document as when I use the Tika Content Extractor.
>
> Now I would like to be able to combine both, so at the end I have in
> elasticsearch three things: a) the metadata extracted by the Tika Content
> Extractor b) the plain text of the document and c) the binary data from the
> MS-Word (so I can do downstream processing as needed).
>
> How can I achieve that?
>
> Thank you!
>
> Mike
>
