manifoldcf-user mailing list archives

From Roman Šitina <ro...@sitina.cz>
Subject Re: ManifoldCF and ElasticSearch
Date Tue, 01 Dec 2015 16:00:39 GMT
Hello,

Map the file field as an attachment using the mapper-attachments plugin:
https://github.com/elastic/elasticsearch-mapper-attachments
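
To illustrate, here is a minimal sketch of what such a mapping might look like, assuming the plugin is installed on the ES node. The type name "document" is hypothetical, and the exact mapping MCF needs may differ. The helper functions are illustrative only and are not part of MCF or ES. The plugin expects the field content to be Base64 encoded, which is why the HTML shows up that way in _content.

```python
import base64
import json


def build_attachment_mapping(doc_type="document", field="file"):
    """Build an index mapping that declares `field` as an attachment.

    `doc_type` is a hypothetical type name; `field` matches the object
    name ("file") described in the original question.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "attachment"}
                }
            }
        }
    }


def build_document(html, field="file"):
    """Build a document body with the HTML Base64 encoded in `_content`."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {field: {"_content": encoded}}


if __name__ == "__main__":
    # Body to send when creating the index (for example with an HTTP PUT).
    print(json.dumps(build_attachment_mapping(), indent=2))
    # Shape of a document as it would be indexed.
    print(json.dumps(build_document("<html><body>Hello</body></html>")))
```

With a mapping like this in place before the first crawl, the plugin should decode and extract the text at index time, so the field becomes searchable as text.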

Roman

On 1 December 2015 at 16:17, Corey, Stephen <COREYS@ecu.edu> wrote:
> I’m putting together a proof-of-concept for crawling our website content
> with MCF, and indexing it with ES. At a basic level, everything seems to be
> working. What I’m trying to understand is that when MCF indexes web content,
> the HTML is stored inside an object called file in a property called
> _content. When this is added to the ES index, all the HTML is Base64
> encoded. I believe this is preventing ES from properly searching the field.
>
>
>
> Is this Base64 encoding to be expected, or do I need to change something?
>
>
>
> Does anyone have a walkthrough of using MCF to crawl web content, and output
> to ES? I’ve seen many guides for both systems, but never something that
> combines the two. I’d prefer to avoid using Nutch for crawling, since it
> lacks any UI for management.
>
>
>
>
>
> Stephen Corey
>
> Technology Consultant
> East Carolina University
>
> coreys@ecu.edu
>
>
