manifoldcf-user mailing list archives

From "Corey, Stephen" <COR...@ecu.edu>
Subject ManifoldCF and ElasticSearch
Date Tue, 01 Dec 2015 15:17:42 GMT
I'm putting together a proof-of-concept for crawling our website content with MCF, and indexing
it with ES. At a basic level, everything seems to be working. What I'm trying to understand
is that when MCF indexes web content, the HTML is stored inside an object called file in a
property called _content. When this is added to the ES index, all the HTML is Base64 encoded.
I believe this is preventing ES from properly searching the field.
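
To illustrate the encoding described above, here is a minimal Python sketch. It assumes a hypothetical document shaped like the one described (an object called file with a property called _content); the sample HTML and the helper name decode_content are made up for illustration and are not taken from MCF or ES.

```python
import base64

# Hypothetical sample page; real crawled content would be much larger.
sample_html = "<html><body><h1>Hello</h1></body></html>"

# Hypothetical indexed document, shaped as described above: an object
# called "file" whose "_content" property holds Base64 text.
doc = {
    "file": {
        "_content": base64.b64encode(sample_html.encode("utf-8")).decode("ascii")
    }
}


def decode_content(document):
    """Return the decoded text stored in file._content."""
    encoded = document["file"]["_content"]
    return base64.b64decode(encoded).decode("utf-8")


decoded = decode_content(doc)
print(doc["file"]["_content"])  # the Base64 form, with no readable markup
print(decoded)                  # the original HTML
```

The Base64 form contains no readable words or tags, which would be consistent with plain-text searches against that field finding nothing.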

Is this Base64 encoding to be expected, or do I need to change something?

Does anyone have a walkthrough of using MCF to crawl web content and output it to ES? I've seen
many guides for each system, but never one that combines the two. I'd prefer to
avoid using Nutch for crawling, since it lacks any UI for management.


Stephen Corey
Technology Consultant
East Carolina University
coreys@ecu.edu

