manifoldcf-user mailing list archives

From "Corey, Stephen" <COR...@ecu.edu>
Subject RE: ManifoldCF and ElasticSearch
Date Tue, 01 Dec 2015 18:09:41 GMT
Thanks Karl!

After creating a new mapping in ES, specifying the ‘file’ field as an attachment, I can
now search the full text of the web content. That part is working great now.
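
A minimal sketch of what such a mapping body might look like, assuming the mapper-attachment plugin is installed and provides the "attachment" field type. The type name "document" is a placeholder rather than something taken from this thread, and where the body goes (index creation or a put-mapping call) depends on the ES version in use:

```python
import json


def build_attachment_mapping(doc_type="document", field="file"):
    """Build a mapping body that declares `field` as an attachment.

    Assumptions: `doc_type` is a hypothetical type name, and the
    "attachment" field type comes from the mapper-attachment plugin.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "attachment"}
                }
            }
        }
    }


# Serialize to JSON; this string is what would be sent to ES.
mapping_body = json.dumps(build_attachment_mapping(), indent=2)
print(mapping_body)
```

The sketch only builds the JSON body; it does not contact an ES server.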

Does MCF capture the page title (in the <title> tag) anywhere?



From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Tuesday, December 1, 2015 11:00 AM
To: user@manifoldcf.apache.org
Subject: Re: ManifoldCF and ElasticSearch

Hi Stephen,

The integration with ES is supposed to go through the mapper-attachment plugin, which at one
point did accept Base64-encoded "attachments" and index them.  This is what's currently implemented
in the ElasticSearch output connector.

Unfortunately, however, with ElasticSearch, the level of backwards compatibility isn't always
what we'd like, so I wouldn't be surprised if something changed or if you needed special configuration
now to do it that way.  I've been unable to keep up with what ES is doing but I'm happy to
make changes to the output connector if you have information that the current implementation
is incorrect, and have details about how to make it work properly in a standard, modern ES
environment.  But I'd start by making sure there's actually something broken by looking at
the mapper-attachment plugin.

Thanks,
Karl


On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <COREYS@ecu.edu> wrote:
I’m putting together a proof-of-concept for crawling our website content with MCF, and indexing
it with ES. At a basic level, everything seems to be working. What I’m trying to understand
is that when MCF indexes web content, the HTML is stored inside an object called file in a
property called _content. When this is added to the ES index, all the HTML is Base64 encoded.
I believe this is preventing ES from properly searching the field.
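
To illustrate the shape described above, here is a small sketch of HTML being Base64 encoded into a file._content property and decoded again. The sample HTML is made up, and the document layout is inferred from the description in this message rather than from the connector source:

```python
import base64

# Hypothetical page content; any HTML string would do.
html = "<html><head><title>Example</title></head><body>Hello</body></html>"

# Base64-encode the raw bytes, as the indexed documents appear to be.
encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")

# Assumed document shape: an object "file" with a "_content" property.
document = {"file": {"_content": encoded}}

# Decoding recovers the original HTML. A plain string field would only
# ever see the encoded form, not the readable text.
decoded = base64.b64decode(document["file"]["_content"]).decode("utf-8")
print(decoded == html)
```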

Is this Base64 encoding to be expected, or do I need to change something?

Does anyone have a walkthrough of using MCF to crawl web content, and output to ES? I’ve
seen many guides for both systems, but never something that combines the two. I’d prefer
to avoid using Nutch for crawling, since it lacks any UI for management.


Stephen Corey
Technology Consultant
East Carolina University
coreys@ecu.edu

