manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF and ElasticSearch
Date Tue, 01 Dec 2015 16:00:26 GMT
Hi Stephen,

The integration with ES is supposed to go through the mapper-attachment
plugin, which at one point did accept Base64-encoded "attachments" and
index them.  This is what's currently implemented in the ElasticSearch
output connector.
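
As a rough illustration of why the stored field looks encoded, here is a minimal Python sketch of wrapping HTML as a Base64 "attachment" and decoding it again. It is not the connector's actual code (the connector is written in Java). The `file` and `_content` names are taken from the description in the quoted message below; the function names are hypothetical and exist only for this example.

```python
import base64
import json


def build_attachment_document(html: str) -> dict:
    """Wrap HTML as a Base64-encoded attachment under file._content.

    The field layout is an assumption based on the quoted message,
    not a statement of what the output connector sends on the wire.
    """
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {"file": {"_content": encoded}}


def decode_attachment_document(doc: dict) -> str:
    """Recover the original HTML from a document built above."""
    return base64.b64decode(doc["file"]["_content"]).decode("utf-8")


if __name__ == "__main__":
    doc = build_attachment_document("<html><body>Hello</body></html>")
    # The JSON body carries only the Base64 text, which is what shows up
    # when the stored source of the document is inspected.
    print(json.dumps(doc))
    print(decode_attachment_document(doc))
```

Seeing Base64 in the stored source is therefore expected with this approach; whether the content is searchable depends on the plugin decoding and extracting it at index time.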

Unfortunately, however, with ElasticSearch, the level of backwards
compatibility isn't always what we'd like, so I wouldn't be surprised if
something changed or if you needed special configuration now to do it that
way.  I've been unable to keep up with what ES is doing but I'm happy to
make changes to the output connector if you have information that the
current implementation is incorrect, and have details about how to make it
work properly in a standard, modern ES environment.  But I'd start by
making sure there's actually something broken by looking at the
mapper-attachment plugin.

Thanks,
Karl


On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <COREYS@ecu.edu> wrote:

> I’m putting together a proof-of-concept for crawling our website content
> with MCF, and indexing it with ES. At a basic level, everything seems to be
> working. What I’m trying to understand is that when MCF indexes web
> content, the HTML is stored inside an object called file in a property
> called _content. When this is added to the ES index, all the HTML is Base64
> encoded. I believe this is preventing ES from properly searching the field.
>
>
>
> Is this Base64 encoding to be expected, or do I need to change something?
>
>
>
> Does anyone have a walkthrough of using MCF to crawl web content, and
> output to ES? I’ve seen many guides for both systems, but never
> something that combines the two. I’d prefer to avoid using Nutch for
> crawling, since it lacks any UI for management.
>
>
>
>
>
> Stephen Corey
>
> Technology Consultant
> East Carolina University
>
> coreys@ecu.edu
>
>
>
