manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF and ElasticSearch
Date Tue, 01 Dec 2015 18:36:57 GMT
I take that back; the only parsing that is done is in the context of
determining login pages as part of login sequences.  So content is not
parsed at all; it's sent to the output connector intact, along with HTML
headers as metadata.  You can, of course, write a transformation connector
that would pull out the title; the Tika transformation connector may in
fact do that for you already, but I don't know for sure.
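
As a rough illustration of the kind of title extraction such a
transformation step would perform, here is a minimal sketch in Python using
only the standard library.  It is not ManifoldCF code and does not use the
ManifoldCF connector API; the names TitleExtractor and extract_title are
hypothetical, and it assumes the page content is already available as a
decoded string.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)


def extract_title(html):
    """Return the page title as a string, or None if there is no <title>."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title
```

A real transformation connector would then attach the returned value to the
document as a metadata field (for example "title") before it reaches the
output connector.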

Thanks,
Karl


On Tue, Dec 1, 2015 at 1:32 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Stephen,
>
> The ManifoldCF web connector captures all html content in the body part of
> an html page, but it does not attempt to separate title content into
> specific title metadata at this time.  This is, however, not particularly
> hard to do, if I recall correctly, but I'd have to look into it in more
> detail before I could be certain.
>
> Thanks,
> Karl
>
>
> On Tue, Dec 1, 2015 at 1:09 PM, Corey, Stephen <COREYS@ecu.edu> wrote:
>
>> Thanks Karl!
>>
>>
>>
>> After creating a new mapping in ES, specifying the ‘file’ field as an
>> attachment, I can now search the full text of the web content. That part is
>> working great now.
>>
>>
>>
>> Does MCF capture the page title (in the <title> tag) anywhere?
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, December 1, 2015 11:00 AM
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: ManifoldCF and ElasticSearch
>>
>>
>>
>> Hi Stephen,
>>
>>
>>
>> The integration with ES is supposed to go through the mapper-attachment
>> plugin, which at one point did accept Base64-encoded "attachments" and
>> index them.  This is what's currently implemented in the ElasticSearch
>> output connector.
>>
>>
>>
>> Unfortunately, however, with ElasticSearch, the level of backwards
>> compatibility isn't always what we'd like, so I wouldn't be surprised if
>> something changed or if you needed special configuration now to do it that
>> way.  I've been unable to keep up with what ES is doing but I'm happy to
>> make changes to the output connector if you have information that the
>> current implementation is incorrect, and have details about how to make it
>> work properly in a standard, modern ES environment.  But I'd start by
>> making sure there's actually something broken by looking at the
>> mapper-attachment plugin.
>>
>>
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>>
>> On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <COREYS@ecu.edu> wrote:
>>
>> I’m putting together a proof-of-concept for crawling our website content
>> with MCF, and indexing it with ES. At a basic level, everything seems to be
>> working. What I’m trying to understand is that when MCF indexes web
>> content, the HTML is stored inside an object called file in a property
>> called _content. When this is added to the ES index, all the HTML is Base64
>> encoded. I believe this is preventing ES from properly searching the field.
>>
>>
>>
>> Is this Base64 encoding to be expected, or do I need to change something?
>>
>>
>>
>> Does anyone have a walkthrough of using MCF to crawl web content, and
>> output to ES? I’ve seen many guides for both systems, but never
>> something that combines the two. I’d prefer to avoid using Nutch for
>> crawling, since it lacks any UI for management.
>>
>>
>>
>>
>>
>> Stephen Corey
>>
>> Technology Consultant
>> East Carolina University
>>
>> coreys@ecu.edu
>>
>>
>>
>>
>>
>
>
