manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF and ElasticSearch
Date Tue, 01 Dec 2015 23:45:20 GMT
Base64 encoding is fine when used with the mapper-attachment plugin.  And yes,
I recommend the Tika transformer.
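
For illustration, here is a minimal sketch of the document shape discussed
in this thread: the HTML sits Base64-encoded in a property called _content
inside an object called file, and the ES mapping declares file as an
attachment so the plugin can decode and index it. This is not the output
connector's actual code. The function names and the sample HTML are
hypothetical, and the mapping fragment assumes the mapper-attachment
plugin's "attachment" field type.

```python
import base64
import json


def build_attachment_document(html: str) -> dict:
    """Wrap raw HTML as described in the thread: file._content holds Base64 text."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return {"file": {"_content": encoded}}


def decode_attachment_document(doc: dict) -> str:
    """Reverse the encoding, recovering the original HTML from file._content."""
    return base64.b64decode(doc["file"]["_content"]).decode("utf-8")


# Assumed mapping fragment: declares 'file' as an attachment field so the
# plugin, rather than the client, decodes the Base64 payload and indexes it.
ATTACHMENT_MAPPING = {"properties": {"file": {"type": "attachment"}}}


if __name__ == "__main__":
    # Hypothetical sample page, used only to show the round trip.
    html = "<html><head><title>Example</title></head><body>Hello</body></html>"
    doc = build_attachment_document(html)
    print(json.dumps(doc))
    print(json.dumps(ATTACHMENT_MAPPING))
    assert decode_attachment_document(doc) == html
```

Without a mapping like this, ES would presumably treat the Base64 text as an
ordinary string, which would explain why full-text search did not work until
the field was mapped as an attachment.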
Thanks,
Karl


On Tue, Dec 1, 2015 at 6:42 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com>
wrote:

> What about https://issues.apache.org/jira/browse/CONNECTORS-1234 to
> avoid Base64 encoding?
> If you want to capture the title of html, you could get from Tika
> transformation connector, since Tika will extract metadata such as a
> title.
>
> Shinichiro Abe
>
> 2015-12-02 3:36 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > I take that back; the only parsing that is done is in the context of
> > determining login pages as part of login sequences.  So content is not
> > parsed at all; it's sent to the output connector intact, along with HTML
> > headers as metadata.  You can, of course, write a transformation connector
> > that would pull out the title; the Tika transformation connector may in fact
> > do that for you already, but I don't know for sure.
> >
> > Thanks,
> > Karl
> >
> >
> > On Tue, Dec 1, 2015 at 1:32 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> Hi Stephen,
> >>
> >> The ManifoldCF web connector captures all html content in the body part of
> >> an html page, but it does not attempt to separate title content into
> >> specific title metadata at this time.  This is, however, not particularly
> >> hard to do, if I recall correctly, but I'd have to look into it in more
> >> detail before I could be certain.
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >> On Tue, Dec 1, 2015 at 1:09 PM, Corey, Stephen <COREYS@ecu.edu> wrote:
> >>>
> >>> Thanks Karl!
> >>>
> >>>
> >>>
> >>> After creating a new mapping in ES, specifying the ‘file’ field as an
> >>> attachment, I can now search the full text of the web content. That part is
> >>> working great now.
> >>>
> >>>
> >>>
> >>> Does MCF capture the page title (in the <title> tag) anywhere?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Karl Wright [mailto:daddywri@gmail.com]
> >>> Sent: Tuesday, December 1, 2015 11:00 AM
> >>> To: user@manifoldcf.apache.org
> >>> Subject: Re: ManifoldCF and ElasticSearch
> >>>
> >>>
> >>>
> >>> Hi Stephen,
> >>>
> >>>
> >>>
> >>> The integration with ES is supposed to go through the mapper-attachment
> >>> plugin, which at one point did accept Base64-encoded "attachments" and index
> >>> them.  This is what's currently implemented in the ElasticSearch output
> >>> connector.
> >>>
> >>>
> >>>
> >>> Unfortunately, however, with ElasticSearch, the level of backwards
> >>> compatibility isn't always what we'd like, so I wouldn't be surprised if
> >>> something changed or if you needed special configuration now to do it that
> >>> way.  I've been unable to keep up with what ES is doing, but I'm happy to
> >>> make changes to the output connector if you have information that the
> >>> current implementation is incorrect, and have details about how to make it
> >>> work properly in a standard, modern ES environment.  But I'd start by
> >>> making sure there's actually something broken by looking at the
> >>> mapper-attachment plugin.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>> Karl
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Dec 1, 2015 at 10:17 AM, Corey, Stephen <COREYS@ecu.edu> wrote:
> >>>
> >>> I’m putting together a proof-of-concept for crawling our website content
> >>> with MCF, and indexing it with ES. At a basic level, everything seems to be
> >>> working. What I’m trying to understand is that when MCF indexes web content,
> >>> the HTML is stored inside an object called file in a property called
> >>> _content. When this is added to the ES index, all the HTML is Base64
> >>> encoded. I believe this is preventing ES from properly searching the field.
> >>>
> >>>
> >>>
> >>> Is this Base64 encoding to be expected, or do I need to change something?
> >>>
> >>>
> >>>
> >>> Does anyone have a walkthrough of using MCF to crawl web content and
> >>> output to ES? I’ve seen many guides for both systems, but never
> >>> something that combines the two. I’d prefer to avoid using Nutch for
> >>> crawling, since it lacks any UI for management.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Stephen Corey
> >>>
> >>> Technology Consultant
> >>> East Carolina University
> >>>
> >>> coreys@ecu.edu
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
>
