manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Create a new ACTIVITY_FETCH from a transformation
Date Wed, 25 Jul 2018 22:35:11 GMT
The crawled URL is transmitted as part of the RepositoryDocument object to
the output connector.  If this is going to Solr, it's used as the
document's ID.  You can therefore customize Solr (or ElasticSearch) to
extract the data you need at the indexing end.

If this doesn't make any sense to you, then please be more specific about
what the disposition of each crawled document is.

Thanks,
Karl


On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <gustavo.beneitez@gmail.com>
wrote:

> Hi all,
>
> I need to extract and analyse crawled URLs because they may contain certain
> parameters, such as "?redirectURL=", that could point to new documents to be
> fetched and indexed.
>
> First I was trying to create a subclass that extends
>
> public class RedirectExtractor extends
> org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
>
> and add a "RedirectExtractor" transformation step to the fetch process in
> ManifoldCF, but it only lets me modify the current document, not create
> a new FETCH from the extracted parameter.
>
> I was looking through the ManifoldCF source code and found something that
> may come in handy:
>
> activities.recordActivity(null, ACTIVITY_FETCH,
>                 null, urlValue, Integer.toString(-2), "Robots exclusion", null);
>
> from the IProcessActivity interface, which is used by the connectors. I
> didn't want to create a new connector since it is a bit complex, but do you
> see an alternative, or is this the only way?
>
> Thanks in advance.
>
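
For reference, the parsing step described in the quoted message (pulling a "redirectURL" parameter out of a crawled URL) could be sketched roughly as below. This is only an illustration using the standard Java library (Java 10 or later for the URLDecoder overload); the class and method names are hypothetical and not part of ManifoldCF, and the sketch does not queue anything for fetching. It only recovers the target URL, wherever that logic ends up living.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not a ManifoldCF class.
final class RedirectParam {

    // Parameter name taken from the example in the quoted message.
    private static final String PARAM_NAME = "redirectURL";

    // Returns the decoded value of the first "redirectURL" query parameter,
    // or an empty Optional if the URL has none or cannot be parsed.
    static Optional<String> extractRedirect(String crawledUrl) {
        try {
            String query = URI.create(crawledUrl).getRawQuery();
            if (query == null) {
                return Optional.empty();
            }
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq <= 0) {
                    continue; // no name, or no '=' at all
                }
                String name = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
                if (PARAM_NAME.equals(name)) {
                    return Optional.of(
                        URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8));
                }
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            // Malformed URL or malformed percent-encoding: treat as "no redirect".
            return Optional.empty();
        }
    }
}
```

Parsing the raw query before decoding keeps an encoded "&" or "=" inside the redirect target from being mistaken for a parameter separator.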
