manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Create a new ACTIVITY_FETCH from a transformation
Date Thu, 26 Jul 2018 10:10:46 GMT
ManifoldCF has the concept of "compound document", but all the independent
"components" of the document must be identified at the root level (that is,
in the Repository Connector).

I'm therefore afraid there is no good mapping from ManifoldCF concepts to
what you want to do without writing your own Repository Connector.

Karl
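
For reference, the URL-analysis step discussed in the quoted messages below can be sketched in plain Java. This is only an illustration under stated assumptions: the class name RedirectUrlExtractor and the method extractRedirectTarget are hypothetical, and the parameter name "redirectURL" is taken from the original question. The sketch deliberately contains no ManifoldCF calls. In a custom Repository Connector, the extracted value would presumably be handed to the framework as a new document reference (IProcessActivity appears to offer addDocumentReference for that purpose, but check the Javadoc for your version).

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.Optional;

// Hypothetical helper, not part of ManifoldCF: pulls the target of a
// "redirectURL" query parameter out of a crawled URL.
final class RedirectUrlExtractor {
    // Parameter name taken from the question in this thread.
    private static final String PARAM = "redirectURL";

    private RedirectUrlExtractor() {
    }

    // Returns the decoded value of the redirectURL parameter, or an empty
    // Optional if the URL is malformed, has no query, or lacks the parameter.
    static Optional<String> extractRedirectTarget(String crawledUrl) {
        final String rawQuery;
        try {
            rawQuery = URI.create(crawledUrl).getRawQuery();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a syntactically valid URI
        }
        if (rawQuery == null) {
            return Optional.empty(); // no query string at all
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0 || !PARAM.equals(pair.substring(0, eq))) {
                continue;
            }
            try {
                String value = URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                return Optional.empty(); // malformed percent-encoding
            }
        }
        return Optional.empty();
    }
}
```

Calling extractRedirectTarget on a URL such as http://example.com/login?redirectURL=http%3A%2F%2Fexample.com%2Fdoc.pdf yields the decoded target, which a connector could then queue for fetching. Keeping the parsing separate from the connector makes it easy to unit test before wiring it into ManifoldCF.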


On Thu, Jul 26, 2018 at 5:06 AM Gustavo Beneitez <gustavo.beneitez@gmail.com>
wrote:

> Hi Karl,
>
> I made a quick picture of what I really need (attached)
>
> Certain URLs coming from the repository can be split into two: URL1 and
> URL2.
>
> The normal flow acts as if only one URL is present, but by writing a new
> transformation I can also detect that there is another one: URL2.
> My question now is: "Well, I have URL2; how can I inject it into the
> flow so that it becomes a new URL from the repository (and is then fetched,
> processed and ingested like the others)?"
>
> Thanks.
>
>
>
> On Thu, Jul 26, 2018 at 0:35, Karl Wright (<daddywri@gmail.com>)
> wrote:
>
>> The crawled URL is transmitted as part of the RepositoryDocument object to
>> the output connector.  If this is going to Solr, it's used as the
>> document's ID.  You can therefore customize Solr (or ElasticSearch) to
>> extract the data you need at the indexing end.
>>
>> If this doesn't make any sense to you, then please be more specific about
>> what the disposition of each crawled document is.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <
>> gustavo.beneitez@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > I need to extract and analyse crawled urls because they may contain
>> > certain parameters such as "?redirectURL=" that could point to new
>> > Documents to be fetched and indexed.
>> >
>> > First I tried creating a subclass:
>> >
>> > public class RedirectExtractor extends
>> > org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
>> >
>> > and adding a "RedirectExtractor" transformation step to the fetch
>> > process in ManifoldCF, but that only allows me to modify the current
>> > document, not to create a new FETCH from the extracted parameter.
>> >
>> > I was investigating the ManifoldCF source code and found something
>> > that may come in handy:
>> >
>> > activities.recordActivity(null,ACTIVITY_FETCH,
>> >                 null,urlValue,Integer.toString(-2),"Robots exclusion",null);
>> >
>> > from the IProcessActivity interface, which is used by the connectors. I
>> > didn't want to create a new connector since it is a bit complex. Do you
>> > see an alternative, or is this the only way?
>> >
>> > Thanks in advance.
>> >
>>
>
