manifoldcf-dev mailing list archives

From Gustavo Beneitez <gustavo.benei...@gmail.com>
Subject Re: Create a new ACTIVITY_FETCH from a transformation
Date Thu, 26 Jul 2018 10:20:33 GMT
Thanks, I suspected that while I was reviewing the code but I was hoping
there was an alternative :)

Regards.

On Thu, Jul 26, 2018 at 12:11, Karl Wright (<daddywri@gmail.com>) wrote:

> ManifoldCF has the concept of "compound document", but all the independent
> "components" of the document must be identified at the root level (that is,
> in the Repository Connector).
>
> I'm therefore afraid there is no good mapping from ManifoldCF concepts to
> what you want to do without writing your own Repository Connector.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 5:06 AM Gustavo Beneitez <gustavo.beneitez@gmail.com> wrote:
>
> > Hi Karl,
> >
> > I made a quick picture of what I really need (attached).
> >
> > Certain URLs coming from the repository could be split into two: URL1 and
> > URL2.
> >
> > The normal flow acts as if only one URL is present, but by writing a new
> > transformation I could also detect that there is another one: URL2.
> > My question now is: "Well, I have URL2; how can I inject it into the
> > flow so that it becomes a new URL from the repository (and is then fetched,
> > processed and ingested like the others)?"
> >
> > Thanks.
> >
> >
> >
> > On Thu, Jul 26, 2018 at 0:35, Karl Wright (<daddywri@gmail.com>) wrote:
> >
> >> The crawled URL is transmitted as part of the RepositoryDocument object to
> >> the output connector.  If this is going to Solr, it's used as the
> >> document's ID.  You can therefore customize Solr (or ElasticSearch) to
> >> extract the data you need at the indexing end.
> >>
> >> If this doesn't make any sense to you, then please be more specific about
> >> what the disposition of each crawled document is.
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <gustavo.beneitez@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I need to extract and analyse crawled URLs because they may contain certain
> >> > parameters such as "?redirectURL=" that could point to new documents to be
> >> > fetched and indexed.
> >> >
> >> > First I tried to create a transformation connector as a subclass:
> >> >
> >> > public class RedirectExtractor extends
> >> >     org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
> >> >
> >> > and add a "RedirectExtractor" transformation step to the fetch process in
> >> > ManifoldCF, but it only allows me to modify the current document, not to
> >> > create a new FETCH from the extracted parameter.
> >> >
> >> > I was investigating the ManifoldCF source code and I found something that
> >> > may come in handy:
> >> >
> >> > activities.recordActivity(null,ACTIVITY_FETCH,
> >> >                 null,urlValue,Integer.toString(-2),"Robots exclusion",null);
> >> >
> >> > from the IProcessActivity interface, which is used by the connectors. I
> >> > didn't want to create a new connector since it is a bit complex, but do you
> >> > see an alternative, or is this the only way?
> >> >
> >> > Thanks in advance.
> >> >
> >>
> >
>
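
For reference, the URL-analysis step described in the original message (finding a "?redirectURL=" parameter in a crawled URL) can be sketched independently of ManifoldCF. This is only a minimal illustration using the standard Java library: the class name RedirectUrlExtractor, the method extractRedirectUrl, and the example URLs are hypothetical and do not come from the thread or from ManifoldCF. Per Karl's reply, queuing the extracted URL as a new document would still have to happen inside a custom Repository Connector; that part is not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.util.Optional;

// Hypothetical helper, not part of ManifoldCF.
final class RedirectUrlExtractor {
    // Parameter name taken from the "?redirectURL=" example in the thread.
    private static final String PARAM = "redirectURL";

    // Returns the decoded value of the redirectURL query parameter, if present.
    static Optional<String> extractRedirectUrl(String crawledUrl) {
        final String rawQuery;
        try {
            rawQuery = new URI(crawledUrl).getRawQuery();
        } catch (URISyntaxException e) {
            return Optional.empty(); // not a parseable URL
        }
        if (rawQuery == null) {
            return Optional.empty(); // no query string at all
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0 || !PARAM.equals(pair.substring(0, eq))) {
                continue; // some other parameter
            }
            try {
                // Percent-decode the value; the first match wins.
                String value = URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                return Optional.empty(); // malformed percent-escape
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up example URL for illustration only.
        String url = "http://example.com/login?redirectURL=http%3A%2F%2Fexample.com%2Fdoc.pdf";
        System.out.println(extractRedirectUrl(url).orElse("(no redirect parameter)"));
    }
}
```

Returning an Optional keeps the "no second URL" case explicit, so a caller only has to act when a redirect target was actually found.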
