From: Gustavo Beneitez
Date: Thu, 26 Jul 2018 12:20:33 +0200
Subject: Re: Create a new ACTIVITY_FETCH from a transformation
To: dev@manifoldcf.apache.org

Thanks, I suspected that while I was reviewing the code, but I was hoping
there was an alternative :)

Regards.

On Thu, 26 Jul 2018 at 12:11, Karl Wright wrote:

> ManifoldCF has the concept of "compound document", but all the independent
> "components" of the document must be identified at the root level (that is,
> in the Repository Connector).
>
> I'm therefore afraid there is no good mapping from ManifoldCF concepts to
> what you want to do without writing your own Repository Connector.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 5:06 AM Gustavo Beneitez <
> gustavo.beneitez@gmail.com>
> wrote:
>
> > Hi Karl,
> >
> > I made a quick picture of what I really need (attached).
> >
> > Certain URLs coming from the repository could be split into two: URL1
> > and URL2.
> >
> > The normal flow acts as if only one URL is present, but by writing a
> > new transform I could also detect that there is another one: URL2.
> > My question now is: "Well, I have URL2; how can I inject it into the
> > flow so that it becomes a new URL from the repository (and is then
> > fetched, processed and ingested like the others)?"
> >
> > Thanks.
> >
> >
> >
> > On Thu, 26 Jul 2018 at 0:35, Karl Wright wrote:
> >
> >> The crawled URL is transmitted as part of the RepositoryDocument object
> >> to the output connector. If this is going to Solr, it's used as the
> >> document's ID. You can therefore customize Solr (or ElasticSearch) to
> >> extract the data you need at the indexing end.
> >>
> >> If this doesn't make any sense to you, then please be more specific
> >> about what the disposition of each crawled document is.
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <
> >> gustavo.beneitez@gmail.com>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I need to extract and analyse crawled URLs because they may contain
> >> > certain parameters, such as "?redirectURL=", that could point to new
> >> > Documents to be fetched and indexed.
> >> >
> >> > First I tried to create a subclass that extends
> >> > BaseTransformationConnector:
> >> >
> >> > public class RedirectExtractor extends
> >> > org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
> >> >
> >> > and to add a "RedirectExtractor" transformation step to the fetch
> >> > process in ManifoldCF, but it only allows me to modify the current
> >> > Document, not to create a new FETCH from the extracted parameter.
> >> >
> >> > I was investigating the ManifoldCF source code and I found something
> >> > that may come in handy:
> >> >
> >> > activities.recordActivity(null,ACTIVITY_FETCH,
> >> > null,urlValue,Integer.toString(-2),"Robots exclusion",null);
> >> >
> >> > from the IProcessActivity interface, which is used by the Connectors.
> >> > I didn't want to create a new connector since it is a bit complex, but
> >> > do you see an alternative, or is this the only way?
> >> >
> >> > Thanks in advance.
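
The parameter extraction described in the original message can be sketched
in plain Java, independent of ManifoldCF. This is only an illustration under
stated assumptions: the class RedirectUrlExtractor and its method
extractRedirectUrl are hypothetical names, not part of ManifoldCF, and the
parameter name "redirectURL" is taken from the example in the thread. Per
Karl's reply, the resulting URL would still have to be handed to the crawler
from a custom Repository Connector; that step is not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.util.Optional;

// Hypothetical helper (not a ManifoldCF class): pulls the target of a
// "redirectURL" query parameter out of a crawled URL.
final class RedirectUrlExtractor {
    // Parameter name assumed from the example in the thread.
    private static final String PARAM = "redirectURL";

    private RedirectUrlExtractor() {}

    // Returns the decoded redirect target, or empty if the URL cannot be
    // parsed or carries no non-empty redirectURL parameter.
    static Optional<String> extractRedirectUrl(String crawledUrl) {
        final String rawQuery;
        try {
            rawQuery = new URI(crawledUrl).getRawQuery();
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
        if (rawQuery == null) {
            return Optional.empty();
        }
        // Split on '&' before decoding, so an encoded '&' inside the
        // redirect target is not mistaken for a parameter separator.
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue;
            }
            Optional<String> name = decode(pair.substring(0, eq));
            if (name.isPresent() && PARAM.equals(name.get())) {
                return decode(pair.substring(eq + 1)).filter(v -> !v.isEmpty());
            }
        }
        return Optional.empty();
    }

    // Percent-decodes one query component; empty on malformed escapes.
    private static Optional<String> decode(String component) {
        try {
            return Optional.of(URLDecoder.decode(component, "UTF-8"));
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

For example, extractRedirectUrl applied to
"http://example.com/login?redirectURL=http%3A%2F%2Fexample.com%2Fdoc.pdf"
yields the decoded target URL, while a URL without the parameter yields an
empty Optional. A redirect target containing an unencoded "&" would be cut
off at that character, which is a limitation of this sketch.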