Message-ID: <4EDC44D3.2090903@ogis-ri.co.jp>
Date: Mon, 05 Dec 2011 13:13:07 +0900
From: Hitoshi Ozawa
To: connectors-user@incubator.apache.org
Subject: Re: Export crawled URLs

"The interpretation of this field will differ from connector to connector".

From the above description, it seems the content of entityid depends on which connector is being used to crawl the web pages.

You're right about the second point on the entityid column datatype. In MySQL, which I'm using with ManifoldCF, the datatype of entityid is LONGTEXT.

I was just using it figuratively, even though I just found out that I can actually execute the SQL statement. :-)

Cheers,
H.Ozawa

(2011/12/05 10:29), Karl Wright wrote:
> Well, the history comes from the repohistory table, yes - but you may
> not be able to construct a query with entityid=jobs.id, first of all
> because that is incorrect (what the entity field contains is dependent
> on the activity type), and secondly because that column is
> potentially long and only some kinds of queries can be done against
> it. Specifically it cannot be built into an index on PostgreSQL.
>
> Karl
>
> On Sun, Dec 4, 2011 at 7:50 PM, Hitoshi Ozawa wrote:
>
>> Is "history" just entries in the "repohistory" table with entityid =
>> jobs.id?
>>
>> H.Ozawa
>>
>> (2011/12/03 1:43), Karl Wright wrote:
>>
>>> The best place to get this from is the simple history. A command-line
>>> utility to dump this information to a text file should be possible
>>> with the currently available interface primitives. If that is how you
>>> want to go, you will need to run ManifoldCF in multiprocess mode.
>>> Alternatively you might want to request the info from the API, but
>>> that's problematic because nobody has implemented report support in
>>> the API as of now.
>>>
>>> A final alternative is to get this from the log. There is an [INFO]
>>> level line from the web connector for every fetch, I seem to recall,
>>> and you might be able to use that.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Dec 2, 2011 at 11:18 AM, M Kelleher wrote:
>>>
>>>> Is it possible to export / download the list of URLs visited during a
>>>> crawl job?
>>>>
>>>> Sent from my iPad
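
For anyone reading this thread later: the idea discussed above (pulling the visited URLs out of the repohistory table by activity type rather than by entityid = jobs.id) could look roughly like the sketch below. It is only an illustration, not ManifoldCF's own tooling. The thread names only the repohistory table and its entityid column; the column names `owner` and `activitytype`, the activity name `fetch`, and the connection name `web` are assumptions, and an in-memory SQLite table stands in for the real MySQL or PostgreSQL database.

```python
import sqlite3


def export_fetched_urls(conn, connection_name, activity="fetch"):
    """Return the sorted, de-duplicated entityid values for one activity type.

    Assumed (not confirmed by the thread): columns `owner` and
    `activitytype`, and the activity name "fetch".
    """
    rows = conn.execute(
        # Filter on the short columns only; entityid is just read back,
        # since the thread notes it may be long and cannot be indexed.
        "SELECT entityid FROM repohistory "
        "WHERE owner = ? AND activitytype = ?",
        (connection_name, activity),
    )
    # De-duplicate and sort on the client side rather than in SQL.
    return sorted({row[0] for row in rows})


def demo_connection():
    """Build a hypothetical stand-in table so the sketch runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repohistory ("
        "id INTEGER PRIMARY KEY, owner TEXT, "
        "activitytype TEXT, entityid TEXT)"
    )
    conn.executemany(
        "INSERT INTO repohistory (owner, activitytype, entityid) "
        "VALUES (?, ?, ?)",
        [
            ("web", "fetch", "http://example.com/a"),
            ("web", "fetch", "http://example.com/"),
            ("web", "fetch", "http://example.com/a"),    # refetched URL
            ("web", "process", "http://example.com/b"),  # other activity
            ("other", "fetch", "http://example.org/"),   # other connection
        ],
    )
    return conn


if __name__ == "__main__":
    # Write one URL per line, as a text-file export would.
    for url in export_fetched_urls(demo_connection(), "web"):
        print(url)
```

Selecting by activity type follows Karl's point that the meaning of entityid depends on the activity; against a real database the table layout should be checked first, since the names used here are guesses.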