Date: Tue, 15 Mar 2011 02:57:22 -0400
Subject: Re: SOLR
From: Karl Wright
To: connectors-user@incubator.apache.org

No, all retrieval is being done by ManifoldCF. Solr Cell does not
retrieve it.

The RSS connector retrieves content from an RSS feed, which is
basically a bunch of references. The feed itself is not indexed, but
the documents it refers to are. If those documents, when you bring
them up in a browser, have login and navigation information, you may
well see these in the index.

The RSS connector can be configured to just index the document's
description or content information from the feed, and not the document
itself, but that requires you to change one of the settings for the
job.

Karl

On Tue, Mar 15, 2011 at 12:37 AM, Fuad Efendi wrote:
> Hi Karl,
>
> My only guess is we submit URI of a document to SOLR Cell, and Solr Cell
> retrieves it from Internet (using probably HttpClient and "may be" using own
> Robot signature?)
> Even in case of RSS...
> Only this can explain why I have "navigation" and "login" in SOLR index...
>
> Am I right?
>
>
> Thanks
>
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: March-15-11 12:26 AM
> To: connectors-user@incubator.apache.org
> Subject: RE: SOLR
>
> UPDATE:
> SOLR 1.4.1 (june-2010) works fine with ManifoldCF trunk.
> SOLR trunk doesn't work, and I suspect bugs in TIKA...
>
> But it is strange :)
>
> I am looking at SOLR, each document contains huge array of "links",
> including many links to Yahoo login... something weird (it doesn't look like
> RSS)... but searchable.
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: March-14-11 10:50 PM
> To: 'connectors-user@incubator.apache.org'
> Subject: RE: SOLR
>
>
> I just noticed:
> Currently, default for ManifoldCF is /update/extract, which corresponds to
> SOLR Cell request handler.
>
> So...
> It is EXTREMELY generic...
> http://wiki.apache.org/solr/ExtractingRequestHandler
>
>
>
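
For readers of this thread, the division of labour described in the reply at the top (the crawler fetches each document and pushes its bytes to the /update/extract handler; Solr Cell fetches nothing itself) might be sketched roughly as below. This is only an illustration, not ManifoldCF's actual code: the function name `build_extract_url`, the example base URL, and the choice to pass the document URI as `literal.id` are all assumptions made for the sketch.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_extract_url(solr_base, doc_uri, commit=False):
    """Build the Solr Cell URL that a crawler would POST document bytes to.

    Hypothetical helper: the crawler has already downloaded the document
    at doc_uri. The URI travels only as an identifier (literal.id here,
    which is an assumption); Solr is not asked to fetch it.
    """
    params = {"literal.id": doc_uri}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Example values only; no network request is made here.
    url = build_extract_url(
        "http://localhost:8983/solr",
        "http://example.com/feed-item.html",
        commit=True,
    )
    print(url)
    # The query string round-trips back to the original document URI.
    print(parse_qs(urlsplit(url).query)["literal.id"][0])
```

In such a setup the body of the POST would be whatever bytes the crawler downloaded, so any login or navigation markup in the fetched page reaches the extraction step unchanged.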