manifoldcf-user mailing list archives

From K McGonigal <kmcgon...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Tue, 16 Aug 2011 19:57:58 GMT
Yes, this agrees. Thank you for all your help and patience.

Kate

On Tue, Aug 16, 2011 at 4:44 AM, Karl Wright <daddywri@gmail.com> wrote:

> Using your twitter RSS feed, dechromed mode="description", and chromed
> mode="skip", and turning off robots exclusion, I get a number of
> indexing operations. The following Solr log output corresponds to one
> such:
>
> INFO: {add=[http://twitter.com/DraRositaperez/statuses/103103998965456896]} 0 2
> Aug 16, 2011 5:28:52 AM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416883000} status=0 QTime=2
>
> The document's source, title, and pubdate seem to all be set.  The
> feed's "description" field is the actual content that is being indexed
> into Solr, so that is not present in the Solr url but should be
> present in the post data.  So the only question, then, is the
> "summary" field.  Looking at the feed itself, I see <title> fields and
> <description> fields, but no <content> fields, so it makes sense that
> there would be no summary metadata.
>
> Hope this helps.  Does this agree with what you are seeing?
> Karl
>
> >
> > For the rest, I suspect that you have been running the same job over
> > and over again to get the results you describe.  However, you should
> > be aware that ManifoldCF is an incremental crawler.  It will NOT
> > reindex content that has not changed between job runs.
> >
> > So the only result that is definitely weird is:
> >
> >> case 4)  "Dechromed content, if present, in 'description' field" and "Never use chromed content"
> >>                      --> Ingests but both "description" and "summary" fields ARE EMPTY in Solr
> >
>
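
The metadata Karl reads off the logged /update/extract request (source, id, title, pubdate) can also be pulled out of the params string with a short script. The following is only a minimal sketch: the function names are made up for illustration, the params string is copied from the log entry quoted above with the line wraps removed, and treating literal.pubdate as milliseconds since the Unix epoch is an assumption based on the size of the number, not something the thread states.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Parameter string copied from the quoted Solr log entry (line wraps removed).
LOGGED_PARAMS = (
    "literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter"
    "&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896"
    "&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria"
    "+are+the+number-one+cause+of+food-related+gastrointestinal+illness..."
    "+http://t.co/0Bk8mTm"
    "&literal.pubdate=1313416883000"
)


def parse_logged_params(params):
    """Return the literal.* fields of a logged params string as a dict.

    Hypothetical helper: the 'literal.' prefix is stripped from each key
    and only the first value of each parameter is kept.
    """
    fields = {}
    for key, values in parse_qs(params).items():
        if key.startswith("literal."):
            fields[key[len("literal."):]] = values[0]
    return fields


def pubdate_to_utc(pubdate_ms):
    """Convert a pubdate value to a UTC datetime.

    Assumption: the value is milliseconds since the Unix epoch.
    """
    return datetime.fromtimestamp(int(pubdate_ms) / 1000, tz=timezone.utc)


if __name__ == "__main__":
    fields = parse_logged_params(LOGGED_PARAMS)
    for name, value in fields.items():
        print(f"{name}: {value}")
    print("pubdate (UTC):", pubdate_to_utc(fields["pubdate"]).isoformat())
    # Neither 'description' nor 'summary' appears here: per the message above,
    # the description travels as the posted content and the feed has no
    # <content> element to supply a summary.
    print("summary present:", "summary" in fields)
```

Run against the logged request, this lists only source, id, title and pubdate, which matches the reading above that the description is sent as post data and no summary metadata is produced for this feed.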
