manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Tue, 16 Aug 2011 09:44:05 GMT
Using your Twitter RSS feed with dechromed mode="description" and
chromed mode="skip", and with robots exclusion turned off, I get a
number of indexing operations. The following Solr log output
corresponds to one of them:

INFO: {add=[http://twitter.com/DraRositaperez/statuses/103103998965456896]} 0 2
Aug 16, 2011 5:28:52 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract
params={literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal+illness...+http://t.co/0Bk8mTm&literal.pubdate=1313416883000}
status=0 QTime=2

The document's source, title, and pubdate all appear to be set.  The
feed's "description" field is the actual content that is being indexed
into Solr, so it is not present in the Solr URL but should be present
in the post data.  So the only question, then, is the "summary" field.
Looking at the feed itself, I see <title> fields and <description>
fields, but no <content> fields, so it makes sense that there would be
no summary metadata.
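
To double-check which metadata fields were actually sent, the params
string from the log line above can be decoded.  The following is only a
rough sketch: the helper name literal_fields is made up for
illustration, and it assumes that the pubdate value is milliseconds
since the Unix epoch, which the log itself does not state.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Query string copied from the params={...} portion of the Solr log line.
PARAMS = (
    "literal.source=http://search.twitter.com/search.rss?q%3DCampylobacter"
    "&literal.id=http://twitter.com/DraRositaperez/statuses/103103998965456896"
    "&literal.title=RT+@MicrobeWorld:+Campylobacter+bacteria:+Campylobacter"
    "+bacteria+are+the+number-one+cause+of+food-related+gastrointestinal"
    "+illness...+http://t.co/0Bk8mTm"
    "&literal.pubdate=1313416883000"
)


def literal_fields(params):
    """Hypothetical helper: map each literal.* parameter to its decoded value."""
    parsed = parse_qs(params, keep_blank_values=True)
    return {
        key[len("literal."):]: values[0]
        for key, values in parsed.items()
        if key.startswith("literal.")
    }


fields = literal_fields(PARAMS)

# Assumption: pubdate is expressed in milliseconds since the Unix epoch.
pubdate = datetime.fromtimestamp(int(fields["pubdate"]) / 1000, tz=timezone.utc)

# List the fields of interest; "summary" is not among the parameters sent.
for name in ("source", "id", "title", "pubdate", "summary"):
    print(name, "->", fields.get(name, "<not sent>"))
print("pubdate (UTC):", pubdate.isoformat())
```

Only source, id, title, and pubdate come back from the decoded
parameters; there is no literal.summary among them.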

Hope this helps.  Does this agree with what you are seeing?
Karl

>
> For the rest, I suspect that you have been running the same job over
> and over again to get the results you describe.  However, you should
> be aware that ManifoldCF is an incremental crawler.  It will NOT
> reindex content that has not changed between job runs.
>
> So the only result that is definitely weird is:
>
>> case 4)  "Dechromed content, if present, in 'description' field" and "Never
>> use chromed content"
>>                      --> Ingests but both "description" and "summary" fields
>> ARE EMPTY in Solr
>
