manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Wed, 17 Aug 2011 00:44:27 GMT
Hi Kate,

I ran a job based on the same feed twice.  Here are the results, from the
simple history:

Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
08-16-2011 20:38:10.924 | job end | 1313541280969(jazz) | | 0 | 1 |
08-16-2011 20:37:57.179 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 18 |
08-16-2011 20:37:56.241 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 905 |
08-16-2011 20:37:52.117 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 15 |
08-16-2011 20:37:51.241 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 839 |
08-16-2011 20:37:47.292 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 19 |
08-16-2011 20:37:46.241 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 1003 |
08-16-2011 20:37:42.149 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 19 |
08-16-2011 20:37:41.241 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 887 |
08-16-2011 20:37:37.165 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 20 |
08-16-2011 20:37:36.241 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 898 |
08-16-2011 20:37:32.783 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 19 |
08-16-2011 20:37:31.241 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 922 |
08-16-2011 20:37:27.191 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 52 |
08-16-2011 20:37:26.241 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 912 |
08-16-2011 20:37:21.241 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 542 |
08-16-2011 20:37:20.970 | job start | 1313541280969(jazz) | | 0 | 1 |
08-16-2011 20:37:00.893 | job end | 1313541280969(jazz) | | 0 | 1 |
08-16-2011 20:36:49.123 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 17 |
08-16-2011 20:36:48.076 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 1028 |
08-16-2011 20:36:44.305 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 34 |
08-16-2011 20:36:43.076 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 1208 |
08-16-2011 20:36:39.175 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 23 |
08-16-2011 20:36:38.076 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 1087 |
08-16-2011 20:36:33.983 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 24 |
08-16-2011 20:36:33.076 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 896 |
08-16-2011 20:36:29.297 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 24 |
08-16-2011 20:36:28.774 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 35 |
08-16-2011 20:36:28.076 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 1204 |
08-16-2011 20:36:23.076 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 5679 |
08-16-2011 20:36:21.130 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 418 |
08-16-2011 20:36:18.076 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 2969 |
08-16-2011 20:36:13.094 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 1945 |
08-16-2011 20:36:10.870 | job start | 1313541280969(jazz) | | 0 | 1 |

Note that on each run, the size of each document being indexed changes.
This is likely due to "chrome" (advertisements, etc.) that the site delivers
dynamically and at random.  The RSS connector will, of course, not be able to
recognize that the content you are interested in hasn't changed, because as
far as it can tell it *has*.

This is very different from the case where you use the "dechromed" content
based on the "description" field.  There, it is the feed's description field
that is indexed, not the document contents, and therefore no chrome will be
present.  Thus you are more likely to see repeated runs of a job index
nothing if the job has a "dechromed" content mode set.
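
To illustrate the difference, here is a minimal Python sketch.  It is not
ManifoldCF's actual change-detection code; the fingerprint() and render_page()
helpers, the sample strings, and the use of SHA-256 are all assumptions made
up for the example.  It only shows why a full page with rotating chrome looks
"changed" on every fetch while the feed's description text does not.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hypothetical stand-in for whatever a crawler uses to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def render_page(article: str, ad: str) -> str:
    """Hypothetical page template: the article wrapped in site chrome."""
    return (
        "<html><body>"
        f"<div class='ad'>{ad}</div>"
        f"<div class='article'>{article}</div>"
        "</body></html>"
    )


# The part you care about is the same on both runs.
article = "I have created a Listener Survey and would appreciate your feedback."

# Full-page mode: the chrome differs between fetches (made-up ad strings).
page_run1 = render_page(article, "ad-A")
page_run2 = render_page(article, "ad-B")

# Dechromed mode: only the feed's description text is considered.
description_run1 = article
description_run2 = article

page_changed = fingerprint(page_run1) != fingerprint(page_run2)
description_changed = fingerprint(description_run1) != fingerprint(description_run2)

print("full page looks changed:  ", page_changed)
print("description looks changed:", description_changed)
```

In this sketch the full page is reported as changed even though the article
text is identical, while the description compares equal across runs.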

Karl


On Tue, Aug 16, 2011 at 5:07 PM, K McGonigal <kmcgoniga@gmail.com> wrote:

> Hmm. I will keep this in mind, but I'm confused again. I just ran this job
> twice in a row and pretty much the same thing was sent to Solr.  The same
> number of items (7) were "add"ed. I think they were the same items, just in
> a different order. The second run also deleted an item from Solr that was
> not in the RSS document.  I'm pretty sure the RSS feed document or the
> linked documents did not change.
>
> A snippet from the first run:
>
>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
>> 16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=16
>> 16-Aug-2011 3:18:13 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>
> A snippet from the second run:
>
>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
>> 16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
>> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=15
>> 16-Aug-2011 3:28:00 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>
> I think they are identical.
>
>
> View a Job
>>
>> Name: OMJ
>> Output connection: Solr
>> Repository connection: RSS
>> Priority: 5
>> Start method: Don't automatically start
>> Schedule type: Scan every document once
>> Minimum recrawl interval: Not applicable
>> Expiration interval: Not applicable
>> Reseed interval: Not applicable
>> No scheduled run times
>> Field mappings (Metadata field name / Solr field name): No field mapping specified
>> RSS urls: http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
>> No url canonicalization specified; will reorder all urls and remove all sessions
>> No mappings specified; will accept all urls
>> Feed connection timeout (seconds): 60
>> Default feed rescan interval (minutes): 60
>> Minimum feed rescan interval (minutes): 15
>> Bad feed rescan interval (minutes): (Default feed rescan value)
>> Dechromed content source: none
>> Chromed content: none
>> No access tokens specified
>> No metadata specified
>
>
>
> View Repository Connection Status
>
> Name: RSS
> Description:
> Connection type: RSS
> Max connections: 10
> Authority: None (global authority)
> Throttling (Bin regular expression / Description / Max avg fetches/min): No throttles
> Parameters:
>   Proxy port=
>   Proxy authentication password=********
>   Max server connections=2
>   Proxy host=
>   KB per second=64
>   Robots usage=none
>   Proxy authentication user name=
>   Max fetches per minute=12
>   Email address=kmcgoniga@gmail.com
>   Proxy authentication domain=
>   Throttle group=
> Connection status: Connection working
>
>
