manifoldcf-user mailing list archives

From K McGonigal <kmcgon...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Wed, 17 Aug 2011 14:00:44 GMT
Thanks Karl.  But it looks to me like all the documents are the same size in
both runs. They are just indexed in a different order (for some unknown
reason).
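
One quick way to check this against the simple history quoted below is to compare the byte counts per URL across the two runs. The following is only a minimal sketch: the two dictionaries are transcribed by hand from the "fetch" rows of that history, in the order the fetches happened, and the function name `size_changes` is made up for illustration. Equal byte counts do not prove that the content is identical, only that the sizes match.

```python
# Bytes reported by the "fetch" rows of the simple history quoted below,
# keyed by the trailing part of each URL under
# http://www.onemansjazz.ca/content/view/ and listed in fetch order.
first_run = {   # job started 20:36:10
    "333/30": 17606, "330/50": 22605, "329/30": 17105, "331/30": 16980,
    "336/30": 17473, "332/30": 17083, "334/30": 16718,
}
second_run = {  # job started 20:37:20
    "329/30": 17105, "336/30": 17473, "332/30": 17083, "333/30": 17606,
    "330/50": 22605, "334/30": 16718, "331/30": 16980,
}


def size_changes(before, after):
    """Return {doc: (old_size, new_size)} for documents whose size differs,
    or that appear in only one of the two runs."""
    changed = {}
    for doc in sorted(set(before) | set(after)):
        old, new = before.get(doc), after.get(doc)
        if old != new:
            changed[doc] = (old, new)
    return changed


changes = size_changes(first_run, second_run)
same_order = list(first_run) == list(second_run)

print("size changes:", changes)
print("same fetch order:", same_order)
```

With the numbers from the history, this reports no size changes and a different fetch order.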

Kate


On Tue, Aug 16, 2011 at 7:44 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Kate,
>
> I ran a job based on the same feed twice.  Here are the results, from the
> simple history:
>
> Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
> 08-16-2011 20:38:10.924 | job end | 1313541280969(jazz) | | 0 | 1 |
> 08-16-2011 20:37:57.179 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 18 |
> 08-16-2011 20:37:56.241 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 905 |
> 08-16-2011 20:37:52.117 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 15 |
> 08-16-2011 20:37:51.241 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 839 |
> 08-16-2011 20:37:47.292 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 19 |
> 08-16-2011 20:37:46.241 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 1003 |
> 08-16-2011 20:37:42.149 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 19 |
> 08-16-2011 20:37:41.241 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 887 |
> 08-16-2011 20:37:37.165 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 20 |
> 08-16-2011 20:37:36.241 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 898 |
> 08-16-2011 20:37:32.783 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 19 |
> 08-16-2011 20:37:31.241 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 922 |
> 08-16-2011 20:37:27.191 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 52 |
> 08-16-2011 20:37:26.241 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 912 |
> 08-16-2011 20:37:21.241 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 542 |
> 08-16-2011 20:37:20.970 | job start | 1313541280969(jazz) | | 0 | 1 |
> 08-16-2011 20:37:00.893 | job end | 1313541280969(jazz) | | 0 | 1 |
> 08-16-2011 20:36:49.123 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 17 |
> 08-16-2011 20:36:48.076 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 1028 |
> 08-16-2011 20:36:44.305 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 34 |
> 08-16-2011 20:36:43.076 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 1208 |
> 08-16-2011 20:36:39.175 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 23 |
> 08-16-2011 20:36:38.076 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 1087 |
> 08-16-2011 20:36:33.983 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 24 |
> 08-16-2011 20:36:33.076 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 896 |
> 08-16-2011 20:36:29.297 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 24 |
> 08-16-2011 20:36:28.774 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 35 |
> 08-16-2011 20:36:28.076 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 1204 |
> 08-16-2011 20:36:23.076 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 5679 |
> 08-16-2011 20:36:21.130 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 418 |
> 08-16-2011 20:36:18.076 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 2969 |
> 08-16-2011 20:36:13.094 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 1945 |
> 08-16-2011 20:36:10.870 | job start | 1313541280969(jazz) | | 0 | 1 |
>
> Note that on each run, the size of each document being indexed changes.
> This is likely due to "chrome" (advertisements, etc.) which are dynamically
> delivered by the site in a random way.  The RSS connector will, of course,
> not be able to recognize that the content you are interested in hasn't
> changed, because as far as it can tell it *has*.
>
> This is very different from the case where you use the "dechromed"
> content based on the "description" field, because it is the actual feed
> description field that is indexed, not the document contents, and therefore
> no chrome will be present.  Thus you are more likely to see repeated runs of
> a job index nothing if the job has a "dechromed" content mode set.
>
> Karl
>
>
>
> On Tue, Aug 16, 2011 at 5:07 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>
>> Hmm. I will keep this in mind, but I'm confused again. I just ran this job
>> twice in a row and pretty much the same thing was sent to Solr.  The same
>> number of items (7) were "add"ed. I think they were the same items, just in
>> a different order. The second run also deleted an item from Solr that was
>> not in the RSS document.  I'm pretty sure the RSS feed document or the
>> linked documents did not change.
>>
>> A snippet from the first run:
>>
>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
>>> 16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=16
>>> 16-Aug-2011 3:18:13 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>>>
>>
>> A snippet from the second run:
>>
>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
>>> 16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
>>> INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=News+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluation+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+arisen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Click+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://www.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&literal.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+Survey&literal.pubdate=1310475289000} status=0 QTime=15
>>> 16-Aug-2011 3:28:00 PM org.apache.solr.update.processor.LogUpdateProcessor finish
>>>
>>
>> I think they are identical.
>>
>>
>> View a Job
>>>  ------------------------------
>>>  Name: OMJ
>>> ------------------------------
>>>  Output connection: Solr
>>>  Repository connection: RSS
>>> ------------------------------
>>>  Priority: 5
>>>  Start method: Don't automatically start
>>> ------------------------------
>>>  Schedule type: Scan every document once
>>>  Minimum recrawl interval: Not applicable
>>>  Expiration interval: Not applicable
>>>  Reseed interval: Not applicable
>>> ------------------------------
>>>  No scheduled run times
>>> ------------------------------
>>>  Field mappings (Metadata field name, Solr field name): No field mapping specified
>>> ------------------------------
>>>  RSS urls: http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
>>> ------------------------------
>>>  No url canonicalization specified; will reorder all urls and remove all sessions
>>> ------------------------------
>>>  No mappings specified; will accept all urls
>>> ------------------------------
>>>  Feed connection timeout (seconds): 60
>>>  Default feed rescan interval (minutes): 60
>>>  Minimum feed rescan interval (minutes): 15
>>>  Bad feed rescan interval (minutes): (Default feed rescan value)
>>> ------------------------------
>>>  Dechromed content source: none
>>>  Chromed content: none
>>> ------------------------------
>>>  No access tokens specified
>>> ------------------------------
>>>  No metadata specified
>>
>>
>>
>> View Repository Connection Status
>>  ------------------------------
>>  Name: RSS
>>  Description:
>>  ------------------------------
>>  Connection type: RSS
>>  Max connections: 10
>>  Authority: None (global authority)
>>  ------------------------------
>>  Throttling (Bin regular expression, Description, Max avg fetches/min): No throttles
>>  ------------------------------
>>  Parameters:
>>    Proxy port=
>>    Proxy authentication password=********
>>    Max server connections=2
>>    Proxy host=
>>    KB per second=64
>>    Robots usage=none
>>    Proxy authentication user name=
>>    Max fetches per minute=12
>>    Email address=kmcgoniga@gmail.com
>>    Proxy authentication domain=
>>    Throttle group=
>>  ------------------------------
>>  Connection status: Connection working
>>
>>
>
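
Karl's point about "chrome" in the quoted message above can be illustrated with a small sketch. It assumes, purely for illustration, that a crawler decides whether to reindex by fingerprinting whatever text it was told to index. The thread does not show how the ManifoldCF RSS connector actually computes document versions, so `version_of`, `render_page`, the sample description, and the ad strings are all hypothetical stand-ins. Whether this is what happened in the two runs above is exactly what the thread is debating.

```python
import hashlib


def version_of(text):
    """Fingerprint a piece of text; a changed fingerprint means 'reindex'."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Made-up stand-in for the feed's description field.
description = "Listener Survey: please take a few minutes to complete it."


def render_page(ad):
    """Hypothetical chromed page: the same article wrapped in a rotating ad."""
    return "<html><div class='ad'>" + ad + "</div><p>" + description + "</p></html>"


# Two fetches of the same article, each served with a different ad.
page_run_1 = render_page("Ad A")
page_run_2 = render_page("Ad B")

# Fingerprinting the full page: the ad alone makes the document look new.
chromed_changed = version_of(page_run_1) != version_of(page_run_2)

# Fingerprinting only the description: nothing changed between runs.
dechromed_changed = version_of(description) != version_of(description)

print("chromed page looks changed:", chromed_changed)
print("description looks changed:", dechromed_changed)
```

Under these assumptions the full page looks changed on every run while the description does not, which is why a job that indexes only the description is more likely to index nothing on a repeated run.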
