Sorry, I misread the report.  It is still true that the pages differ, though.  I captured two fetches of the same document and diffed them:

root@duck96:~# diff file1.txt file2.txt
408c408
< </html><!-- 1313589725 -->
\ No newline at end of file
---
> </html><!-- 1313589820 -->
\ No newline at end of file
root@duck96:~#

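The only difference is the numeric comment appended after </html>, but that is enough to make every fetch look like a changed document.  So that is indeed the correct explanation.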
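
For illustration only, here is a minimal Python sketch of how the two fetches could be compared with that trailing comment ignored.  Nothing in this thread says the RSS connector does this; the function name stable_digest and the sample pages are made up, and the sketch assumes the only volatile part is a numeric comment at the very end of the page, as in the diff above.

```python
import hashlib
import re

# Assumption: the volatile part is a trailing HTML comment containing only
# digits, e.g. "<!-- 1313589725 -->", as seen in the diff.
TRAILING_STAMP = re.compile(r"<!--\s*\d+\s*-->\s*\Z")


def stable_digest(html: str) -> str:
    """Hash the page after removing a trailing numeric comment, if present."""
    normalized = TRAILING_STAMP.sub("", html)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Hypothetical stand-ins for file1.txt and file2.txt.
first = "<html><body>same</body></html><!-- 1313589725 -->"
second = "<html><body>same</body></html><!-- 1313589820 -->"

print(first == second)                                # False: raw text differs
print(stable_digest(first) == stable_digest(second))  # True: same once normalized
```

Any comparison based on the raw bytes will see a new document on every fetch, which matches the behavior you observed.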
Karl


On Wed, Aug 17, 2011 at 10:00 AM, K McGonigal <kmcgoniga@gmail.com> wrote:
Thanks Karl.  But it looks to me like all the documents are the same size in both runs. They are just indexed in a different order (for some unknown reason).
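
To check this without relying on eyeballing the rows, here is a small Python sketch that compares the two runs regardless of order.  The (path, bytes) pairs are copied from the fetch rows of the history quoted below; the function name size_changes is just something I made up for the example.

```python
BASE = "http://www.onemansjazz.ca/content/view/"

# (path, bytes) pairs from the "fetch" rows of the 20:36 and 20:37 runs.
first_run = [("334/30/", 16718), ("332/30/", 17083), ("336/30/", 17473),
             ("331/30/", 16980), ("329/30/", 17105), ("330/50/", 22605),
             ("333/30/", 17606)]
second_run = [("331/30/", 16980), ("334/30/", 16718), ("330/50/", 22605),
              ("333/30/", 17606), ("332/30/", 17083), ("336/30/", 17473),
              ("329/30/", 17105)]


def size_changes(before, after):
    """Return {url: (old_bytes, new_bytes)} for documents whose size differs."""
    old, new = dict(before), dict(after)
    return {BASE + path: (old.get(path), new.get(path))
            for path in sorted(old.keys() | new.keys())
            if old.get(path) != new.get(path)}


print(size_changes(first_run, second_run))  # prints {}: no size differences
```

Of course, equal sizes only show that the byte counts match, not that the contents do.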

Kate


On Tue, Aug 16, 2011 at 7:44 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Kate,

I ran a job based on the same feed twice.  Here are the results, from the simple history:

Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
08-16-2011 20:38:10.924 | job end | 1313541280969(jazz) |  | 0 | 1 |
08-16-2011 20:37:57.179 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 18 |
08-16-2011 20:37:56.241 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 905 |
08-16-2011 20:37:52.117 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 15 |
08-16-2011 20:37:51.241 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 839 |
08-16-2011 20:37:47.292 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 19 |
08-16-2011 20:37:46.241 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 1003 |
08-16-2011 20:37:42.149 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 19 |
08-16-2011 20:37:41.241 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 887 |
08-16-2011 20:37:37.165 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 20 |
08-16-2011 20:37:36.241 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 898 |
08-16-2011 20:37:32.783 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 19 |
08-16-2011 20:37:31.241 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 922 |
08-16-2011 20:37:27.191 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 52 |
08-16-2011 20:37:26.241 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 912 |
08-16-2011 20:37:21.241 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 542 |
08-16-2011 20:37:20.970 | job start | 1313541280969(jazz) |  | 0 | 1 |
08-16-2011 20:37:00.893 | job end | 1313541280969(jazz) |  | 0 | 1 |
08-16-2011 20:36:49.123 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 17 |
08-16-2011 20:36:48.076 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 1028 |
08-16-2011 20:36:44.305 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 34 |
08-16-2011 20:36:43.076 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 1208 |
08-16-2011 20:36:39.175 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 23 |
08-16-2011 20:36:38.076 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 1087 |
08-16-2011 20:36:33.983 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 24 |
08-16-2011 20:36:33.076 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 896 |
08-16-2011 20:36:29.297 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 24 |
08-16-2011 20:36:28.774 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 35 |
08-16-2011 20:36:28.076 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 1204 |
08-16-2011 20:36:23.076 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 5679 |
08-16-2011 20:36:21.130 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 418 |
08-16-2011 20:36:18.076 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 2969 |
08-16-2011 20:36:13.094 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 1945 |
08-16-2011 20:36:10.870 | job start | 1313541280969(jazz) |  | 0 | 1 |

Note that on each run, the size of each document being indexed changes.  This is likely due to "chrome" (advertisements, etc.) that the site delivers dynamically and at random.  The RSS connector will, of course, not be able to recognize that the content you are interested in hasn't changed, because as far as it can tell it *has*.

This is very different from the case where you use the "dechromed" content based on the "description" field.  In that case it is the feed's description field that is indexed, not the document contents, so no chrome is present.  Thus you are more likely to see repeated runs of a job index nothing if the job has a "dechromed" content mode set.
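
To make the distinction concrete, here is a rough Python sketch.  It is not the connector's actual change-detection code; fingerprint and render_page are hypothetical helpers, and the description text and page layout are invented for the example.  It only shows why a fingerprint of the whole page moves on every fetch while a fingerprint of the feed description stays put.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Digest used here as a stand-in for a change-detection version string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def render_page(body: str, chrome: str) -> str:
    """Hypothetical page: the stable body wrapped in per-fetch chrome."""
    return f"<html><body>{body}</body></html><!-- {chrome} -->"


# Hypothetical feed item description; it does not change between runs.
description = "Listener Survey"

# Two fetches of the same item, with chrome that differs per fetch.
page_run1 = render_page(description, "chrome-A")
page_run2 = render_page(description, "chrome-B")

# Whole-page ("chromed") comparison: looks changed on every run.
print(fingerprint(page_run1) == fingerprint(page_run2))      # False

# Description-only ("dechromed") comparison: looks unchanged.
print(fingerprint(description) == fingerprint(description))  # True
```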

Karl



On Tue, Aug 16, 2011 at 5:07 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
Hmm. I will keep this in mind, but I'm confused again. I just ran this job twice in a row and pretty much the same thing was sent to Solr.  The same number of items (7) were "add"ed. I think they were the same items, just in a different order. The second run also deleted an item from Solr that was not in the RSS document.  I'm pretty sure the RSS feed document or the linked documents did not change.

A snippet from the first run:

INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.one
mansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=New
s+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the
+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluat
ion+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+ar
isen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is
+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Clic
k+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://w
ww.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&li
teral.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+S
urvey&literal.pubdate=1310475289000} status=0 QTime=16
16-Aug-2011 3:18:13 PM org.apache.solr.update.processor.LogUpdateProcessor finis
h

A snippet from the second run:

INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={literal.source=http://www.one
mansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=New
s+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the
+time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluat
ion+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+ar
isen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is
+a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Clic
k+here+(http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://w
ww.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&li
teral.id=http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+S
urvey&literal.pubdate=1310475289000} status=0 QTime=15
16-Aug-2011 3:28:00 PM org.apache.solr.update.processor.LogUpdateProcessor finis
h

I think they are identical.
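
Here is roughly how the two log entries could be compared mechanically instead of by eye.  This is only a Python sketch: extract_params is a name I made up, the sample entries are shortened hypothetical ones, and it assumes the console wrapped the lines mid-token so they can be rejoined with nothing in between.

```python
import re
from urllib.parse import parse_qs

# Matches the params={...} block of a SolrCore "execute" log entry.
PARAMS = re.compile(r"params=\{(.*?)\}\s+status=")


def extract_params(log_text: str) -> dict:
    """Rejoin console-wrapped lines and parse the request parameters."""
    joined = log_text.replace("\r", "").replace("\n", "")
    match = PARAMS.search(joined)
    if match is None:
        raise ValueError("no params={...} block found")
    return parse_qs(match.group(1), keep_blank_values=True)


# Shortened, hypothetical entries wrapped at different columns.
first_log = ("INFO: [] webapp=/solr path=/update/extract params={literal.id=http://exa\n"
             "mple.org/1&literal.title=Listener+S\n"
             "urvey} status=0 QTime=16")
second_log = ("INFO: [] webapp=/solr path=/update/extract params={literal.id=http://example.o\n"
              "rg/1&literal.title=Listener+Survey} status=0 QTime=15")

print(extract_params(first_log) == extract_params(second_log))  # True
```

Note that the log entry only shows the request parameters, not the posted document body, so this compares the metadata rather than the page content itself.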


View a Job


Name:OMJ

Output connection: Solr Repository connection: RSS

Priority:5 Start method:Don't automatically start

Schedule type:Scan every document once Minimum recrawl interval:Not applicable
Expiration interval:Not applicable Reseed interval:Not applicable

No scheduled run times

Field mappings:
Metadata field name Solr field name
No field mapping specified

RSS urls: http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/

No url canonicalization specified; will reorder all urls and remove all sessions

No mappings specified; will accept all urls

Feed connection timeout (seconds): 60
Default feed rescan interval (minutes): 60
Minimum feed rescan interval (minutes): 15
Bad feed rescan interval (minutes): (Default feed rescan value)

Dechromed content source: none
Chromed content: none

No access tokens specified

No metadata specified


View Repository Connection Status


Name:RSS Description:

Connection type:RSS Max connections:10
Authority:None (global authority)

Throttling:
Bin regular expression Description Max avg fetches/min
No throttles

Parameters: Proxy port=
Proxy authentication password=********
Max server connections=2
Proxy host=
KB per second=64
Robots usage=none
Proxy authentication user name=
Max fetches per minute=12
Email address=kmcgoniga@gmail.com
Proxy authentication domain=
Throttle group=

Connection status:Connection working