manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Trouble indexing a Twitter search in RSS format
Date Wed, 17 Aug 2011 14:07:25 GMT
Sorry, I was misaligned.  But it actually is true that the pages differ.  I
captured two fetches of the same document and diff'd them:

root@duck96:~# diff file1.txt file2.txt
408c408
< </html><!-- 1313589725 -->
\ No newline at end of file
---
> </html><!-- 1313589820 -->
\ No newline at end of file
root@duck96:~#

So that is indeed the correct explanation.
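
For illustration only, here is a minimal Python sketch of why a change check based on the raw page bytes treats these two fetches as different, and how they compare once the trailing comment is ignored. This is not how ManifoldCF or the RSS connector computes document versions; the function name stable_digest, the regular expression, and the sample pages are all made up for the example, and it assumes the trailing numeric comment is the only thing that varies between fetches.

```python
import hashlib
import re

# Assumption: the only volatile part of the page is a trailing HTML comment
# holding a number, as in the diff above (e.g. "<!-- 1313589725 -->").
TRAILING_STAMP = re.compile(rb"<!--\s*\d+\s*-->\s*\Z")


def stable_digest(page: bytes) -> str:
    """Hash a page after dropping a trailing numeric HTML comment, if any."""
    normalized = TRAILING_STAMP.sub(b"", page)
    return hashlib.sha1(normalized).hexdigest()


# Hypothetical stand-ins for file1.txt and file2.txt.
first = b"<html><body>same content</body></html><!-- 1313589725 -->"
second = b"<html><body>same content</body></html><!-- 1313589820 -->"

# Raw bytes differ, so a naive comparison reports a change.
print(hashlib.sha1(first).hexdigest() == hashlib.sha1(second).hexdigest())

# With the trailing stamp removed, the two fetches compare equal.
print(stable_digest(first) == stable_digest(second))
```

The first comparison prints False and the second prints True, which matches what the diff shows: the byte count is the same on every fetch, but the content is not.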
Karl


On Wed, Aug 17, 2011 at 10:00 AM, K McGonigal <kmcgoniga@gmail.com> wrote:

> Thanks Karl.  But it looks to me like all the documents are the same size
> in both runs. They are just indexed in a different order (for some unknown
> reason).
>
> Kate
>
>
> On Tue, Aug 16, 2011 at 7:44 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Kate,
>>
>> I ran a job based on the same feed twice.  Here are the results, from the
>> simple history:
>>
>> Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
>> 08-16-2011 20:38:10.924 | job end | 1313541280969(jazz) | | 0 | 1 |
>> 08-16-2011 20:37:57.179 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 18 |
>> 08-16-2011 20:37:56.241 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 905 |
>> 08-16-2011 20:37:52.117 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 15 |
>> 08-16-2011 20:37:51.241 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 839 |
>> 08-16-2011 20:37:47.292 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 19 |
>> 08-16-2011 20:37:46.241 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 1003 |
>> 08-16-2011 20:37:42.149 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 19 |
>> 08-16-2011 20:37:41.241 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 887 |
>> 08-16-2011 20:37:37.165 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 20 |
>> 08-16-2011 20:37:36.241 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 898 |
>> 08-16-2011 20:37:32.783 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 19 |
>> 08-16-2011 20:37:31.241 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 922 |
>> 08-16-2011 20:37:27.191 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 52 |
>> 08-16-2011 20:37:26.241 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 912 |
>> 08-16-2011 20:37:21.241 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 542 |
>> 08-16-2011 20:37:20.970 | job start | 1313541280969(jazz) | | 0 | 1 |
>> 08-16-2011 20:37:00.893 | job end | 1313541280969(jazz) | | 0 | 1 |
>> 08-16-2011 20:36:49.123 | document ingest (solr) | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 17 |
>> 08-16-2011 20:36:48.076 | fetch | http://www.onemansjazz.ca/content/view/334/30/ | 200 | 16718 | 1028 |
>> 08-16-2011 20:36:44.305 | document ingest (solr) | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 34 |
>> 08-16-2011 20:36:43.076 | fetch | http://www.onemansjazz.ca/content/view/332/30/ | 200 | 17083 | 1208 |
>> 08-16-2011 20:36:39.175 | document ingest (solr) | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 23 |
>> 08-16-2011 20:36:38.076 | fetch | http://www.onemansjazz.ca/content/view/336/30/ | 200 | 17473 | 1087 |
>> 08-16-2011 20:36:33.983 | document ingest (solr) | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 24 |
>> 08-16-2011 20:36:33.076 | fetch | http://www.onemansjazz.ca/content/view/331/30/ | 200 | 16980 | 896 |
>> 08-16-2011 20:36:29.297 | document ingest (solr) | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 24 |
>> 08-16-2011 20:36:28.774 | document ingest (solr) | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 35 |
>> 08-16-2011 20:36:28.076 | fetch | http://www.onemansjazz.ca/content/view/329/30/ | 200 | 17105 | 1204 |
>> 08-16-2011 20:36:23.076 | fetch | http://www.onemansjazz.ca/content/view/330/50/ | 200 | 22605 | 5679 |
>> 08-16-2011 20:36:21.130 | document ingest (solr) | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 418 |
>> 08-16-2011 20:36:18.076 | fetch | http://www.onemansjazz.ca/content/view/333/30/ | 200 | 17606 | 2969 |
>> 08-16-2011 20:36:13.094 | fetch | http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2....0/no_html,1/ | 200 | 3973 | 1945 |
>> 08-16-2011 20:36:10.870 | job start | 1313541280969(jazz) | | 0 | 1 |
>>
>> Note that on each run, the size of each document being indexed changes.
>> This is likely due to "chrome" (advertisements, etc.) which are dynamically
>> delivered by the site in a random way.  The RSS connector will, of course,
>> not be able to recognize that the content you are interested in hasn't
>> changed, because as far as it can tell it *has*.
>>
>> This is very different from the case where you use the "dechromed"
>> content based on the "description" field, because it is the actual feed
>> description field that is indexed, not the document contents, and therefore
>> no chrome will be present.  Thus you are more likely to see repeated runs of
>> a job index nothing if the job has a "dechromed" content mode set.
>>
>> Karl
>>
>>
>>
>> On Tue, Aug 16, 2011 at 5:07 PM, K McGonigal <kmcgoniga@gmail.com> wrote:
>>
>>> Hmm. I will keep this in mind, but I'm confused again. I just ran this
>>> job twice in a row and pretty much the same thing was sent to Solr.  The
>>> same number of items (7) were "add"ed. I think they were the same items,
>>> just in a different order. The second run also deleted an item from Solr
>>> that was not in the RSS document.  I'm pretty sure the RSS feed document or
>>> the linked documents did not change.
>>>
>>> A snippet from the first run:
>>>
>>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 16
>>>> 16-Aug-2011 3:18:11 PM org.apache.solr.core.SolrCore execute
>>>> INFO: [] webapp=/solr path=/update/extract params={literal.source=
>>>> http://www.one
>>>>
>>>> mansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=New
>>>>
>>>> s+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the
>>>>
>>>> +time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluat
>>>>
>>>> ion+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+ar
>>>>
>>>> isen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is
>>>>
>>>> +a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Clic
>>>> k+here+(
>>>> http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://w
>>>>
>>>> ww.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&li
>>>> teral.id=
>>>> http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+S
>>>> urvey&literal.pubdate=1310475289000} status=0 QTime=16
>>>> 16-Aug-2011 3:18:13 PM
>>>> org.apache.solr.update.processor.LogUpdateProcessor finis
>>>> h
>>>>
>>>
>>> A snippet from the second run:
>>>
>>> INFO: {add=[http://www.onemansjazz.ca/content/view/330/50/]} 0 15
>>>> 16-Aug-2011 3:27:55 PM org.apache.solr.core.SolrCore execute
>>>> INFO: [] webapp=/solr path=/update/extract params={literal.source=
>>>> http://www.one
>>>>
>>>> mansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/&literal.category=New
>>>>
>>>> s+-+General&literal.summary=I+have+created+a+Listener+Survey+and+if+you+have+the
>>>>
>>>> +time+to+complete+it,+that+would+be+terrific.++I%26#39;m+trying+to+do+an+evaluat
>>>>
>>>> ion+of+One+Man%26#39;s+Jazz+as+well+as+considering+some+new+options+that+have+ar
>>>>
>>>> isen.++Your+feedback+would+be+most+appreciate.This+survey+is+in+two+parts+and+is
>>>>
>>>> +a+total+of+twenty+parts,+most+of+them+just+require+a+click+of+your+mouse.++Clic
>>>> k+here+(
>>>> http://www.surveymonkey.com/s/C3DZ3JK)++for+Part+One,+and+here+(http://w
>>>>
>>>> ww.surveymonkey.com/s/C38FVH8)++for+Part+Two.+++Thanks+again+for+your+input.+&li
>>>> teral.id=
>>>> http://www.onemansjazz.ca/content/view/330/50/&literal.title=Listener+S
>>>> urvey&literal.pubdate=1310475289000} status=0 QTime=15
>>>> 16-Aug-2011 3:28:00 PM
>>>> org.apache.solr.update.processor.LogUpdateProcessor finis
>>>> h
>>>>
>>>
>>> I think they are identical.
>>>
>>>
>>> View a Job
>>>>  ------------------------------
>>>>  Name:OMJ
>>>> ------------------------------
>>>>  Output connection: Solr Repository connection: RSS
>>>> ------------------------------
>>>>  Priority:5 Start method:Don't automatically start
>>>> ------------------------------
>>>>  Schedule type:Scan every document once Minimum recrawl interval:Not
>>>> applicable  Expiration interval:Not applicable Reseed interval:Not
>>>> applicable
>>>> ------------------------------
>>>>  No scheduled run times
>>>> ------------------------------
>>>>    Field mappings:  Metadata field name Solr field name No field
>>>> mapping specified
>>>> ------------------------------
>>>>    RSS urls:
>>>> http://www.onemansjazz.ca/component/option,com_rss/feed,RSS2.0/no_html,1/
>>>>  ------------------------------
>>>> No url canonicalization specified; will reorder all urls and remove all
>>>> sessions
>>>> ------------------------------
>>>> No mappings specified; will accept all urls
>>>> ------------------------------
>>>>  Feed connection timeout (seconds): 60  Default feed rescan interval
>>>> (minutes): 60  Minimum feed rescan interval (minutes): 15  Bad feed
>>>> rescan interval (minutes): (Default feed rescan value)
>>>> ------------------------------
>>>>  Dechromed content source: none  Chromed content: none
>>>> ------------------------------
>>>> No access tokens specified
>>>> ------------------------------
>>>> No metadata specified
>>>
>>>
>>>
>>> View Repository Connection Status
>>>  ------------------------------
>>>  Name:RSS Description:
>>>  ------------------------------
>>>  Connection type:RSS Max connections:10  Authority:None (global
>>> authority)
>>> ------------------------------
>>>  Throttling:  Bin regular expression Description Max avg fetches/min No
>>> throttles
>>> ------------------------------
>>>    Parameters: Proxy port=
>>> Proxy authentication password=********
>>> Max server connections=2
>>> Proxy host=
>>> KB per second=64
>>> Robots usage=none
>>> Proxy authentication user name=
>>> Max fetches per minute=12
>>> Email address=kmcgoniga@gmail.com
>>> Proxy authentication domain=
>>> Throttle group=
>>>    ------------------------------
>>>  Connection status:Connection working
>>>
>>>
>>
>
