lucene-solr-user mailing list archives

From "Markus Jelsma - Buyways B.V." <mar...@buyways.nl>
Subject Re: Tika trouble
Date Mon, 16 Nov 2009 09:04:12 GMT
Does anyone have a clue?



> List,
> 
> 
> I somehow fail to index certain PDF files using the
> ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but a
> modified schema. I have a very simple schema for this case using only
> an ID field, a timestamp field and two dynamic fields, ignored_* and
> attr_*, both indexed, stored and multivalued strings. They are
> multivalued simply because some HTML files fail when storing multiple
> hyperlinks.
> 
> I have posted multiple files to
> http://.../update/extract?literal.id=doc1 including:
> 1. the whitepaper at
> http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
> 2. the HTML file of the front page of http://nu.nl/
> 3. another PDF at
> http://www.google.nl/url?sa=t&source=web&ct=res&cd=1&ved=0CAcQFjAA&url=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdf&rct=j&q=2007.cmp_mapreduce.hpca.pdf&ei=PPz7SpiiOM6l4QbZjKjRAw&usg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A
> 
> For each document I have the corresponding select/?q=*:* result:
> 
> 
> 1. No text? Should I see something?
> 
> <doc><str name="id">doc1</str>
> <arr name="ignored_content_type">
> <str>application/octet-stream</str>
> </arr>
> <arr name="ignored_stream_content_type">
> <str>
> text/xml; charset=UTF-8;
> boundary=----------------------------cf57b4ad644d
> </str>
> </arr>
> <arr name="ignored_stream_size">
> <str>491238</str>
> </arr>
> <arr name="ignored_text">
> <str>        </str>
> </arr>
> <date name="timestamp">2009-11-12T12:17:23.016Z</date>
> </doc>
> 
> 
> 2. Plenty of data; this seems to be OK
> 
> <doc>
> <str name="id">doc1</str>
> <arr name="ignored_content_type">
> <str>application/xhtml+xml</str>
> </arr>
> <arr name="ignored_links">
> <str>http://www.nu.nl/</str>
> <str>http://www.nu.nl/</str>
> <str>http://www.nu.nl/algemeen/</str>
> <str>http://www.nu.nl/economie/</str>
> ....
> <arr name="ignored_stream_content_type">
> <str>
> text/xml; charset=UTF-8;
> boundary=----------------------------b6e44d087bdd
> </str>
> </arr>
> <arr name="ignored_stream_size">
> <str>36991</str>
> </arr>
> <arr name="ignored_text">
> <str>
> A LOT OF TEXT HERE
> </str>
> </arr>
> <date name="timestamp">2009-11-12T12:19:15.415Z</date>
> </doc>
> 
> 
> 3. A lot of garbage
> 
> <doc>
> <str name="id">doc1</str>
> <arr name="ignored_content_encoding">
> <str>windows-1252</str>
> </arr>
> <arr name="ignored_content_language">
> <str>fr</str>
> </arr>
> <arr name="ignored_content_type">
> <str>text/plain</str>
> </arr>
> <arr name="ignored_language">
> <str>fr</str>
> </arr>
> <arr name="ignored_stream_content_type">
> <str>
> text/xml; charset=UTF-8;
> boundary=----------------------------83df0fd4d358
> </str>
> </arr>
> <arr name="ignored_stream_size">
> <str>361458</str>
> </arr>
> <arr name="ignored_text">
> <str>
> A LOT OF GARBAGE HERE including
> 
> ió½·Þp™ó 4­0› 
> š©xÓ ^CøùI3람š³î¨V ÚÜ¡yS4 ¹£ ² ›H 6õɨ5¤ ÅÜ磩bädÒøŸ\
�s%OîÐÙIÑYRäŠ ;4
> ¢9"r "—!rEôˆÌ {SìûD²à £©ïœ«{‘ínÆ N÷ô¥F»�™ ±¡Ë'ú\³=·m„Þ
»ý)³Å=j¶B¢)`  Ñ
> „Ï™hjCu{£É5{¢¯ç6½Ñhr¢ºÃ=J M- AqsøtÜì ÿ^Rl S?¿óšM‰—lv‘Ø›Qüãý´
þžŽ
> $S;¾¦wze³Ù)qÉú§ ‰› ãqó…Ó ‰ª"U:šBÝ‘GuŠ"ë
> MM±Òv �~ ‚N‹t¢ä§~Ì ÞŒS—Êòö¼ÊÄQaº¸¿7tñ ¾Áç œãØŒ58$O 3Å~�8¿L
 ‡ëŽó©pk_
> Ša Â=u×; (ä<¹@.œ÷ä ù° µk+ÿ PP~ ¨*ݤ¿Œ™¡D»   @fI$0°�Î Ù·p“Œ,Øâ
 †¶v
> ¤v1#8¼0 ›  èð€-†šZ 6¾  ! ñb ˆbˆ¤v)LS)T X² ¬ l!@€  6E$Q
> endstream
> endobj
> 137 0
> obj<</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p]>>
> endobj
> 138 0 obj<</Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942
> 728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV
> 141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0>>
> endobj
> 139 0 obj<</Count 12/Kids[140 0 R 141 0 R]/Type/Pages>>
> endobj
> 140 0 obj<</Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0
> R]/Type/Pages/Parent 139 0 R>>
> endobj
> 141 0 obj<</Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0
> R]/Type/Pages/Parent 
> 
> ....
> 
> </str>
> </arr>
> <date name="timestamp">2009-11-12T12:21:28.306Z</date>
> </doc>
> 
> 
> Any ideas? Why doesn't the whitepaper produce any results, and why is the
> next whitepaper full of garbage? At least I'm happy that HTML works
> fine.
> 
> 
> 
> Regards,
> 
> -  
> Markus Jelsma          Buyways B.V.            
> Technisch Architect    Friesestraatweg 215c    
> http://www.buyways.nl  9743 AD Groningen       
> 
> 
> Alg. 050-853 6600      KvK  01074105
> Tel. 050-853 6620      Fax. 050-3118124
> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
> 
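
For anyone trying to reproduce the requests described in the quoted message, here is a minimal sketch of how the /update/extract URL with literal.id can be assembled. The base URL http://localhost:8983/solr and the helper name build_extract_url are assumptions for illustration only; the original message elides the host. The extra_params argument is a hypothetical placeholder for any further request parameters.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, extra_params=None):
    """Return an /update/extract URL carrying literal.id=<doc_id>.

    base_url is assumed to point at the Solr webapp root; the host used
    below is a placeholder, not the one from the original message.
    """
    params = {"literal.id": doc_id}
    if extra_params:
        # Hypothetical hook for additional request parameters.
        params.update(extra_params)
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Placeholder host; substitute the real Solr location.
url = build_extract_url("http://localhost:8983/solr", "doc1")
print(url)
```

The file itself would then be sent as the body of a POST to that URL, which this sketch deliberately leaves out.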

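The three outcomes in the quoted message (blank text, real text, raw PDF internals) can be told apart mechanically. The following is a rough sketch, assuming the <doc> element is available as a well-formed XML string and that the extracted body lands in the ignored_text field as shown above. The function name classify_extracted_text and the marker list are illustrative choices, not part of Solr or Tika.

```python
import xml.etree.ElementTree as ET

# Tokens from PDF object syntax, as seen in the third result in the message.
PDF_MARKERS = ("endstream", "endobj", "/FontDescriptor")


def classify_extracted_text(doc_xml, field="ignored_text"):
    """Classify one Solr <doc> as 'empty', 'raw-pdf' or 'text'."""
    doc = ET.fromstring(doc_xml)
    # Collect every <str> value under the multivalued <arr name=field>.
    values = [
        s.text or ""
        for arr in doc.findall("arr")
        if arr.get("name") == field
        for s in arr.findall("str")
    ]
    text = " ".join(values)
    if not text.strip():
        return "empty"      # case 1: only whitespace came back
    if any(marker in text for marker in PDF_MARKERS):
        return "raw-pdf"    # case 3: unparsed PDF structure in the field
    return "text"           # case 2: looks like extracted text


sample = (
    '<doc><str name="id">doc1</str>'
    '<arr name="ignored_text"><str>        </str></arr></doc>'
)
print(classify_extracted_text(sample))
```

A check like this only flags the symptom; it says nothing about why the extraction produced that result.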