spamassassin-users mailing list archives

From "Scott Ostrander" <SOstran...@printronix.com>
Subject Extracting text from .rtf and .doc attachments using Extracttext.pm on SA 3.3.1
Date Fri, 04 Jun 2010 17:39:07 GMT
I am using ExtractText from
http://whatever.frukt.org/spamassassin.text.shtml#ExtractText.pm
It extracts text from attached .rtf, .doc, and some other formats, then
feeds the results to BAYES and the normal body tests.
 
My issue is that it works great with SA 3.2.5, but on the same
server it does not give any results with SA 3.3.1.
I downgraded SA back to 3.2.5 and ExtractText works again.
 
The dbg output looks like this in 3.3.1:
Jun 3 07:54:17.447 [11937] dbg: extracttext: Part: application/msword spam.doc
Jun 3 07:54:17.447 [11937] dbg: extracttext: Match: name "spam.doc" =~ ".*\.doc"
Jun 3 07:54:17.534 [11937] dbg: extracttext: External call: antiword "/usr/bin/antiword","-t","-w","0","-m","UTF-8.txt","-"
Jun 3 07:54:17.537 [11937] info: extracttext: External extraction command: "/usr/bin/antiword","-t","-w","0","-m","UTF-8.txt","-"
Jun 3 07:54:17.537 [11937] info: extracttext: External extraction object: 17 application/msword "spam.doc"
Jun 3 07:54:17.538 [11937] info: extracttext: External extraction error: antiword 0 ?
Jun 3 07:54:17.538 [11937] dbg: extracttext: Match: name "spam.doc" =~ ".*\.doc"
Jun 3 07:54:17.538 [11937] dbg: extracttext: External call: unrtf "/usr/local/bin/unrtf","-t","ExtractText.tags","--nopict"
Jun 3 07:54:17.539 [11937] info: extracttext: External extraction command: "/usr/local/bin/unrtf","-t","ExtractText.tags","--nopict"
Jun 3 07:54:17.540 [11937] info: extracttext: External extraction object: 17 application/msword "spam.doc"
Jun 3 07:54:17.540 [11937] info: extracttext: External extraction error: unrtf 0 ?
Jun 3 07:54:17.616 [11937] dbg: extracttext: Magic: application/x-ole-storage
Jun 3 07:54:17.617 [11937] dbg: extracttext: Not extracted
Jun 3 07:54:17.617 [11937] dbg: extracttext: X-ExtractText-Words: 0
Jun 3 07:54:17.617 [11937] dbg: extracttext: X-ExtractText-Chars: 0

The dbg output looks like this in 3.2.5:
[7828] dbg: extracttext: Part: application/msword spam.doc
[7828] dbg: extracttext: Match: name "spam.doc" =~ ".*\.doc"
[7828] dbg: extracttext: External call: antiword "/usr/bin/antiword","-t","-w","0","-m","UTF-8.txt","-"
[7828] info: extracttext: Extracted 40 chars using antiword
[7828] info: extracttext: Text: Viagra
[7828] info: extracttext: Text: Free sex
[7828] info: extracttext: Text: Free porn
[7828] info: extracttext: Text: Cash Out Now
[7828] dbg: extracttext: X-ExtractText-Words: 8
[7828] dbg: extracttext: X-ExtractText-Chars: 40
[7828] dbg: extracttext: X-ExtractText-Tools: antiword
[7828] dbg: extracttext: X-ExtractText-Types: application/msword
[7828] dbg: extracttext: X-ExtractText-Extensions: doc
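
To compare the two runs more easily, the extracttext lines can be pulled out of the debug output with a short script. This is only a rough sketch: the function name `summarize` and the regular expressions are my own assumptions based on the log lines above, and are not part of ExtractText.pm or SpamAssassin.

```python
import re


def summarize(debug_output):
    """Summarize extracttext lines found in SpamAssassin debug output.

    Hypothetical helper: the patterns are guessed from the log lines
    quoted in this message, not taken from the plugin source.
    """
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(debug_output.split())
    return {
        # Tool name and the value that follows it, e.g. ("antiword", "0")
        "errors": re.findall(r"External extraction error: (\S+) (\S+)", flat),
        # Character count and tool, e.g. ("40", "antiword")
        "extracted": re.findall(r"Extracted (\d+) chars using (\S+)", flat),
        # X-ExtractText-* values keyed by suffix, e.g. {"Words": "0"}
        "headers": dict(re.findall(r"X-ExtractText-(\w+): (\S+)", flat)),
    }


# A few lines copied from the 3.3.1 run, still wrapped as in the mail.
log_331 = """\
Jun 3 07:54:17.538 [11937] info: extracttext: External extraction error:
antiword 0 ?
Jun 3 07:54:17.540 [11937] info: extracttext: External extraction error:
unrtf 0 ?
Jun 3 07:54:17.617 [11937] dbg: extracttext: X-ExtractText-Words: 0
"""

print(summarize(log_331))
```

Running it on both logs shows the difference at a glance: the 3.3.1 run has two "External extraction error" entries and no "Extracted" entry, while the 3.2.5 run has one "Extracted" entry and no errors.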
 
Any thoughts on how to get it to work with 3.3.1?
_____________________________
Scott Ostrander
Staff System Administrator

  

