jackrabbit-users mailing list archives

From LukashP <luka...@poczta.onet.pl>
Subject Extracting content from document
Date Sun, 15 Nov 2009 14:54:57 GMT

Hi,
This is my first post here, so please be tolerant of any mistakes :).
I'm importing a large group of Word (*.doc) files into a Jackrabbit repository
(as a batch operation). I've set up Jackrabbit so that content is extracted
immediately during the import (strictly speaking, when the transaction is
committed). Most of the files are fine, and MsWordTextExtractor successfully
extracts their text content (which lets me use full-text search later).
However, for some of them the content can't be extracted, for whatever
reason. That's OK, since some of them may be in the wrong format or similar,
but I would like to know about such a problem immediately.
The problem is that when MsWordTextExtractor is not able to extract the
content, it only logs a warning (and I think that's all; the log is below,
trimmed to the significant lines). Is there any way I could find out about an
extraction failure immediately, while importing?

[15:27:50,699] [WARN ] [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to extract Word text content
java.lang.ArrayIndexOutOfBoundsException: 59730
	at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
...
org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
...
org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
...
org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)
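
Since the log above suggests the extractor only logs a warning and lets the commit proceed, one possible workaround is to run the extraction yourself before adding each file, and treat an exception or empty result as a failure. The following is only a minimal sketch of that idea, not Jackrabbit's own mechanism: ExtractionPreCheck, Extractor and ExtractionFailedException are hypothetical names introduced for illustration, and the actual extraction call is left to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/** Hypothetical helper: runs an extractor eagerly so failures surface before the import. */
final class ExtractionPreCheck {

    /** Hypothetical stand-in for whatever performs the real text extraction. */
    interface Extractor {
        String extract(InputStream in) throws Exception;
    }

    /** Thrown when the extractor fails or yields no text. */
    static final class ExtractionFailedException extends Exception {
        ExtractionFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Extracts text from the given bytes and returns it, or throws if the
     * extractor raised any exception or produced only whitespace.
     */
    static String check(String name, byte[] content, Extractor extractor)
            throws ExtractionFailedException {
        String text;
        try (InputStream in = new ByteArrayInputStream(content)) {
            text = extractor.extract(in);
        } catch (Exception e) {
            // Also catches runtime errors such as ArrayIndexOutOfBoundsException.
            throw new ExtractionFailedException("Extraction failed for " + name, e);
        }
        if (text == null || text.trim().isEmpty()) {
            throw new ExtractionFailedException("No text extracted from " + name, null);
        }
        return text;
    }
}
```

Assuming Apache POI is on the classpath, the extractor argument could be something like `in -> new WordExtractor(in).getText()`, called for each file before it is added to the repository. The trade-off is that every document gets parsed twice, once by the pre-check and once by Jackrabbit during indexing.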

I would be thankful for any help.

Regards, 
Luke

-- 
View this message in context: http://n4.nabble.com/Extracting-content-from-document-tp621776p621776.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
