jackrabbit-users mailing list archives

From Dave Brosius <dbros...@mebigfatguy.com>
Subject Re: Extracting content from document
Date Sun, 15 Nov 2009 15:37:55 GMT
If there's no direct way...   :)

I suppose you could create your own text extractor that derives from
MsWordTextExtractor, overrides extractText, and delegates to super in a
try/catch block.

Then specify this extractor in your repository.xml file.
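
A rough sketch of that idea follows. So that it compiles without Jackrabbit on the classpath, it extends a hypothetical stand-in class (StandInWordExtractor) instead of the real MsWordTextExtractor; the stand-in only mirrors the extractText(InputStream, String, String) signature. ReportingWordExtractor and its static FAILURES list are likewise made-up names. The list is static on the assumption that Jackrabbit instantiates the extractor itself from the configured class name, so the importing code never holds the instance.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for org.apache.jackrabbit.extractor.MsWordTextExtractor.
// It treats an empty stream as a "corrupt document" and throws, purely for the demo.
class StandInWordExtractor {
    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = stream.read()) != -1) {
            text.append((char) b);
        }
        if (text.length() == 0) {
            throw new IllegalStateException("simulated parse failure");
        }
        return new StringReader(text.toString());
    }
}

// The extractor from the suggestion above: override extractText and
// delegate to super inside a try/catch block.
class ReportingWordExtractor extends StandInWordExtractor {
    // Hypothetical failure registry that the importing code can inspect.
    static final List<String> FAILURES = new CopyOnWriteArrayList<>();

    @Override
    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        try {
            return super.extractText(stream, type, encoding);
        } catch (RuntimeException e) {
            // Record the failure instead of only logging it, then fall back
            // to empty text so the import itself can still complete.
            FAILURES.add(type + ": " + e);
            return new StringReader("");
        }
    }
}

class ExtractorSketch {
    public static void main(String[] args) throws IOException {
        ReportingWordExtractor extractor = new ReportingWordExtractor();
        extractor.extractText(
                new ByteArrayInputStream("some text".getBytes("ISO-8859-1")),
                "application/msword", null);
        extractor.extractText(
                new ByteArrayInputStream(new byte[0]),
                "application/msword", null);
        // After the import, check whether any extraction failed.
        System.out.println("failed extractions: " + ReportingWordExtractor.FAILURES);
    }
}
```

One caveat: the quoted log shows the warning coming from MsWordTextExtractor.extractText() itself, which suggests the real class may already catch the exception and return empty text. If so, the catch block above would never fire, and the override would have to treat an empty result as the failure signal instead (at the cost of also flagging genuinely empty documents). As for the configuration, if I remember correctly the extractor classes are listed in the textFilterClasses parameter of the SearchIndex element, but check that against your Jackrabbit version.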

LukashP wrote:
> Hi,
> It's my first post here, so please, be tolerant of any mistakes :).
> I'm importing a large group of Word (*.doc) files into a Jackrabbit repository
> (batch operation). I've set up Jackrabbit so that content is extracted
> immediately on import (on committing the transaction, to be strict).
> Most of the files are fine, and MsWordTextExtractor can successfully extract
> their text content (which allows me to use full-text search later).
> However, for some of them I have a problem: the content can't be extracted
> for whatever reason. That's OK, some of them may be in the wrong format or so,
> but I would like to know about such a problem immediately.
> The problem is that when MsWordTextExtractor is not able to extract content, it
> only logs a warning about it (and I think that's all; the log is below, I've shown
> only the significant lines). Is there any way I could know about an extraction
> failure immediately, when importing?
> [15:27:50,699] [WARN ]
> [http-8080-3][PzuSA,demu,BRAK][MsWordTextExtractor.extractText()] Failed to
> extract Word text content
> java.lang.ArrayIndexOutOfBoundsException: 59730
> 	at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:475)
> ...
> org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
> ...
> org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:701)
> ...
> org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
> 	at xxxDocumentRepository.addAsImported(xxxDocumentRepository.java:288)
> I would be thankful for any help.
> Regards, 
> Luke
