chemistry-dev mailing list archives

From Jonathan Muniz <joncmu...@gmail.com>
Subject How to get Content from PDF document.
Date Thu, 24 Apr 2014 14:22:06 GMT
Hi all,
I have PDF documents whose contents I want to extract, so that I can
present a summary and also show the passage containing the text the user
searched for.

To do this, I currently have to create plain-text copies of the documents.
This is how I get at the content:
// Load documents under the target folder

ItemIterable<QueryResult> documentsResultSet = sessionCopia.query(
"SELECT * FROM cmis:document WHERE IN_FOLDER('" + parentFolder + "') AND "
+ "cmis:name = '" + fileName + "'", false).getPage();

Then I get the object by its ID:

CmisObject object = sessionCopia
.getObject(documentSearchResult.getId());
Document document = (Document) object;

// Here I get the whole stream and convert it to a string, then look for
// searchParam in the plain text, using the Java API.

return
TransformAndExtractInputStreamForStringCmis.getInputStreamToText(document
.getContentStream().getStream(), searchParam);
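
The helper called above is not shown in this message. As a rough sketch of
what it might do, assuming the stream is already UTF-8 plain text (as with
the plain-text copies mentioned earlier), the class below reads the stream,
finds the first case-insensitive match of the search term, and returns the
surrounding text. The names SnippetExtractor, extractSnippet and the context
parameter are hypothetical and are not part of OpenCMIS.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: not the author's class and not an OpenCMIS API.
public class SnippetExtractor {

    // Reads the whole stream into a String, assuming UTF-8 plain text.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Returns the index of the first case-insensitive match, or -1.
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the first match plus up to `context` characters on each side,
    // or null when the term does not occur in the text.
    public static String extractSnippet(InputStream in, String searchParam, int context) {
        String text;
        try {
            text = readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int hit = indexOfIgnoreCase(text, searchParam);
        if (hit < 0) {
            return null;
        }
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + searchParam.length() + context);
        return text.substring(start, end);
    }
}
```

With a CMIS document, the call would look something like
SnippetExtractor.extractSnippet(document.getContentStream().getStream(),
searchParam, 100). This still downloads the full content for every hit,
which is the cost the question below is about.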

Could someone point me to a better way of doing this? I thought I could run
the search within the document content and extract the passage using
something that is already indexed.
Finding the indexed document was easy.
But what would finding the content inside it, and extracting it using the
CMIS API, look like?

Thank you.
