poi-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 49435] New: Metadata extractor fails to extract metadata when using a ByteArrayInputStream
Date Mon, 14 Jun 2010 14:51:27 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=49435

           Summary: Metadata extractor fails to extract metadata when
                    using a ByteArrayInputStream
           Product: POI
           Version: 3.6
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
        AssignedTo: dev@poi.apache.org
        ReportedBy: dvergnaud@yahoo.com


I'm using the function POITextExtractor.getMetadataTextExtractor() on a docx
document to extract its metadata. If the main POITextExtractor is created from
a plain java.io.File, metadata extraction appears to work correctly. When it is
created from a ByteArrayInputStream instead (holding exactly the same content,
read with a simple FileInputStream on the same File object), some of the
information is changed or missing.

Example: using the following code: 

POITextExtractor te = ExtractorFactory.createExtractor(
    new java.io.File( "X:/projects/termDB/Frederick - Terminology database.docx" ) );
POITextExtractor te2 = te.getMetadataTextExtractor();
String t2 = te2.getText();
System.out.println( t2 );

I get the following output: 
Category = null
ContentStatus = null
ContentType = null
Created = Mon Apr 26 07:01:00 CEST 2010
CreatedString = 2010-04-26T07:01:00Z
Creator = David Vergnaud
Description = null
Identifier = null
Keywords = null
Language = null
LastModifiedBy = David Vergnaud
LastPrinted = null
LastPrintedString = 2010-04-26T07:01:00Z
Modified = Fri Apr 30 10:06:00 CEST 2010
ModifiedString = 2010-04-30T10:06:00Z
Revision = 31
Subject = null
Title = null
Version = null
Application = Microsoft Macintosh Word
AppVersion = 12.0000
Characters = 48573
CharactersWithSpaces = 59651
Company = Finnova AG Bankware
HyperlinkBase = null
HyperlinksChanged = false
Lines = 404
LinksUpToDate = false
Manager = null
Pages = 19
Paragraphs = 97
PresentationFormat = null
Template = Normal.dotm
TotalTime = 835

When using the following code instead:
java.io.File file = new java.io.File(
    "X:/projects/termDB/Frederick - Terminology database.docx" );
byte[] content = new byte[ (int)( file.length() ) ];
java.io.FileInputStream fis = new java.io.FileInputStream( file );
fis.read( content );
POITextExtractor te = ExtractorFactory.createExtractor( new
ByteArrayInputStream( content ) );
POITextExtractor te2 = te.getMetadataTextExtractor();
String t2 = te2.getText();
System.out.println( t2 );

I get the following output: 
Category = null
ContentStatus = null
ContentType = null
Created = Mon Jun 14 16:39:13 CEST 2010
CreatedString = 2010-06-14T16:39:13Z
Creator = David Vergnaud
Description = null
Identifier = null
Keywords = null
Language = null
LastModifiedBy = null
LastPrinted = null
LastPrintedString = 2010-06-14T16:39:13Z
Modified = null
ModifiedString = 2010-06-14T16:39:18Z
Revision = 31
Subject = null
Title = null
Version = null
Application = Microsoft Macintosh Word
AppVersion = 12.0000
Characters = 48573
CharactersWithSpaces = 59651
Company = Finnova AG Bankware
HyperlinkBase = null
HyperlinksChanged = false
Lines = 404
LinksUpToDate = false
Manager = null
Pages = 19
Paragraphs = 97
PresentationFormat = null
Template = Normal.dotm
TotalTime = 835

Some pieces of information (Created, CreatedString, LastPrintedString,
ModifiedString) have changed, and others (LastModifiedBy, Modified) are simply
null in the ByteArray version.
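
The differences between the two outputs can also be listed mechanically rather than by eye. The following is only a minimal sketch: the class name MetadataDiff and its methods are hypothetical helpers, not part of POI, and the sketch assumes the "Key = value" line layout shown in the outputs above. The strings in main() are a shortened excerpt of those outputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of POI): compares two metadata text dumps.
public class MetadataDiff {

    // Parse "Key = value" lines into a map that keeps the original key order.
    public static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (String line : text.split("\\r?\\n")) {
            int sep = line.indexOf(" = ");
            if (sep > 0) {
                fields.put(line.substring(0, sep).trim(),
                           line.substring(sep + 3).trim());
            }
        }
        return fields;
    }

    // Return the keys of the first dump whose value differs from,
    // or is absent in, the second dump.
    public static List<String> changedKeys(String fromFile, String fromStream) {
        Map<String, String> a = parse(fromFile);
        Map<String, String> b = parse(fromStream);
        List<String> changed = new ArrayList<String>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            if (!e.getValue().equals(b.get(e.getKey()))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Excerpt of the File-based output above.
        String fromFile = "Creator = David Vergnaud\n"
                + "LastModifiedBy = David Vergnaud\n"
                + "Modified = Fri Apr 30 10:06:00 CEST 2010\n"
                + "Revision = 31\n";
        // Excerpt of the ByteArrayInputStream-based output above.
        String fromStream = "Creator = David Vergnaud\n"
                + "LastModifiedBy = null\n"
                + "Modified = null\n"
                + "Revision = 31\n";
        System.out.println(changedKeys(fromFile, fromStream));
        // prints [LastModifiedBy, Modified]
    }
}
```

Feeding the two complete getText() results to changedKeys() in place of the excerpts would give the full list of affected fields.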

Interestingly, all dates shown in the ByteArray version are actually the date
when the program was executed (today, about 5 minutes ago). 
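
To rule out the file-reading step as a factor: the second snippet calls read() once, and a single InputStream.read(byte[]) call is not guaranteed to fill the whole array. The following is a minimal sketch of a loop that reads until end of stream. The class FullRead and the method readAll are hypothetical names used for illustration only, and this is not the code that produced the output above.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads a stream completely into a byte array.
public class FullRead {

    // Keep reading until read() reports end of stream (-1), so a short
    // read cannot leave part of the content missing.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the document to load.
        InputStream in = new FileInputStream(args[0]);
        try {
            byte[] content = readAll(in);
            System.out.println(content.length + " bytes read");
        } finally {
            in.close();
        }
    }
}
```

The resulting array can then be wrapped in a ByteArrayInputStream and passed to ExtractorFactory.createExtractor() exactly as in the second snippet; if the output is unchanged, the read step is not the cause.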

Incidentally, the Modified date in the "File" version (30 April) doesn't match
the modification date shown by my various operating systems, which all agree on
the 3rd of May. However, I guess that's the difference between the modification
date Word stores inside the document and the date of the last physical
modification of the file itself at the OS level.

