tika-dev mailing list archives

From "PNS (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-697) Tika reports the content type of AR archives as "text/plain"
Date Mon, 07 Nov 2011 09:35:51 GMT

    [ https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145299#comment-13145299 ]

PNS commented on TIKA-697:
--------------------------

Detection of Unix AR archives (see http://en.wikipedia.org/wiki/Ar_(Unix)) is very simple:
it can be done by checking whether the file starts with the 8 "magic" bytes (0x21, 0x3C, 0x61, 0x72,
0x63, 0x68, 0x3E, 0x0A), i.e. the ASCII string "!<arch>" followed by a line feed.
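
As a minimal sketch of that signature check outside Tika, something like the following could be used. The class and method names are hypothetical (they are not part of Tika), and the stream is assumed to support mark/reset so that it can be rewound after the check.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: tests a stream for the AR global header.
final class ArSignature {
    // "!<arch>\n" in ASCII: the 8 magic bytes at the start of a Unix AR archive.
    private static final byte[] AR_HEADER =
            {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

    /**
     * Returns true if the stream starts with the AR signature.
     * Assumes mark/reset support; the stream is rewound before returning.
     */
    public static boolean startsWithArHeader(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return false;
        }
        input.mark(AR_HEADER.length);
        try {
            for (byte expected : AR_HEADER) {
                // read() returns -1 at end of stream, which never equals a header byte
                if (input.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            input.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] ar = "!<arch>\nmember data".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithArHeader(new ByteArrayInputStream(ar)));
        System.out.println(startsWithArHeader(new ByteArrayInputStream(text)));
    }
}
```

Because all eight signature bytes are below 0x80, comparing the int returned by read() with the signed byte values is safe here.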

What needs to be changed in the Tika code is at least the TextDetector.detect() method, so
that it returns an AR media type if the first 8 bytes of the archive are the AR signature.

The AR MediaType needs to be added in class org.apache.tika.mime.MediaType and it will probably
be a custom one, since apparently there is no IANA-registered MIME type for AR (see http://en.wikipedia.org/wiki/List_of_archive_formats
and http://www.iana.org/assignments/media-types/index.html).

Assuming the existence of a statement like

	public static final MediaType APPLICATION_AR = application("x-ar");

in class org.apache.tika.mime.MediaType, the following is a quick implementation of the proposed
changes in the TextDetector.detect() method:

	// Code immediately after the static initialization block of the IS_CONTROL_BYTE[] array

	// AR global header: the ASCII string "!<arch>\n"
	private static final byte[] AR_HEADER = new byte[]
			{0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

	@Override
	public MediaType detect(InputStream input, Metadata metadata)
	throws IOException {
		if (input == null) {
			return MediaType.OCTET_STREAM;
		}

		input.mark(NUMBER_OF_BYTES_TO_TEST);
		// Local rather than a field, so concurrent detect() calls do not share state
		boolean checkArHeader = true;
		try {
			for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
				int ch = input.read();
				if (ch == -1) {
					if (i > 0) {
						return MediaType.TEXT_PLAIN;
					} else {
						// See https://issues.apache.org/jira/browse/TIKA-483
						return MediaType.OCTET_STREAM;
					}
				} else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
					return MediaType.OCTET_STREAM;
				} else if (checkArHeader) {
					// See https://issues.apache.org/jira/browse/TIKA-697
					if (AR_HEADER[i] != ch) {
						// Mismatch: stop comparing against the signature
						checkArHeader = false;
					} else if (i == AR_HEADER.length - 1) {
						// All 8 signature bytes matched
						return MediaType.APPLICATION_AR;
					}
				}
			}
			return MediaType.TEXT_PLAIN;
		} finally {
			input.reset();
		}
	}

Essentially, the additions are just the new MediaType.APPLICATION_AR constant, the 2 new variables
(AR_HEADER, checkArHeader) and the "else if (checkArHeader)" control block.

I have tested the above with numerous combinations of files and it works as expected.
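
For reference, the kinds of inputs worth covering can be modelled without Tika. The sketch below is a stand-alone approximation of the decision flow, not the Tika code itself: the class name, the 512-byte look-ahead and the control-byte rule (bytes below 0x20 other than tab, LF, FF, CR and ESC) are assumptions made for illustration, and media types are returned as plain strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone model of the proposed decision flow; not Tika code.
final class ArDetectionSketch {
    private static final int BYTES_TO_TEST = 512; // assumed look-ahead size
    private static final byte[] AR_HEADER =
            {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

    // Assumption: control bytes are those below 0x20 except tab, LF, FF, CR and ESC.
    private static boolean isControl(int b) {
        return b < 0x20 && b != 0x09 && b != 0x0a && b != 0x0c && b != 0x0d && b != 0x1b;
    }

    public static String classify(byte[] data) {
        int limit = Math.min(data.length, BYTES_TO_TEST);
        if (limit == 0) {
            return "application/octet-stream"; // empty input
        }
        boolean checkArHeader = true;
        for (int i = 0; i < limit; i++) {
            int ch = data[i] & 0xff;
            if (isControl(ch)) {
                return "application/octet-stream"; // binary content
            }
            if (checkArHeader) {
                if (AR_HEADER[i] != ch) {
                    checkArHeader = false; // mismatch: fall back to text detection
                } else if (i == AR_HEADER.length - 1) {
                    return "application/x-ar"; // full signature matched
                }
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        System.out.println(classify("!<arch>\nmember".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify("!<arch>".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify("hello".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify(new byte[] {0x00, 0x01}));
        System.out.println(classify(new byte[0]));
    }
}
```

The cases in main() cover a full signature, a truncated signature (which still looks like text), ordinary text, binary data and empty input.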

                
> Tika reports the content type of AR archives as "text/plain"
> ------------------------------------------------------------
>
>                 Key: TIKA-697
>                 URL: https://issues.apache.org/jira/browse/TIKA-697
>             Project: Tika
>          Issue Type: Bug
>         Environment: Linux (CentOS 5.6)
>            Reporter: PNS
>            Priority: Trivial
>
> The Tika.detect(InputStream) method returns "text/plain" for AR archives created with
> the Linux "Create Archive" option of Nautilus (available via right-clicking on a file).
> The Apache Commons Compress "autodetection" code of the ArchiveStreamFactory looks at
> the first 12 bytes of the stream and correctly identifies the type as AR.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
