Mailing-List: contact dev-help@tika.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@tika.apache.org
Date: Mon, 7 Nov 2011 09:41:51 +0000 (UTC)
From: "PNS (Issue Comment Edited) (JIRA)" <jira@apache.org>
To: dev@tika.apache.org
Message-ID: 
 <1733556902.6162.1320658911542.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1547642936.6048.1314123389090.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Issue Comment Edited] (TIKA-697) Tika reports the content
 type of AR archives as "text/plain"
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145299#comment-13145299 ] 

PNS edited comment on TIKA-697 at 11/7/11 9:40 AM:
---------------------------------------------------

Detection of Unix AR archives (see http://en.wikipedia.org/wiki/Ar_(Unix)) is very simple and can indeed be done by checking for the 8 "magic" bytes (0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A), which spell the ASCII string "!<arch>" followed by a line feed.
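
As a rough illustration of the signature check on its own, independent of Tika, here is a minimal sketch that compares the start of a byte array with the 8 magic bytes. The class and method names (ArMagic, startsWithArMagic) are hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;

final class ArMagic {
    // The 8-byte AR signature: "!<arch>" followed by a line feed (0x0A).
    static final byte[] AR_MAGIC = {0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A};

    /** Returns true if the given data begins with the AR signature. */
    static boolean startsWithArMagic(byte[] data) {
        if (data == null || data.length < AR_MAGIC.length) {
            return false; // too short to contain the full signature
        }
        for (int i = 0; i < AR_MAGIC.length; i++) {
            if (data[i] != AR_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] archive = "!<arch>\nmember data".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "just some text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithArMagic(archive));
        System.out.println(startsWithArMagic(text));
    }
}
```

All 8 signature bytes are below 0x80, so comparing them as signed Java bytes is safe here.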

What needs to be changed in the Tika code is at least the TextDetector.detect() method, so that it returns an AR media type if the first 8 bytes of the archive are the AR signature.

The AR MediaType needs to be added in class org.apache.tika.mime.MediaType and it will probably be a custom one, since apparently there is no IANA-registered MIME type for AR (see http://en.wikipedia.org/wiki/List_of_archive_formats and http://www.iana.org/assignments/media-types/index.html).

Assuming the existence of a statement like

{code}
	public static final MediaType APPLICATION_AR = application("x-ar");
{code}
in class *org.apache.tika.mime.MediaType*, the following is a quick implementation of the proposed changes in the *TextDetector.detect()* method:
{code}
	// Code immediately after the static initialization block of the IS_CONTROL_BYTE[] array

	// AR archive signature: "!<arch>" followed by a line feed
	private static final byte[] AR_HEADER = new byte[]
			{0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

	@Override
	public MediaType detect(InputStream input, Metadata metadata)
			throws IOException {
		if (input == null) {
			return MediaType.OCTET_STREAM;
		}

		input.mark(NUMBER_OF_BYTES_TO_TEST);
		// Kept local (not a field) so that concurrent detect() calls cannot interfere
		boolean checkArHeader = true;
		try {
			for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
				int ch = input.read();
				if (ch == -1) {
					if (i > 0) {
						return MediaType.TEXT_PLAIN;
					} else {
						// See https://issues.apache.org/jira/browse/TIKA-483
						return MediaType.OCTET_STREAM;
					}
				} else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
					return MediaType.OCTET_STREAM;
				} else if (checkArHeader) {
					// See https://issues.apache.org/jira/browse/TIKA-697
					if (i >= AR_HEADER.length || AR_HEADER[i] != ch) {
						// Mismatch: stop comparing against the signature
						checkArHeader = false;
					} else if (i == AR_HEADER.length - 1) {
						// All 8 signature bytes matched
						return MediaType.APPLICATION_AR;
					}
				}
			}
			return MediaType.TEXT_PLAIN;
		} finally {
			input.reset();
		}
	}
{code}
Essentially, the additions are just the new MediaType.APPLICATION_AR constant, the 2 new variables (AR_HEADER, checkArHeader) and the "else if (checkArHeader)" control block.

I have tested the above with numerous combinations of files and it works as expected.
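
For anyone who wants to try the control flow without a Tika checkout, the following is a self-contained sketch of the same loop that returns plain strings instead of MediaType constants. It is not the actual TextDetector: the class name ArDetectSketch, the 512-byte limit and the isControl() rule (bytes below 0x20 other than tab, LF, FF, CR and ESC) are assumptions standing in for NUMBER_OF_BYTES_TO_TEST and IS_CONTROL_BYTE.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ArDetectSketch {
    // Assumption: stand-in for NUMBER_OF_BYTES_TO_TEST.
    static final int BYTES_TO_TEST = 512;
    static final byte[] AR_HEADER = {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

    // Assumption: stand-in for the IS_CONTROL_BYTE[] lookup table.
    static boolean isControl(int ch) {
        return ch < 0x20 && ch != 0x09 && ch != 0x0A
                && ch != 0x0C && ch != 0x0D && ch != 0x1B;
    }

    /** Mirrors the proposed loop; the stream must support mark/reset. */
    static String detect(InputStream input) throws IOException {
        if (input == null) {
            return "application/octet-stream";
        }
        input.mark(BYTES_TO_TEST);
        boolean checkArHeader = true;
        try {
            for (int i = 0; i < BYTES_TO_TEST; i++) {
                int ch = input.read();
                if (ch == -1) {
                    // Empty input is treated as binary, short input as text
                    return i > 0 ? "text/plain" : "application/octet-stream";
                } else if (isControl(ch)) {
                    return "application/octet-stream";
                } else if (checkArHeader) {
                    if (AR_HEADER[i] != ch) {
                        checkArHeader = false; // stop comparing after a mismatch
                    } else if (i == AR_HEADER.length - 1) {
                        return "application/x-ar"; // all 8 bytes matched
                    }
                }
            }
            return "text/plain";
        } finally {
            input.reset(); // leave the stream where the caller had it
        }
    }

    public static void main(String[] args) throws IOException {
        byte[][] samples = {
            "!<arch>\nmember".getBytes(StandardCharsets.US_ASCII),
            "plain text".getBytes(StandardCharsets.US_ASCII),
            {},
            {0x00, 0x01}
        };
        for (byte[] sample : samples) {
            System.out.println(detect(new ByteArrayInputStream(sample)));
        }
    }
}
```

Note that AR_HEADER[i] is never read with i > 7 here: by the eighth byte the loop has either returned or cleared checkArHeader.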

                
                  
> Tika reports the content type of AR archives as "text/plain"
> ------------------------------------------------------------
>
>                 Key: TIKA-697
>                 URL: https://issues.apache.org/jira/browse/TIKA-697
>             Project: Tika
>          Issue Type: Bug
>         Environment: Linux (CentOS 5.6)
>            Reporter: PNS
>            Priority: Trivial
>
> The Tika.detect(InputStream) method returns "text/plain" for AR archives created with the Linux "Create Archive" option of Nautilus (available via right-clicking on a file).
> The Apache Commons Compress "autodetection" code of the ArchiveStreamFactory looks at the first 12 bytes of the stream and correctly identifies the type as AR.
