From: "PNS (Issue Comment Edited) (JIRA)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Date: Mon, 7 Nov 2011 09:41:51 +0000 (UTC)
Message-ID: <1733556902.6162.1320658911542.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1547642936.6048.1314123389090.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Issue Comment Edited] (TIKA-697) Tika reports the content type of AR archives as "text/plain"
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

[
https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145299#comment-13145299 ]

PNS edited comment on TIKA-697 at 11/7/11 9:40 AM:
---------------------------------------------------

Detection of Unix AR archive types (see http://en.wikipedia.org/wiki/Ar_(Unix)) is very simple and can indeed be done by checking for the 8 "magic" bytes (0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A), i.e. the ASCII string "!<arch>" followed by a line feed.

What needs to be changed in the Tika code is at least the TextDetector.detect() method, so that it returns an AR media type if the first 8 bytes of the archive are the AR signature.

The AR MediaType needs to be added in class org.apache.tika.mime.MediaType and it will probably be a custom one, since apparently there is no IANA-registered MIME type for AR (see http://en.wikipedia.org/wiki/List_of_archive_formats and http://www.iana.org/assignments/media-types/index.html).

Assuming the existence of a statement like

{code}
public static final MediaType APPLICATION_AR = application("x-ar");
{code}

in class *org.apache.tika.mime.MediaType*, the following is a quick implementation of the proposed changes in the *TextDetector.detect()* method:

{code}
// Code immediately after the static initialization block of the IS_CONTROL_BYTE[] array
private static final byte[] AR_HEADER =
        new byte[] {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

@Override
public MediaType detect(InputStream input, Metadata metadata) throws IOException {
    if (input == null) {
        return MediaType.OCTET_STREAM;
    }

    input.mark(NUMBER_OF_BYTES_TO_TEST);
    // Kept local (not a field) so that concurrent detect() calls do not share state
    boolean checkArHeader = true;
    try {
        for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
            int ch = input.read();
            if (ch == -1) {
                if (i > 0) {
                    return MediaType.TEXT_PLAIN;
                } else {
                    // See https://issues.apache.org/jira/browse/TIKA-483
                    return MediaType.OCTET_STREAM;
                }
            } else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
                return MediaType.OCTET_STREAM;
            } else if (checkArHeader) {
                // See https://issues.apache.org/jira/browse/TIKA-697
                if ((i >= AR_HEADER.length) || (AR_HEADER[i] != ch)) {
                    // Mismatch: stop comparing against the AR signature
                    checkArHeader = false;
                } else if (i == AR_HEADER.length - 1) {
                    // All 8 signature bytes matched
                    return MediaType.APPLICATION_AR;
                }
            }
        }
        return MediaType.TEXT_PLAIN;
    } finally {
        input.reset();
    }
}
{code}

Essentially, the additions are just the new MediaType.APPLICATION_AR constant, the 2 new variables (AR_HEADER, checkArHeader) and the "else if (checkArHeader)" control block.

I have tested the above with numerous combinations of files and it works as expected.
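The signature check described above can also be tried in isolation, without any Tika classes. The following is only a minimal sketch under stated assumptions: the class name ArMagicSketch and the method looksLikeAr() are hypothetical and not part of Tika, and the stream is assumed to support mark/reset (for example a BufferedInputStream or ByteArrayInputStream).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: checks a stream for the 8-byte AR signature.
final class ArMagicSketch {

    // "!<arch>" followed by a line feed
    private static final byte[] AR_HEADER =
            {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

    private ArMagicSketch() {
    }

    // Returns true if the stream starts with the AR signature.
    // The stream is reset to its original position before returning.
    static boolean looksLikeAr(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return false; // assumption: callers wrap plain streams in a BufferedInputStream
        }
        input.mark(AR_HEADER.length);
        try {
            for (int i = 0; i < AR_HEADER.length; i++) {
                int ch = input.read();
                // End of stream or first differing byte: not an AR archive
                if (ch == -1 || ch != AR_HEADER[i]) {
                    return false;
                }
            }
            return true;
        } finally {
            input.reset();
        }
    }
}
```

Because the flag-free loop keeps all state in local variables, the same helper can be called from several threads at once, which is the main reason for not storing the match state in a field.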
> Tika reports the content type of AR archives as "text/plain"
> ------------------------------------------------------------
>
>                 Key: TIKA-697
>                 URL: https://issues.apache.org/jira/browse/TIKA-697
>             Project: Tika
>          Issue Type: Bug
>         Environment: Linux (CentOS 5.6)
>            Reporter: PNS
>            Priority: Trivial
>
> The Tika.detect(InputStream) method returns "text/plain" for AR archives created with the Linux "Create Archive" option of Nautilus (available via right-clicking on a file).
> The Apache Commons Compress "autodetection" code of the ArchiveStreamFactory looks at the first 12 bytes of the stream and correctly identifies the type as AR.