From: "Nick Burch (JIRA)"
To: dev@tika.apache.org
Date: Mon, 4 Oct 2010 17:52:35 -0400 (EDT)
Subject: [jira] Commented: (TIKA-522) AutoDetectParser treats HTML/XML files as Audio
Message-ID: <26266803.538241286229155464.JavaMail.jira@thor>
In-Reply-To: <14608505.500951285955433231.JavaMail.jira@thor>
Reply-To: dev@tika.apache.org
Mailing-List: contact dev-help@tika.apache.org; run by ezmlm

    [ https://issues.apache.org/jira/browse/TIKA-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917788#action_12917788 ]

Nick Burch commented on TIKA-522:
---------------------------------

When it goes wrong, can
you capture the bytes in the buffer to look at? Also, what length is reported? In theory, MagicDetector should keep trying the read until it either has enough data, or the stream closes.

> AutoDetectParser treats HTML/XML files as Audio
> -----------------------------------------------
>
>                 Key: TIKA-522
>                 URL: https://issues.apache.org/jira/browse/TIKA-522
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Windows 7 x64, java v6.0.170.4, jdk1.6.0_21, Eclipse 20100617-1415
>            Reporter: Dennis Adler
>            Assignee: Ken Krugler
>
> I am crawling an SMB share. I've used the steps outlined in the Tika samples to initialize; given a File object in f, my code is:
>     parser = new AutoDetectParser();
>     context.set(Parser.class, parser);
>     // Get the URL
>     URL url = f.toURI().toURL();
>     // Extract Metadata
>     Metadata metadata = new Metadata();
>     BodyContentHandler handler = new BodyContentHandler(-1); // -1 = infinite size for XML string buffer (per file)
>     // Get the input stream
>     InputStream input = MetadataHelper.getInputStream(url, metadata);
>     // Parse the document
>     parser.parse(input, handler, metadata, context);
> If I place a breakpoint right after the parser.parse call, I find the metadata identifying my input as an audio file. If I step through the parse in the debugger, it is correctly tagged as text/html. This looks like a timing-related problem.
> I have a half-baked workaround: I call Thread.sleep(5000) just after the context.set call. In 3 sequential test runs that works fine. The problem is that this was working fine several days ago without the sleep (perhaps my computer was busy with other things and the timing issue did not show up then).
> I have downloaded and am building today's 0.8 from svn to see if that helps, though I am concerned about the impact on the rest of my testing if I have to switch to 0.8. Just understanding what is going on would be a huge help :)
> * UPDATE * I was able to reproduce this once under the debugger. MimeTypes.detect invokes org.apache.tika.mime.MimeTypes.getMimeType on the input stream to determine the MIME type based on the first 8k of data. I did not trace into getMimeType, but did see it return "audio/mpeg" on an HTML file one time, and "text/html" most other times. I can supply the HTML file if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
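
The "keep trying the read until it either has enough data, or the stream closes" behaviour mentioned in the comment can be sketched as below. This is only an illustration of the general technique, not Tika's actual MagicDetector code: the class PrefixReader, the method readPrefix and the TricklingStream helper are hypothetical names invented for this example. TricklingStream imitates a slow source (such as a network share) that hands back one byte per call, which is one way a single read() could return fewer bytes than requested.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Tika.
public class PrefixReader {

    /**
     * Reads from the stream until the buffer is full or the stream ends.
     * Returns the number of bytes actually placed in the buffer.
     */
    static int readPrefix(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            // A single read may legally return fewer bytes than requested.
            int n = in.read(buffer, total, buffer.length - total);
            if (n == -1) {
                break; // end of stream: keep whatever was read so far
            }
            total += n;
        }
        return total;
    }

    /** Test helper (assumption): a stream that yields at most one byte per read. */
    static class TricklingStream extends InputStream {
        private final byte[] data;
        private int pos = 0;

        TricklingStream(byte[] data) {
            this.data = data;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xFF) : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            if (len == 0) {
                return 0;
            }
            int c = read();
            if (c == -1) {
                return -1;
            }
            b[off] = (byte) c;
            return 1; // short read: only one byte delivered
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hi</body></html>".getBytes(StandardCharsets.US_ASCII);

        // One call to read() sees only part of the prefix.
        byte[] once = new byte[8];
        int single = new TricklingStream(html).read(once);

        // The loop keeps going until the buffer is full.
        byte[] full = new byte[8];
        int looped = readPrefix(new TricklingStream(html), full);

        System.out.println("single read: " + single + " byte(s)");
        System.out.println("looped read: " + looped + " byte(s): "
                + new String(full, 0, looped, StandardCharsets.US_ASCII));
    }
}
```

Capturing the buffer contents and the reported length, as requested in the comment, would show whether a short read like this is involved; the sketch does not establish that it is the cause of TIKA-522.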