Message-ID: <4D0E44BE.3020403@epimorphics.com>
Date: Sun, 19 Dec 2010 17:45:34 +0000
From: Andy Seaborne
To: jena-users@incubator.apache.org
Subject: Re: Turtle file with UTF-8 BOM fails to parse
References: <4D0B6F41.2010703@epimorphics.com> <4b0b76a09e44ad5c96f4d0b5975d0c20@ecs.soton.ac.uk>

Rob,

Thanks - this is now fixed in RIOT (ARQ SVN on SF) and in the current Jena
readers (Jena CVS on SF). The fix covers TriG as well.

The same problem applies to SPARQL. The most direct fix is to skip the BOM
at the start of the parse, but that requires a grammar change. That's what
I did for Turtle/N3 in Jena. But the SPARQL grammar is the grammar used to
produce the spec HTML and I don't want to contaminate the spec.

For now, I've added a wrapper that removes a leading BOM, so it's
functionally correct. Once the spec is frozen I can remove the wrapper and
move the BOM processing into the grammar.

	Andy

On 18/12/10 13:09, Rob Vesse wrote:
> Hi Andy
>
> I've created a JIRA issue for this -
> https://issues.apache.org/jira/browse/JENA-12
>
> I appreciate the need for minimal, complete examples as I have enough
> trouble getting those out of users on my own support lists
>
> Thanks,
>
> Rob
>
> On Fri, 17 Dec 2010 14:10:09 +0000, Andy Seaborne wrote:
>> Hi Rob,
>>
>> Thanks for the minimal, complete, example.
>>
>> The parsers should cope with a UTF-8 BOM even if it's not recommended.
>>
>> Could you raise a JIRA issue for this please (it's the new process!).
>> It'll need fixing in Jena and RIOT.
>>
>> Andy
>>
>> On 17/12/10 11:42, Rob Vesse wrote:
>>> Hi all
>>>
>>> I had this issue reported to me recently and have been able to confirm
>>> it myself (example data file attached).
>>> Essentially the issue is that if
>>> a Turtle file has a BOM at the start then Jena will refuse to parse it
>>> giving the following error:
>>>
>>> Exception in thread "main"
>>> com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 1,
>>> column 2. Encountered: "@" (64), after : "\ufeff"
>>>     at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:44)
>>>     at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:21)
>>>     at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:101)
>>>     at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:68)
>>>     at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:226)
>>>     at TurtleWithBOM.main(TurtleWithBOM.java:31)
>>>
>>> The code I used to produce this error was as follows:
>>>
>>> import com.hp.hpl.jena.rdf.model.*;
>>> import com.hp.hpl.jena.util.FileManager;
>>>
>>> import java.io.*;
>>>
>>> public class TurtleWithBOM
>>> {
>>>
>>>     public static void main(String[] args)
>>>     {
>>>
>>>         // create an empty model
>>>         Model model = ModelFactory.createDefaultModel();
>>>
>>>         InputStream in = FileManager.get().open( "ttl-with-bom.ttl" );
>>>         if (in == null)
>>>         {
>>>             throw new IllegalArgumentException( "File: ttl-with-bom.ttl not found");
>>>         }
>>>
>>>         // read the Turtle file
>>>         model.read(in, "", "TTL");
>>>
>>>         // write it to standard out
>>>         model.write(System.out);
>>>     }
>>> }
>>>
>>> A sample data file used with the above code to reproduce the error is
>>> attached.
>>>
>>> The data files are coming from my software which is all written in .Net
>>> and when outputting in UTF-8 the default behaviour of .Net is to include
>>> the BOM at the start of the file. The BOM is not required for UTF-8 but
>>> it is not forbidden so I think this should be fixed (if possible) for
>>> future releases.
>>> I will be modifying my software so that output of the
>>> BOM can be disabled by my users if desired
>>>
>>> Looking at the error message given I expect that the same problem would
>>> also affect N3 files since they are using the same reader afaict from
>>> the error trace.
>>>
>>> Regards,
>>>
>>> Rob Vesse
>>>
>>> --
>>> PhD Student
>>> IAM Group
>>> Bay 20, Room 4027, Building 32
>>> Electronics & Computer Science
>>> University of Southampton
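
For anyone stuck on a release without the fix, the "wrapper that removes a leading BOM" idea from the reply at the top of this thread can be approximated on the caller's side. The following is only a minimal sketch under stated assumptions: the class name BomSkipper and the method skipUtf8Bom are hypothetical, they are not part of Jena, ARQ, or RIOT, and this is not the wrapper that was actually committed. It uses only java.io and checks for the three bytes EF BB BF, which is how U+FEFF is encoded in UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Jena, ARQ, or RIOT.
final class BomSkipper {
    private BomSkipper() {}

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // A single read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // No BOM: push back whatever was consumed so the caller sees the stream unchanged.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a Turtle document that starts with a BOM.
        byte[] data = "\uFEFF@prefix ex: <http://example.org/> ."
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            // Prints the document text without the leading U+FEFF.
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

In the reproduction code quoted above, the read call would then become model.read(BomSkipper.skipUtf8Bom(in), "", "TTL"). Wrapping the stream rather than touching the grammar follows the same reasoning given for SPARQL in the reply: the parser stays as it is, and the wrapper can simply be dropped once the parser itself tolerates a BOM.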