jena-users mailing list archives

From Andy Seaborne <andy.seabo...@epimorphics.com>
Subject Re: Turtle file with UTF-8 BOM fails to parse
Date Fri, 17 Dec 2010 14:10:09 GMT
Hi Rob,

Thanks for the minimal, complete example.

The parsers should cope with a UTF-8 BOM even if it's not recommended.

Could you raise a JIRA issue for this please (it's the new process!). 
It'll need fixing in Jena and RIOT.

	Andy

On 17/12/10 11:42, Rob Vesse wrote:
> Hi all
>
> I had this issue reported to me recently and have been able to confirm
> it myself (example data file attached). Essentially the issue is that if
> a Turtle file has a BOM at the start then Jena will refuse to parse it
> giving the following error:
>
> Exception in thread "main"
> com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 1,
> column 2. Encountered: "@" (64), after : "\ufeff"
>     at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:44)
>     at com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:21)
>     at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:101)
>     at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:68)
>     at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:226)
>     at TurtleWithBOM.main(TurtleWithBOM.java:31)
>
> The code I used to produce this error was as follows:
>
> import com.hp.hpl.jena.rdf.model.*;
> import com.hp.hpl.jena.util.FileManager;
>
> import java.io.*;
>
> public class TurtleWithBOM
> {
>
>     public static void main(String[] args)
>     {
>
>         // create an empty model
>         Model model = ModelFactory.createDefaultModel();
>
>         InputStream in = FileManager.get().open( "ttl-with-bom.ttl" );
>         if (in == null)
>         {
>             throw new IllegalArgumentException( "File: ttl-with-bom.ttl not found");
>         }
>
>         // read the Turtle file
>         model.read(in, "", "TTL");
>
>         // write it to standard out
>         model.write(System.out);
>     }
> }
>
> A sample data file used with the above code to reproduce the error is
> attached.
>
> The data files are coming from my software which is all written in .Net
> and when outputting in UTF-8 the default behaviour of .Net is to include
> the BOM at the start of the file. The BOM is not required for UTF-8 but
> it is not forbidden so I think this should be fixed (if possible) for
> future releases. I will be modifying my software so that output of the
> BOM can be disabled by my users if desired.
>
> Looking at the error message given I expect that the same problem would
> also affect N3 files since they are using the same reader afaict from
> the error trace.
>
> Regards,
>
> Rob Vesse
>
> --
> PhD Student
> IAM Group
> Bay 20, Room 4027, Building 32
> Electronics & Computer Science
> University of Southampton
>
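
Until the parsers handle the BOM themselves, one possible caller-side workaround is to drop the three UTF-8 BOM bytes (EF BB BF) before handing the stream to the reader. The sketch below is an illustration only, not something from this thread or from Jena: the class name `BomSkipper` and the method `skipUtf8Bom` are hypothetical, and it relies on nothing beyond `java.io`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not part of Jena): skips a leading UTF-8 BOM if present.
public class BomSkipper {

    /**
     * Wraps the given stream and consumes a leading UTF-8 BOM (EF BB BF).
     * If the stream does not start with a BOM, the bytes that were peeked
     * are pushed back so the caller sees the content unchanged.
     */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;

        // read() may return fewer bytes than requested, so loop until we
        // have three bytes or reach end of stream
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // no BOM: put back whatever was read so nothing is lost
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

With the reproduction code from the original message, the read call would then become `model.read(BomSkipper.skipUtf8Bom(in), "", "TTL");`. This only covers UTF-8; files written with a UTF-16 BOM would need separate handling.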
