jena-users mailing list archives

From Andy Seaborne <andy.seabo...@epimorphics.com>
Subject Re: Turtle file with UTF-8 BOM fails to parse
Date Sun, 19 Dec 2010 17:45:34 GMT
Rob,

Thanks - and fixed in RIOT (ARQ SVN on SF) and the current Jena readers 
(Jena CVS on SF).  It covers TriG as well.

The same is true for SPARQL - the most direct fix is to skip the BOM at 
the start of the parse, but doing that requires a grammar change.  That's 
what I did for Turtle/N3 in Jena.  But the SPARQL grammar is the grammar 
used to produce the spec HTML and I don't want to contaminate the spec.

For now, I've added a wrapper that removes a leading BOM, so it's 
functionally correct.  When the spec is frozen, I can remove the wrapper 
and move the BOM processing into the grammar.
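
A minimal sketch of what such a wrapper could look like, for illustration only. This is not the actual ARQ/Jena code; the class name BomSkippingExample and the method skipLeadingBom are hypothetical. It assumes the input has already been decoded to characters, so a UTF-8 BOM shows up as the single character U+FEFF, as in the error message quoted below.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration; not the wrapper that was added to ARQ.
public class BomSkippingExample {
    // The byte order mark as a decoded character (U+FEFF).
    private static final int BOM = 0xFEFF;

    /**
     * Wraps the given reader so that a single leading BOM, if present,
     * is consumed before the parser sees any input.
     */
    static Reader skipLeadingBom(Reader in) throws IOException {
        PushbackReader pushback = new PushbackReader(in, 1);
        int first = pushback.read();
        // Push the character back unless it is the BOM or end of input.
        if (first != -1 && first != BOM) {
            pushback.unread(first);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        Reader withBom = skipLeadingBom(
            new StringReader("\uFEFF@prefix ex: <http://example.org/> ."));
        // The first character the parser would see is now '@', not the BOM.
        System.out.println((char) withBom.read());
    }
}
```

The returned reader can be handed to the parser in place of the original one; input without a BOM passes through unchanged.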

	Andy

On 18/12/10 13:09, Rob Vesse wrote:
> Hi Andy
>
> I've created a JIRA issue for this -
> https://issues.apache.org/jira/browse/JENA-12
>
> I appreciate the need for minimal, complete examples as I have enough
> trouble getting those out of users on my own support lists
>
> Thanks,
>
> Rob
>
> On Fri, 17 Dec 2010 14:10:09 +0000, Andy Seaborne
> <andy.seaborne@epimorphics.com>  wrote:
>> Hi Rob,
>>
>> Thanks for the minimal, complete, example.
>>
>> The parsers should cope with a UTF-8 BOM even if it's not recommended.
>>
>> Could you raise a JIRA issue for this please (it's the new process!).
>> It'll need fixing in Jena and RIOT.
>>
>> 	Andy
>>
>> On 17/12/10 11:42, Rob Vesse wrote:
>>> Hi all
>>>
>>> I had this issue reported to me recently and have been able to confirm
>>> it myself (example data file attached). Essentially the issue is that if
>>> a Turtle file has a BOM at the start then Jena will refuse to parse it
>>> giving the following error:
>>>
>>> Exception in thread "main"
>>> com.hp.hpl.jena.n3.turtle.TurtleParseException: Lexical error at line 1,
>>> column 2. Encountered: "@" (64), after : "\ufeff"
>>> at com.hp.hpl.jena.n3.turtle.ParserTurtle.parse(ParserTurtle.java:44)
>>> at
>>> com.hp.hpl.jena.n3.turtle.TurtleReader.readWorker(TurtleReader.java:21)
>>> at com.hp.hpl.jena.n3.JenaReaderBase.readImpl(JenaReaderBase.java:101)
>>> at com.hp.hpl.jena.n3.JenaReaderBase.read(JenaReaderBase.java:68)
>>> at com.hp.hpl.jena.rdf.model.impl.ModelCom.read(ModelCom.java:226)
>>> at TurtleWithBOM.main(TurtleWithBOM.java:31)
>>>
>>> The code I used to produce this error was as follows:
>>>
>>> import com.hp.hpl.jena.rdf.model.*;
>>> import com.hp.hpl.jena.util.FileManager;
>>>
>>> import java.io.*;
>>>
>>> public class TurtleWithBOM
>>> {
>>>
>>>     public static void main(String[] args)
>>>     {
>>>
>>>         // create an empty model
>>>         Model model = ModelFactory.createDefaultModel();
>>>
>>>         InputStream in = FileManager.get().open( "ttl-with-bom.ttl" );
>>>         if (in == null)
>>>         {
>>>             throw new IllegalArgumentException( "File: ttl-with-bom.ttl not found");
>>>         }
>>>
>>>         // read the Turtle file
>>>         model.read(in, "", "TTL");
>>>
>>>         // write it to standard out
>>>         model.write(System.out);
>>>     }
>>> }
>>>
>>> A sample data file used with the above code to reproduce the error is
>>> attached.
>>>
>>> The data files are coming from my software which is all written in .Net
>>> and when outputting in UTF-8 the default behaviour of .Net is to include
>>> the BOM at the start of the file. The BOM is not required for UTF-8 but
>>> it is not forbidden so I think this should be fixed (if possible) for
>>> future releases. I will be modifying my software so that output of the
>>> BOM can be disabled by my users if desired
>>>
>>> Looking at the error message given I expect that the same problem would
>>> also affect N3 files since they are using the same reader afaict from
>>> the error trace.
>>>
>>> Regards,
>>>
>>> Rob Vesse
>>>
>>> --
>>> PhD Student
>>> IAM Group
>>> Bay 20, Room 4027, Building 32
>>> Electronics & Computer Science
>>> University of Southampton
>>>
>
