incubator-any23-dev mailing list archives

From Hannes Mühleisen (Updated) (JIRA) <j...@apache.org>
Subject [jira] [Updated] (ANY23-49) N3/NQ parsers ignoring stopAtFirstError flag
Date Fri, 17 Feb 2012 12:17:59 GMT

     [ https://issues.apache.org/jira/browse/ANY23-49?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hannes Mühleisen updated ANY23-49:
----------------------------------

    Description: 
The base interface for all RDF parsers (org.openrdf.rio.RDFParser) defines a method setStopAtFirstError.
The documentation for this method reads: "Sets whether the parser should stop immediately
if it finds an error in the data". This flag is very useful, as many data sets "out there"
contain some malformed entries.

However, as far as I can tell from the current source code (0.6.1 and SVN trunk), the NQuadsParser
(org.deri.any23.parser.NQuadsParser) ignores this flag. In its original implementation, it
loops over the entire input without ever checking the flag:

while(parseLine(fileReader)) {
    nextRow();
}

Now, if the parsing of any line in a potentially huge file throws an exception, the entire parsing
process stops regardless of the setting of the "stopAtFirstError" flag. I propose that these loops
be changed to honor this flag, so that when it is set to "false", the rest of the offending line
is discarded and parsing continues with the next line.
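
The proposed behavior can be illustrated with a minimal, self-contained sketch. It is
not the NQuadsParser code: the class LenientLineLoop, the method parseAll, and the toy
parseQuad check (exactly four whitespace-separated fields) are hypothetical stand-ins
chosen only to show how a line loop might honor the flag.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LenientLineLoop {

    /** Hypothetical stand-in for real statement parsing: requires exactly four fields. */
    static String[] parseQuad(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields, found " + fields.length);
        }
        return fields;
    }

    /**
     * Parses the input line by line. If stopAtFirstError is true, the first
     * malformed line aborts parsing; otherwise the error is recorded in
     * 'errors' and parsing continues with the next line.
     */
    static List<String[]> parseAll(BufferedReader reader, boolean stopAtFirstError,
                                   List<String> errors) throws IOException {
        List<String[]> quads = new ArrayList<>();
        String line;
        int row = 0;
        while ((line = reader.readLine()) != null) {
            row++;
            if (line.trim().isEmpty()) {
                continue; // blank lines are skipped, not reported
            }
            try {
                quads.add(parseQuad(line));
            } catch (IllegalArgumentException e) {
                if (stopAtFirstError) {
                    throw new IOException("row " + row + ": " + e.getMessage(), e);
                }
                // lenient mode: record the problem and move on
                errors.add("row " + row + ": " + e.getMessage());
            }
        }
        return quads;
    }

    public static void main(String[] args) throws IOException {
        List<String> errors = new ArrayList<>();
        BufferedReader in = new BufferedReader(new StringReader("a b c d\nbroken\ne f g h\n"));
        List<String[]> quads = parseAll(in, false, errors);
        System.out.println(quads.size() + " statements, errors: " + errors);
    }
}
```

Because this sketch reads whole lines with readLine(), a broken line is discarded
implicitly. A character-level parser such as NQuadsParser has to consume the remainder
of the line explicitly.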

I have implemented this behavior on the latest version of NQuadsParser from SVN (r1601); the
source file is attached. I have changed the parseLine() method as follows:

private boolean parseLine(BufferedReader br) throws IOException,
			RDFParseException, RDFHandlerException {
    // [...]
    try {
        // [...]
        // notifyStatement moved into the try block
        notifyStatement(sub, pred, obj, graph);
    } catch (EOS eos) {
        reportFatalError("Unexpected end of line.", row, col);
        throw new IllegalStateException();
    } catch (IllegalArgumentException iae) {
        if (!stopAtFirstError()) {
            // remove remainder of broken line
            consumeBrokenLine(br);
            // notify parse error listener
            reportError(iae.getMessage(), row, col);
        } else {
            throw new RDFParseException(iae);
        }
    }
    // [...]
}

private void consumeBrokenLine(BufferedReader br) throws IOException {
    char c;
    while (true) {
        mark(br);
        c = readChar(br);
        if (c == '\n') {
            return;
        }
    }
}
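
One case the excerpt above does not show is a broken final line with no trailing
newline, since the end-of-input behavior of the parser's own mark() and readChar()
helpers is not visible here. A minimal sketch that handles this case explicitly,
assuming a plain java.io.Reader instead of those helpers (the class SkipLine and
the method skipRestOfLine are hypothetical names):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class SkipLine {

    /**
     * Discards characters up to and including the next '\n'.
     * Returns true if a newline was consumed, false if the end of the
     * stream was reached first (broken last line without a newline).
     */
    static boolean skipRestOfLine(Reader reader) throws IOException {
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("garbage until newline\nnext line");
        boolean more = skipRestOfLine(r);
        // After skipping, the reader is positioned at the start of "next line".
        System.out.println(more + " " + (char) r.read());
    }
}
```

Returning a boolean lets the caller distinguish "continue with the next line" from
"input exhausted" without relying on an exception.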

It would be great if this or a similar change found its way into the various Any23 RDF
parsers.



  was:
The base interface for all RDF parsers (org.openrdf.rio.RDFParser) defines a method setStopAtFirstError.
The documentation for this method reads: "Sets whether the parser should stop immediately
if it finds an error in the data". This flag is very useful, as many data sets "out there"
contain some malformed entries.

However, as far as I can tell from the current source code (0.6.1 and SVN trunk), both the
NTriples parser (org.openrdf.rio.ntriples.NTriplesParser.NTriplesParser) and the NQuadsParser
(org.deri.any23.parser.NQuadsParser) ignore this flag. In their respective implementations,
they run through the entire files in an unchecked loop (see http://code.google.com/p/any23/source/browse/trunk/any23-core/src/main/java/org/deri/any23/io/nquads/NQuadsParser.java#100).


while(parseLine(fileReader)) {
    nextRow();
}

Now, if the parsing of any line in a potentially huge file throws an exception, the entire parsing
process stops regardless of the setting of the "stopAtFirstError" flag. I propose that these loops
be changed to honor this flag, so that when it is set to "false", the rest of the offending line
is discarded and parsing continues with the next line.


I have implemented this behavior on the latest version of NQuadsParser from SVN (r1601); the
source file is attached. I have changed the parseLine() method as follows:

private boolean parseLine(BufferedReader br) throws IOException,
			RDFParseException, RDFHandlerException {
    // [...]
    try {
        // [...]
        // notifyStatement moved into the try block
        notifyStatement(sub, pred, obj, graph);
    } catch (EOS eos) {
        reportFatalError("Unexpected end of line.", row, col);
        throw new IllegalStateException();
    } catch (IllegalArgumentException iae) {
        if (!stopAtFirstError()) {
            // remove remainder of broken line
            consumeBrokenLine(br);
            // notify parse error listener
            reportError(iae.getMessage(), row, col);
        } else {
            throw new RDFParseException(iae);
        }
    }
    // [...]
}

private void consumeBrokenLine(BufferedReader br) throws IOException {
    char c;
    while (true) {
        mark(br);
        c = readChar(br);
        if (c == '\n') {
            return;
        }
    }
}

It would be great if this or similar changes would find their way into the Any23 parsers.



    
> N3/NQ parsers ignoring stopAtFirstError flag
> --------------------------------------------
>
>                 Key: ANY23-49
>                 URL: https://issues.apache.org/jira/browse/ANY23-49
>             Project: Apache Any23
>          Issue Type: Bug
>         Environment: Any23 0.6.1 and repository
>            Reporter: Hannes Mühleisen
>         Attachments: RobustNquadsParser.java
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
