Mailing-List: contact issues-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: issues@commons.apache.org
Date: Wed, 29 Oct 2014 05:35:33 +0000 (UTC)
From: "Gary Gregory (JIRA)" <jira@apache.org>
To: issues@commons.apache.org
Message-ID: <JIRA.12739796.1410127520000.360957.1414560933741@Atlassian.JIRA>
In-Reply-To: <JIRA.12739796.1410127520000@Atlassian.JIRA>
References: <JIRA.12739796.1410127520000@Atlassian.JIRA>
 <JIRA.12739796.1410127520705@arcas>
Subject: [jira] [Comment Edited] (CSV-131) Save positions of records to
 enable random access
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CSV-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136714#comment-14136714 ] 

Gary Gregory edited comment on CSV-131 at 10/29/14 5:35 AM:
------------------------------------------------------------

The base changes are in place and seem uncontroversial. The attached patch {{ggregory-CSV-131-parser-and-record.diff}} completes the changes in the same sense that the original "Full" patch did. What I am not sure about is whether this is the right design. The usage in the unit test is: create a new parser with the given CSV data string and tell the parser which character position and record position the data really starts at. This is like saying: parse this new CSV data, but start counting characters at X and start counting records at Y. This feels funny. Why not just say, skip to record R or skip to character position P? I'd like feedback from the other CSV developers.
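
To make the usage under discussion concrete, here is a minimal, hand-rolled sketch of the "start counting characters at X and records at Y" idea. It does not use Commons CSV and is not taken from the patch: the class {{OffsetParseSketch}}, the type {{Rec}}, and the method {{parse}} are hypothetical names, and the parsing is deliberately naive (newline-terminated records, comma-separated values, no quoting).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Commons CSV API and not the attached patch. */
final class OffsetParseSketch {

    /** A parsed record plus the absolute positions assigned to it. */
    static final class Rec {
        final long recordNumber;
        final long characterPosition;
        final String[] values;

        Rec(long recordNumber, long characterPosition, String[] values) {
            this.recordNumber = recordNumber;
            this.characterPosition = characterPosition;
            this.values = values;
        }
    }

    /**
     * Parses a fragment cut out of a larger CSV source. The caller states where
     * the fragment really starts (characterOffset) and which record number its
     * first record really has (firstRecordNumber).
     * Assumptions: '\n' ends a record, ',' separates values, no quoting.
     */
    static List<Rec> parse(String fragment, long characterOffset, long firstRecordNumber) {
        List<Rec> out = new ArrayList<>();
        long recordNumber = firstRecordNumber;
        int start = 0;
        while (start < fragment.length()) {
            int end = fragment.indexOf('\n', start);
            if (end < 0) {
                end = fragment.length(); // last record without a trailing newline
            }
            String line = fragment.substring(start, end);
            out.add(new Rec(recordNumber++, characterOffset + start, line.split(",", -1)));
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // The fragment below is the tail of "a,b\nc,d\ne,f\n": it starts at
        // character 4 of that source, and its first record is record 2.
        for (Rec r : parse("c,d\ne,f\n", 4, 2)) {
            System.out.println(r.recordNumber + " @ " + r.characterPosition
                    + " -> " + String.join("|", r.values));
        }
    }
}
```

In this shape the caller has to do the seeking and then pass the bookkeeping values in, which is the part that feels funny. A "skip to record R" method would instead put the seeking inside the parser.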


was (Author: garydgregory):
The basement changes are in place and seem uncontroversial. The attached patch {{ggregory-CSV-131-parser-and-record.diff}} completes the changes in the same sense that the original "Full" patch did. What I am not sure about is whether this is the right design. The usage in the unit test is: Create a new parser with the give CSV data string and tell the parser what character position and record position the data really refers to. This is like saying: Parse this new CSV data but start counting characters as X and start counting records at Y. This feels funny. Why not just say, skip to record R or skip to char position P? I'd like feedback from the other CSV developers.

> Save positions of records to enable random access
> -------------------------------------------------
>
>                 Key: CSV-131
>                 URL: https://issues.apache.org/jira/browse/CSV-131
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.1
>            Reporter: Holger Stratmann
>            Priority: Minor
>         Attachments: CSV-131-gg-0.diff, PositionTrackingFull_v101_20140910.patch, PositionTrackingTest_20140907.patch, PositionTracking_20140907.patch, ggregory-CSV-131-parser-and-record.diff
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be good to have {{CSVRecord}} save its position in the source stream.
> Reason: Knowing the position of the records would enable random access to retrieve records from the source (after reading it once to build an index) if the file is too large to be read into memory (or if we don't want to read the full file to access a record in the middle).
> Additional info: I have created a "random access csv reader" and a "csv viewer" (Swing) for arbitrarily large CSV files. It requires one additional scan of the file to build an index (multi-byte charsets supported). The index can be saved to a file so it only needs to be built once. Because the lexer uses a BufferedReader, we need "internal information" to know where each record starts.
> The change to "core" is minor: one field in {{CSVRecord}}s and some associated methods to store the position.
> Patch will be attached.
> Code for random access (both UI and non-UI) will be proposed (and possibly submitted) as a separate issue. It could also be an independent add-on but requires this one little change to Commons CSV.
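
The index-then-seek approach described in the issue can be sketched roughly as follows. This is not the reporter's reader and does not use Commons CSV: {{RecordIndexSketch}}, {{buildIndex}}, and {{readRecord}} are hypothetical names, the file is assumed to be UTF-8, and records are assumed to end at '\n' with no newlines inside quoted values. The sketch indexes byte offsets because {{RandomAccessFile}} seeks by byte, and for multi-byte charsets byte offsets differ from character positions.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of index-then-seek access; not part of Commons CSV. */
final class RecordIndexSketch {

    /** One sequential pass: the byte offset at which each record starts. */
    static List<Long> buildIndex(Path file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;
            boolean atRecordStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atRecordStart) {
                    offsets.add(pos);
                    atRecordStart = false;
                }
                if (b == '\n') {
                    atRecordStart = true; // assumption: no newlines inside quoted values
                }
                pos++;
            }
        }
        return offsets;
    }

    /** Seeks straight to record n (0-based) and decodes only that record. */
    static String readRecord(Path file, List<Long> index, int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = index.get(n);
            long end = n + 1 < index.size() ? index.get(n + 1) : raf.length();
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(buf);
            // Drop the trailing line terminator before returning the record text.
            return new String(buf, StandardCharsets.UTF_8).replaceAll("[\\r\\n]+$", "");
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("csv-131-sketch", ".csv");
        Files.write(file, "id,name\n1,Zo\u00eb\n2,Bob\n".getBytes(StandardCharsets.UTF_8));
        List<Long> index = buildIndex(file);
        System.out.println(index);
        System.out.println(readRecord(file, index, 2));
        Files.delete(file);
    }
}
```

A real parser reading through a buffered reader has already consumed input past the current record, so the position of the underlying stream cannot serve as the record start. That is why the issue asks the parser itself to record each position.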

