asterixdb-dev mailing list archives

From Chen Li <che...@gmail.com>
Subject Re: loading CSV records with comma in the value
Date Sat, 08 Aug 2015 05:00:55 GMT
Taewoo helped me look into the issue.  To close this discussion: the
problem was that I was using an old Asterix version.  The current master
branch can parse CSV files properly.

Chen

On Sun, Jul 26, 2015 at 11:25 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
> @Chen: the format of your data file is not correct. According to the CSV
> RFC, the opening quote must come immediately after the delimiter (,).
> In your example, however, there is a white space between them. I saw the
> following error message, which complains about the file format. After
> removing the white space after the delimiter, it worked fine. So, if you
> correct the file format, it should work.
>
> At record: 1, field#: 2 - a quote enclosing a field needs to be placed in
> the beginning of that field. [IOException]
>
>
> [ { "id": 14i32, "authors": "John Smith, Mary Reeve" }
>  ]
>
>
>
> Best,
> Taewoo
>
> On Sun, Jul 26, 2015 at 10:47 PM, Chen Li <chenli@gmail.com> wrote:
>
>> I added the following line
>>
>> ("quote"="\"")
>>
>> to the load statement, but the problem remains: it mistakenly used the
>> "," in the "authors" field to break the record.
>>
>> @Taewoo: can you try the simple AQL example I included in this thread
>> to see if it can parse the quoted field correctly?
>>
>> Chen
>>
>> On Sun, Jul 26, 2015 at 1:25 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>> > We have test cases for this case. They are located in
>> > asterix-app/src/test/resources/runtimets/queries/load/.  The documentation
>> > is in /asterix-doc/src/site/markdown/csv.md. The additional syntax for
>> > CSV is fairly simple. You just have two additional parameters - "quote"
>> > and "header". Refer to the file for more details.
>> >
>> >
>> >
>> > Best,
>> > Taewoo
>> >
>> > On Sat, Jul 25, 2015 at 11:30 PM, Chen Li <chenli@gmail.com> wrote:
>> >
>> >> @Taewoo: I tried it and it has the same problem.  Do you have a test
>> >> case for this feature?  Also do we have documentation for this syntax?
>> >>
>> >> Chen
>> >>
>> >> On Sat, Jul 25, 2015 at 10:52 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>> >> > The URL is https://asterixdb.ics.uci.edu/documentation/aql/primer.html.
>> >> >
>> >> >
>> >> > It should look like this:
>> >> >
>> >> > ////
>> >> > use dataverse pubs;
>> >> >
>> >> > create type PaperType as open {
>> >> >    id: int32,
>> >> >    authors: string
>> >> > }
>> >> >
>> >> > create dataset Papers(PaperType) primary key id;
>> >> >
>> >> > load dataset Papers using localfs
>> >> > (("path"="127.0.01:///Users/chenli/tmp/asterix-data/papers.csv"),
>> >> >    ("format"="delimited-text"),
>> >> >    ("delimiter"=","));
>> >> >
>> >> > for $paper in dataset('Papers')
>> >> > return $paper;
>> >> >
>> >> >
>> >> >
>> >> > Best,
>> >> > Taewoo
>> >> >
>> >> > On Sat, Jul 25, 2015 at 10:47 PM, Chen Li <chenli@gmail.com> wrote:
>> >> >
>> >> >> @Taewoo: can you send me the syntax or the documentation URL to
>> >> >> show the syntax?
>> >> >>
>> >> >> Chen
>> >> >>
>> >> >> On Sat, Jul 25, 2015 at 3:27 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>> >> >> > Can you try to load it into an internal dataset? I think I have
>> >> >> > implemented the "comma between the comma (delimiter)" when modifying
>> >> >> > the delimited data parser. And Chris also modified that part, too. If
>> >> >> > it doesn't work, I can look at the issue.
>> >> >> >
>> >> >> > Best,
>> >> >> > Taewoo
>> >> >> >
>> >> >> > On Sat, Jul 25, 2015 at 1:51 PM, Chen Li <chenli@gmail.com> wrote:
>> >> >> >
>> >> >> >> Not sure if this topic was discussed before.  I was trying to load an
>> >> >> >> external CSV file using "," as the delimiter.  But the engine failed to
>> >> >> >> read a file with the following single record:
>> >> >> >>
>> >> >> >> 14, "John Smith, Mary Reeve"
>> >> >> >>
>> >> >> >>
>> >> >> >> use dataverse pubs;
>> >> >> >>
>> >> >> >>    create type PaperType as open {
>> >> >> >>       id: int32,
>> >> >> >>        authors: string
>> >> >> >>    }
>> >> >> >>
>> >> >> >> create external dataset Papers(PaperType)
>> >> >> >>    using localfs
>> >> >> >> (("path"="127.0.01:///Users/chenli/tmp/asterix-data/papers.csv"),
>> >> >> >>    ("format"="delimited-text"),
>> >> >> >>    ("delimiter"=","));
>> >> >> >>
>> >> >> >> for $paper in dataset('Papers')
>> >> >> >> return $paper;
>> >> >> >>
>> >> >> >> The following is the output, which shows that the comma in the authors
>> >> >> >> field was incorrectly used to break the field.  Any idea about how to
>> >> >> >> fix it?
>> >> >> >>
>> >> >> >> Output
>> >> >> >> Results:
>> >> >> >>
>> >> >> >> { "id": 14, "authors": " \"John Smith" }
>> >> >> >>
>> >> >> >> Duration of all jobs: 0.091 sec
>> >> >> >>
>> >> >> >> Success: Query Complete
>> >> >> >>
>> >> >>
>> >>
>>
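
The rule Taewoo describes in this thread (a quote that encloses a field has to be the first character after the delimiter) can be tried out without AsterixDB. The sketch below uses Python's standard csv module purely as a stand-in parser; it is not the AsterixDB delimited-text parser, and the helper name parse is made up for this illustration. It feeds in the single record from the thread as written, then with the space removed, and finally with this parser's own skipinitialspace option, which is a Python feature and not an AsterixDB load parameter.

```python
import csv
import io

# The single record from the thread, with a space after the delimiter.
RAW = '14, "John Smith, Mary Reeve"\n'


def parse(text, **options):
    """Parse CSV text with Python's csv module and return a list of rows.

    Hypothetical helper for this illustration only.
    """
    return list(csv.reader(io.StringIO(text), **options))


# Default settings: the quote is not the first character of the field, so it
# is kept as ordinary data and the comma inside the author list splits the
# field in two.
print(parse(RAW))
# [['14', ' "John Smith', ' Mary Reeve"']]

# Fix 1: correct the data so the quote directly follows the delimiter.
print(parse(RAW.replace(', "', ',"')))
# [['14', 'John Smith, Mary Reeve']]

# Fix 2 (specific to Python's csv module): skip whitespace after a delimiter.
print(parse(RAW, skipinitialspace=True))
# [['14', 'John Smith, Mary Reeve']]
```

With the default settings, the second field comes back as ' "John Smith', which resembles the truncated "authors" value in the first message of the thread. Whether a given AsterixDB version rejects such a record with an error or splits it silently depends on that version's parser, as the thread shows.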
