asterixdb-dev mailing list archives

From Yingyi Bu <buyin...@gmail.com>
Subject Re: Limitation of the current TweetParser
Date Tue, 23 Feb 2016 06:46:57 GMT
>> As for the cast-record, if we can add advanced type conversion, that will
>> be great.

I guess the flow could be a top-level JSON object (tuple) --> fully open
Asterix Record --> record with a required type.
To change the cast-record function, you can take a look at the code here:
https://github.com/apache/incubator-asterixdb/tree/master/asterix-om/src/main/java/org/apache/asterix/om/pointables/cast
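
To make the flow above concrete, here is a rough sketch in Python rather than in the actual AsterixDB Java code. Everything in it is an assumption for illustration: `to_open_record`, `cast_record`, and `REQUIRED_TYPE` are hypothetical names, and the `created_at` format is assumed to be the usual Twitter timestamp layout. It only shows the two steps (keep every field, then convert the declared ones by calling a constructor on the incoming value); it is not how cast-record is implemented.

```python
import json
from datetime import datetime

# Hypothetical "required type": declared field name -> constructor that is
# applied to the incoming (typically string) value.
REQUIRED_TYPE = {
    "id": int,
    # Assumed Twitter-style timestamp, e.g. "Mon Feb 22 16:46:00 +0000 2016".
    "created_at": lambda s: datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y"),
}


def to_open_record(raw_json):
    """Step 1: top-level JSON object -> fully open record, all fields kept."""
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        raise ValueError("expected a top-level JSON object")
    return record


def cast_record(open_record, required_type):
    """Step 2: open record -> record with a required type.

    Declared fields are converted with their constructor; undeclared
    fields are carried over unchanged as open fields.
    """
    result = dict(open_record)
    for name, constructor in required_type.items():
        if name not in open_record:
            raise KeyError(f"missing required field: {name}")
        result[name] = constructor(open_record[name])
    return result
```

With this sketch, `cast_record(to_open_record(raw), REQUIRED_TYPE)` would turn a string `id` into an integer and `created_at` into a datetime, while a field such as `place` passes through untouched.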

Best,
Yingyi


On Mon, Feb 22, 2016 at 10:40 PM, Jianfeng Jia <jianfeng.jia@gmail.com>
wrote:

> I’ve created issue 1318
> <https://issues.apache.org/jira/browse/ASTERIXDB-1318> for recovering the
> missing fields from the Twitter Stream JSON.
>
> As for the cast-record, if we can add advanced type conversion, that will
> be great.
>
> > On Feb 22, 2016, at 10:06 PM, Yingyi Bu <buyingyi@gmail.com> wrote:
> >
> > >> Maybe something we'd need for extra credit would be - if the data is
> > >> targeted at a dataset with "more schema" than the incoming wide open
> > >> records - the ability to do field level type conversions at the point of
> > >> entry into a dataset by calling the appropriate constructors with the
> > >> incoming string values?
> >
> > I guess we can have an enhanced version of the cast-record function to do
> > that?  It already considers the combination of complex types,
> > open-closeness, and type promotions.  Maybe we can enhance that with
> > temporal/spatial constructors?
> >
> > Best,
> > Yingyi
> >
> >
> > On Mon, Feb 22, 2016 at 8:50 PM, Mike Carey <dtabass@gmail.com> wrote:
> >
> >> We should definitely not be pulling in a subset of fields at the entry
> >> point - that's what the UDF is for (it can trim off or add or convert
> >> fields) - agreed.  Why not have the out-of-the-box adaptor simply keep all
> >> of the fields in their incoming form?  Maybe something we'd need for extra
> >> credit would be - if the data is targeted at a dataset with "more schema"
> >> than the incoming wide open records - the ability to do field level type
> >> conversions at the point of entry into a dataset by calling the appropriate
> >> constructors with the incoming string values?
> >>
> >>
> >> On 2/22/16 4:46 PM, Jianfeng Jia wrote:
> >>
> >>> Dear devs,
> >>>
> >>> TwitterFeedAdapter is nice, but the internal TweetParser has some
> >>> limitations.
> >>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
> >>> message fields. I need the place field. There are also other fields that
> >>> other applications may be interested in.
> >>> 2. The text fields always call getNormalizedString() to filter out the
> >>> non-ASCII chars, which is a big loss of information. Even in English
> >>> text there are emojis, which are not “normal”.
> >>> Apparently we can add the entire Twitter structure into this parser. I’m
> >>> wondering if the current one-to-one mapping between Adapter and Parser
> >>> design is the best approach? The Twitter data itself changes. Also, there
> >>> are a lot of interesting open data resources, e.g. Instagram, Facebook,
> >>> Weibo, Reddit ….  Could we have a general approach for all these data
> >>> sources?
> >>>
> >>> I’m thinking of having some field-level JSON-to-ADM parsers
> >>> (int, double, string, binary, point, time, polygon…). Then, given a
> >>> schema option through the Adapter, we can easily assemble the fields into
> >>> one record. The schema option could be a field mapping between the
> >>> original JSON id and the ADM type, e.g.
> >>> { “id”:Int64, “user”: { “userid”: int64,..} }. As such, we don’t have to
> >>> write a specific parser for each data source.
> >>>
> >>> Another thought is to just give the JSON object as it is, and rely on
> >>> the user’s UDF to parse the data. Again, even in this case, the user can
> >>> selectively override several field parsers that are different from ours.
> >>>
> >>> Any thoughts?
> >>>
> >>>
> >>> Best,
> >>>
> >>> Jianfeng Jia
> >>> PhD Candidate of Computer Science
> >>> University of California, Irvine
> >>>
> >>>
> >>>
> >>
>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
>
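
For reference, the schema-option idea from the original message in this thread (a mapping from JSON field names to ADM types, backed by small field-level parsers) could look roughly like the sketch below. It is written in Python for brevity and is not AsterixDB code: `FIELD_PARSERS`, `SCHEMA`, and `parse_with_schema` are hypothetical names, the type names are plain strings chosen for illustration, and keeping unmapped fields unchanged is an assumption that follows the "keep all of the fields" suggestion in the thread.

```python
# Hypothetical field-level parsers, keyed by a lowercase ADM-like type name.
FIELD_PARSERS = {
    "int64": int,
    "double": float,
    "string": str,
    # Assumes a point arrives as a two-element [x, y] array.
    "point": lambda v: (float(v[0]), float(v[1])),
}

# Hypothetical schema option, mirroring the example in the message:
# JSON field name -> type name, nested for nested objects.
SCHEMA = {"id": "Int64", "user": {"userid": "int64"}}


def parse_with_schema(value, schema, parsers=FIELD_PARSERS):
    """Apply field-level parsers according to a (possibly nested) schema.

    Fields that the schema does not mention are kept in their incoming form.
    """
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            raise TypeError("schema expects a nested object here")
        record = dict(value)
        for name, sub_schema in schema.items():
            if name in value:
                record[name] = parse_with_schema(value[name], sub_schema, parsers)
        return record
    # Leaf: look up the parser for this type name and apply it.
    return parsers[schema.lower()](value)
```

A caller who wants different behavior for one type can pass a modified copy of the parser table, for example `dict(FIELD_PARSERS, string=str.strip)`, which corresponds to selectively overriding a field parser without writing a source-specific parser.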
