asterixdb-dev mailing list archives

From Wail Alkowaileet <wael....@gmail.com>
Subject Re: external data set support
Date Sun, 14 Feb 2016 11:07:13 GMT
> One of the papers says that one should add comparators and hash functions
> for any new data types introduced by the external data set.  Which
> interface does one have to implement for that ?

In addition to what Abdullah listed, I guess you also need to write a SerDe
(ISerializerDeserializer) for those types, as well as printers for each
output format (JSON, LOSSLESS JSON, CSV and ADM) using IPrinterFactory and
IPrinter.
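
To make the pieces concrete, here is a rough, self-contained sketch of the
three roles involved for a new type: a serializer/deserializer, a comparator
that works directly on serialized bytes, and a printer for one output format.
Everything in it is an assumption for illustration: the SketchSerDe,
SketchBinaryComparator and SketchPrinter interfaces are simplified stand-ins
and not the real ISerializerDeserializer, IBinaryComparator or IPrinter
signatures, and the Point type is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Illustration only: the nested interfaces are hypothetical stand-ins,
// NOT the actual AsterixDB/Hyracks interfaces.
final class PointTypeSketch {

    interface SketchSerDe<T> {
        void serialize(T value, DataOutput out) throws IOException;
        T deserialize(DataInput in) throws IOException;
    }

    interface SketchBinaryComparator {
        // Compares two values in their serialized form (array, start, length).
        int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
    }

    interface SketchPrinter {
        void print(byte[] b, int s, int l, StringBuilder out);
    }

    // Hypothetical new data type: a 2D point with integer coordinates.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serialized layout: 4 bytes for x, then 4 bytes for y (big-endian).
    static final SketchSerDe<Point> SERDE = new SketchSerDe<Point>() {
        @Override
        public void serialize(Point p, DataOutput out) throws IOException {
            out.writeInt(p.x);
            out.writeInt(p.y);
        }
        @Override
        public Point deserialize(DataInput in) throws IOException {
            int x = in.readInt();
            int y = in.readInt();
            return new Point(x, y);
        }
    };

    // Orders points by x, then by y, without building Point objects.
    static final SketchBinaryComparator CMP = (b1, s1, l1, b2, s2, l2) -> {
        ByteBuffer left = ByteBuffer.wrap(b1, s1, l1);
        ByteBuffer right = ByteBuffer.wrap(b2, s2, l2);
        int c = Integer.compare(left.getInt(), right.getInt());
        return c != 0 ? c : Integer.compare(left.getInt(), right.getInt());
    };

    // Prints the serialized point as a small JSON object.
    static final SketchPrinter JSON_PRINTER = (b, s, l, out) -> {
        ByteBuffer buf = ByteBuffer.wrap(b, s, l);
        int x = buf.getInt();
        int y = buf.getInt();
        out.append("{\"x\": ").append(x).append(", \"y\": ").append(y).append("}");
    };

    static byte[] toBytes(Point p) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            SERDE.serialize(p, new DataOutputStream(bos));
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Point fromBytes(byte[] bytes) {
        try {
            return SERDE.deserialize(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String printJson(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        JSON_PRINTER.print(bytes, 0, bytes.length, sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] a = toBytes(new Point(1, 5));
        byte[] b = toBytes(new Point(2, 0));
        System.out.println(printJson(a));
        System.out.println(CMP.compare(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

The comparator and printer take (bytes, start, length) so that values can be
compared and printed in place, without deserializing them first. A real
implementation would have to follow the actual interface contracts in the
codebase, which may differ from this sketch.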

On Sun, Feb 14, 2016 at 10:44 AM, abdullah alamoudi <bamousaa@gmail.com>
wrote:

> Hi Sandeep,
> Here are the answers as per my understanding of the questions:
>
> 1) Schema catalog : One would have to implement IMetadataProvider,
> IDataSource, IDataSourceIndex and other related classes.  Is there any
> functionality missing from the current schema implementation for external
> data sets ?
> Schema information for external data already exists and we use the
> AqlMetadataProvider for both external and internal datasets.
>
> One of the papers says that one should add comparators and hash functions
> for any new data types introduced by the external data set.  Which
> interface does one have to implement for that ?
> I am not sure which paper you're referring to, but for adding new data
> types (regardless of whether they are used with internal or external
> datasets; there is really no distinction), here is what needs to be done:
> 1. For complex types, one can simply define a type using the create type
> statement.
> 2. For completely new types, one needs to implement at least {IAType,
> IBinaryComparatorFactory, and IBinaryComparator}. I am not sure if that is
> enough but that is a starting point.
>
> 2) Query optimization : There is no cost-based optimizer yet within
> Algebricks, therefore there is no API to support retrieval and use of table
> statistics from an external data source.
>
> Is something planned in this regard ?
> A cost-based optimizer for internal datasets is being worked on (@Ildar
> might add here). As for external data, unfortunately right now we don't
> even employ some easy rule-based optimizations. For example, we could use
> the RC file structure to push projection into the data source operator,
> but we don't do that yet. Another possible optimization is lazy
> deserialization of records, but again we don't do that. There are plans
> to do all of these, but we are short on manpower. You are welcome to give
> them a shot and we can assist.
>
>
> 3) Data fetch and update : The VLDB'14 paper states that external data sets
> are read-only, static and without indices, but the current codebase has
> support for IExternalIndex and IIndexibleExternalDataSource, so presumably
> I can fetch records from an external data source (base table scan as well
> as index).
> Yes, we can access external data through indexes. At the time the VLDB'14
> paper was published, we probably didn't have this feature yet. You can
> check http://dl.acm.org/citation.cfm?id=2806428 which is about external
> data access and indexing.
>
> Can I write to an external data source ?
> Right now, this is not supported because we can't provide the same
> transactional guarantees we can with internal datasets. This point probably
> needs to be discussed with Mike before doing anything about it. I believe
> we offer something else that can be used instead, which is writing query
> results into files, but I am not sure.
>
>
> 4) Hyracks runtime : For data retrieval, is it sufficient to implement the
> interfaces within asterix.external.api or does one also have to add some
> Hyracks operators which are constructed via contributeRuntimeOperator ?
>
> For data retrieval, one only needs to implement IExternalDataSourceFactory
> along with IRecordReader<? extends T> or IInputStreamProvider (depending on
> whether the source produces a stream or a set of records).
>
> For data parsing, one only needs to implement IDataParserFactory along
> with IRecordDataParser<T> or IStreamDataParser (depending on whether the
> parsed data source produces a stream or a set of records).
>
> Let me know if I can provide more information.
> Cheers,
> Abdullah.
>
> P.S.
> Thanks for doing your work before asking. This is a great sign :)
>
> Amoudi, Abdullah.
>
> On Sun, Feb 14, 2016 at 10:17 AM, Sandeep Joshi <sanjos100@gmail.com>
> wrote:
>
> > Can someone describe the level of support for External data sets and the
> > future roadmap ?
> >
> > Let me divide the question into four broad issues:
> >
> > 1) Schema catalog : One would have to implement IMetadataProvider,
> > IDataSource, IDataSourceIndex and other related classes.  Is there any
> > functionality missing from the current schema implementation for external
> > data sets ?
> >
> > One of the papers says that one should add comparators and hash functions
> > for any new data types introduced by the external data set.  Which
> > interface does one have to implement for that ?
> >
> > 2) Query optimization : There is no cost-based optimizer yet within
> > Algebricks, therefore there is no API to support retrieval and use of
> > table statistics from an external data source.
> >
> > Is something planned in this regard ?
> >
> > 3) Data fetch and update : The VLDB'14 paper states that external data
> > sets are read-only, static and without indices, but the current codebase
> > has support for IExternalIndex and IIndexibleExternalDataSource, so
> > presumably I can fetch records from an external data source (base table
> > scan as well as index).
> >
> > Can I write to an external data source ?
> >
> > 4) Hyracks runtime : For data retrieval, is it sufficient to implement
> > the interfaces within asterix.external.api or does one also have to add
> > some Hyracks operators which are constructed via
> > contributeRuntimeOperator ?
> >
> > -Sandeep
> >
>



-- 

*Regards,*
Wail Alkowaileet
