From abdullah alamoudi <bamou...@gmail.com>
Subject Re: external data set support
Date Sun, 14 Feb 2016 07:44:51 GMT
Hi Sandeep,
Here are the answers as per my understanding of the questions:

1) Schema catalog : One would have to implement IMetadataProvider,
IDataSource, IDataSourceIndex and other related classes.  Is there any
functionality missing from the current schema implementation for external
data sets ?
Schema information for external data already exists and we use the
AqlMetadataProvider for both external and internal datasets.

One of the papers says that one should add comparators and hash functions
for any new data types introduced by the external data set.  Which
interface does one have to implement for that ?
I am not sure which paper you're referring to, but for adding new data types
(whether for use with internal or external datasets; there is really no
distinction), here is what needs to be done:
1. For complex types, one can simply define a type using the create type
statement.
2. For completely new types, one needs to implement at least {IAType,
IBinaryComparatorFactory, and IBinaryComparator}. I am not sure if that is
enough but that is a starting point.
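
To make the comparator part of point 2 concrete, here is a minimal sketch
of ordering values directly on their serialized bytes. The interfaces
BinaryComparatorSketch and BinaryComparatorFactorySketch are hypothetical
stand-ins defined in the snippet itself; they only imitate the general
factory/comparator split and are not the real Hyracks interfaces. The
"point" type and its byte layout are also assumptions for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a binary comparator: compares two values in
// place inside byte arrays, given each value's start offset and length.
interface BinaryComparatorSketch {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

// Hypothetical stand-in for the factory that creates comparators.
interface BinaryComparatorFactorySketch {
    BinaryComparatorSketch createBinaryComparator();
}

// Assumed example type: a 2D point serialized as two big-endian
// 32-bit integers (x first, then y), 8 bytes in total.
final class PointComparatorFactory implements BinaryComparatorFactorySketch {

    // Serializes a point with ByteBuffer's default big-endian byte order.
    static byte[] serialize(int x, int y) {
        return ByteBuffer.allocate(8).putInt(x).putInt(y).array();
    }

    // Reads one big-endian int starting at the given offset.
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Orders points by x, breaking ties by y, without building objects.
    @Override
    public BinaryComparatorSketch createBinaryComparator() {
        return (b1, s1, l1, b2, s2, l2) -> {
            int c = Integer.compare(readInt(b1, s1), readInt(b2, s2));
            return c != 0 ? c : Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
        };
    }
}
```

The point of comparing on bytes is that sorting and joining can work on
serialized records without deserializing them first.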

2) Query optimization : There is no cost-based optimizer yet within
Algebricks, therefore there is no API to support retrieval and use of table
statistics from an external data source.

Is something planned in this regard ?
A cost-based optimizer for internal datasets is being worked on (@Ildar might
add more here). As for external data, unfortunately we don't yet employ even
some easy rule-based optimizations. For example, we could use the structure
of RC files to push projections into the data source operator, but we don't do
that yet. Another possible optimization is lazy deserialization of records,
but again we don't do that. There are plans to do all of these, but we are
short on manpower. You are welcome to give them a shot, and we can assist.
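
To illustrate what lazy deserialization means here, below is a minimal
sketch that is not tied to any AsterixDB code. LazyRecord is a
hypothetical class, and the comma-delimited record format is an
assumption; the idea is only that the raw record is kept as-is and split
into fields the first time a field is actually requested.

```java
// Hypothetical record wrapper: keeps the raw text and defers parsing.
final class LazyRecord {
    private final String raw;   // unparsed, comma-delimited record (assumed format)
    private String[] fields;    // stays null until the first field access
    private int parseCount;     // counts how often the record was parsed

    LazyRecord(String raw) {
        this.raw = raw;
    }

    // Parses the record on first access and reuses the cached result afterwards.
    String field(int index) {
        if (fields == null) {
            fields = raw.split(",", -1); // limit -1 keeps trailing empty fields
            parseCount++;
        }
        return fields[index];
    }

    boolean isParsed() {
        return fields != null;
    }

    int parseCount() {
        return parseCount;
    }
}
```

A record that is filtered out or never projected is never parsed at all,
which is where the saving would come from.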


3) Data fetch and update : The VLDB'14 paper states that external data sets
are read-only, static and without indices, but the current codebase has
support for IExternalIndex and IIndexibleExternalDataSource, so presumably
I can fetch records from an external data source (base table scan as well
as index).
Yes, we can access external data through indexes. We probably didn't have
this feature yet when the VLDB'14 paper was published. You can check
http://dl.acm.org/citation.cfm?id=2806428, which is about external data
access and indexing.

Can I write to an external data source ?
Right now, this is not supported because we can't provide the same
transactional guarantees that we can with internal datasets. This point
probably needs to be discussed with Mike before doing anything about it. I
believe we offer something else that could be used instead, namely writing
query results into files, but I am not sure.


4) Hyracks runtime : For data retrieval, is it sufficient to implement the
interfaces within asterix.external.api or does one also have to add some
Hyracks operators which are constructed via contributeRuntimeOperator ?

For data retrieval, one only needs to implement IExternalDataSourceFactory
along with IRecordReader<? extends T> or IInputStreamProvider (depending on
whether the source produces a stream or a set of records).

For data parsing, one only needs to implement IDataParserFactory along
with IRecordDataParser<T> or IStreamDataParser (depending on whether the
parsed data source produces a stream or a set of records).
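
The split between reading and parsing for a record-based source can be
sketched roughly as follows. RecordReaderSketch and RecordParserSketch
are hypothetical, simplified interfaces defined in the snippet; they do
not reproduce the signatures of IRecordReader or IRecordDataParser, and
the key=value record format is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader side: hands out raw records of type T one at a time.
interface RecordReaderSketch<T> {
    boolean hasNext();
    T next();
}

// Hypothetical parser side: turns one raw record into named fields.
interface RecordParserSketch<T> {
    Map<String, String> parse(T record);
}

// Reader over an in-memory list of lines, standing in for an external source.
final class LineReader implements RecordReaderSketch<String> {
    private final Iterator<String> it;

    LineReader(List<String> lines) {
        this.it = lines.iterator();
    }

    @Override public boolean hasNext() { return it.hasNext(); }
    @Override public String next() { return it.next(); }
}

// Parser for records shaped like "id=1;name=a" (assumed format).
final class KeyValueParser implements RecordParserSketch<String> {
    @Override
    public Map<String, String> parse(String record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : record.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments that are not key=value
            }
            out.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return out;
    }
}

// Drives the scan: the reader supplies records, the parser interprets them.
final class ExternalScanSketch {
    static <T> List<Map<String, String>> scan(RecordReaderSketch<T> reader,
                                              RecordParserSketch<T> parser) {
        List<Map<String, String>> rows = new ArrayList<>();
        while (reader.hasNext()) {
            rows.add(parser.parse(reader.next()));
        }
        return rows;
    }
}
```

Because the reader and the parser only meet at the record type T, a new
source can reuse an existing parser and a new format can reuse an
existing reader.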

Let me know if I can provide more information.
Cheers,
Abdullah.

P.S.
Thanks for doing your work before asking. This is a great sign :)

Amoudi, Abdullah.

On Sun, Feb 14, 2016 at 10:17 AM, Sandeep Joshi <sanjos100@gmail.com> wrote:

> Can someone describe the level of support for External data sets and the
> future roadmap ?
>
> Let me divide the question into four broad issues:
>
> 1) Schema catalog : One would have implement IMetadataProvider,
> IDataSource, IDataSourceIndex and other related classes.  Is there any
> functionality missing from the current schema implementation for external
> data sets ?
>
> One of the papers says that one should add comparators and hash functions
> for any new data types introduced by the external data set.  Which
> interface does one have to implement for that ?
>
> 2) Query optimization : There is no cost-based optimizer yet within
> Algebricks, therefore there is no API to support retrieval and use of table
> statistics from an external data source.
>
> Is something planned in this regard ?
>
> 3) Data fetch and update : The VLDB'14 paper states that external data sets
> are read-only, static and without indices, but the current codebase has
> support for IExternalIndex and IIndexibleExternalDataSource, so presumably
> I can fetch records from an external data source (base table scan as well
> as index).
>
> Can I write to an external data source ?
>
> 4) Hyracks runtime : For data retrieval, is it sufficient to implement the
> interfaces within asterix.external.api or does one also have to add some
> Hyracks operators which are constructed via contributeRuntimeOperator ?
>
> -Sandeep
>
