asterixdb-dev mailing list archives

From Wail Alkowaileet <wael....@gmail.com>
Subject Re: external data set support
Date Wed, 17 Feb 2016 06:12:04 GMT
From a user perspective:

I think the following small features would be really helpful:

About writing to external sources: currently I'm using Spark to write the
results to other data sources, to store them in the Parquet file format, or
even to convert CSV to JSON (to work around the limitations of the Asterix
CSV parser). Being able to round-trip data directly would be really useful.
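
The kind of round trip I do today looks roughly like this (a minimal
sketch in Java against Spark's DataFrame API, 2.x style; the paths,
options, and schema inference are placeholders/assumptions, not what I
actually run):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvRoundTrip {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-round-trip")
                .getOrCreate();

        // Read the CSV input (header + schema inference are assumptions).
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/data/results.csv");            // placeholder path

        // Convert to JSON so it can be loaded without the CSV parser ...
        df.write().json("/data/results-json");        // placeholder path

        // ... or keep a Parquet copy for other consumers.
        df.write().parquet("/data/results-parquet");  // placeholder path

        spark.stop();
    }
}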

There is also reading from sources. I extended the file-based adapter a bit
to read all the files in a folder; this is much friendlier when reading a
large number of HDFS-style (part0000) files, since it's tedious to write
out the path of every single file. (This currently works only for localfs.)
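
The expansion itself is simple; a minimal sketch of the idea (plain
java.nio, not the actual adapter code; the glob pattern is a placeholder):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PartFileExpander {
    // Expand a folder into the list of part files it contains, so the
    // user only has to give the folder path once instead of every file.
    public static List<Path> expand(String folder) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(Paths.get(folder), "part*")) {
            for (Path p : stream) {
                files.add(p);
            }
        }
        return files;
    }
}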

Also (in my free time) I'm working on the case of ingesting one large file
from the local/NC file system. It seems the number of threads AsterixDB
spawns depends on the number of files being loaded, so in the case of one
large file there is only one thread to parse it. I'm not sure about the
other way around (i.e., when we have several thousand files): would it
create several thousand threads? It seems like it does, as my Eclipse
debugger crashed.
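
One idea I'm experimenting with is to split a single large file into
newline-aligned byte ranges and hand each range to its own parsing thread.
A rough sketch of computing the ranges (plain Java, not tied to the actual
Asterix adapter code; records are assumed to end with a newline):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class FileSplitter {
    // Compute newline-aligned [start, end) byte ranges so each chunk
    // can be parsed independently by its own thread.
    public static List<long[]> split(String path, int numChunks) throws IOException {
        List<long[]> ranges = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            long chunkSize = Math.max(1, length / numChunks);
            long start = 0;
            for (int i = 0; i < numChunks && start < length; i++) {
                long end = (i == numChunks - 1) ? length
                        : Math.min(length, start + chunkSize);
                if (end < length) {
                    // Advance to the end of the current record so no
                    // record is cut in half between two chunks.
                    raf.seek(end);
                    while (end < length && raf.read() != '\n') {
                        end++;
                    }
                    end = Math.min(end + 1, length); // step past the newline
                }
                ranges.add(new long[] { start, end });
                start = end;
            }
        }
        return ranges;
    }
}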

Within Hyracks: if we can still maintain the parallelism of group by,
distinct by, order by, and limit, that would be really awesome! I always
try to avoid them for performance reasons.


On Mon, Feb 15, 2016 at 1:38 AM, Mike Carey <dtabass@gmail.com> wrote:

> Sandeep,
>
> http://dl.acm.org/citation.cfm?id=2806428 is another useful paper to look
> at.
> (This one covers the external data support in more detail than the earlier
> papers.)
>
> We would absolutely love to do some of the things that Abdullah raised
> here, e.g.,
> pushing more selection/projection into the accesses to file formats (like
> Parquet)
> that support and would benefit from that.  What we have now makes it
> possible to
> treat external files as queryable data, but there's lots of room for
> improvement in
> terms of ultimate efficiency - and it would be cool to get others working
> on that.
> A bunch of that, as Abdullah says, doesn't require cost-based optimization
> - just
> optimizer rules to push the pushable criteria into the file access itself
> (as well as
> the runtime support to make that pushing possible).
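>
> To give a feel for the shape of such a rule, here is a minimal,
> hypothetical skeleton against the Algebricks rewrite-rule interface
> (class and method names are from memory and may not match the current
> tree exactly, and the pushdown hook on the scan is invented purely for
> illustration):
>
> import org.apache.commons.lang3.mutable.Mutable;
> import org.apache.hyracks.algebricks.common.exceptions.AlgebricksException;
> import org.apache.hyracks.algebricks.core.algebra.base.ILogicalOperator;
> import org.apache.hyracks.algebricks.core.algebra.base.IOptimizationContext;
> import org.apache.hyracks.algebricks.core.algebra.base.LogicalOperatorTag;
> import org.apache.hyracks.algebricks.core.algebra.operators.logical.SelectOperator;
> import org.apache.hyracks.algebricks.core.rewriter.base.IAlgebraicRewriteRule;
>
> public class PushSelectIntoExternalScanRule implements IAlgebraicRewriteRule {
>
>     @Override
>     public boolean rewritePre(Mutable<ILogicalOperator> opRef,
>             IOptimizationContext context) throws AlgebricksException {
>         return false; // nothing to do in the pre-order pass
>     }
>
>     @Override
>     public boolean rewritePost(Mutable<ILogicalOperator> opRef,
>             IOptimizationContext context) throws AlgebricksException {
>         ILogicalOperator op = opRef.getValue();
>         if (op.getOperatorTag() != LogicalOperatorTag.SELECT) {
>             return false;
>         }
>         SelectOperator select = (SelectOperator) op;
>         ILogicalOperator child = select.getInputs().get(0).getValue();
>         if (child.getOperatorTag() != LogicalOperatorTag.DATASOURCESCAN) {
>             return false;
>         }
>         // Hypothetical next step: hand the SELECT's condition to the scan
>         // so a format like Parquet/RC can evaluate it while reading, then
>         // splice the SELECT out of the plan. That needs a pushdown hook on
>         // the external data source plus matching runtime support:
>         //   externalScan.setPushedCondition(select.getCondition());
>         //   opRef.setValue(child);
>         return false; // skeleton only; no rewrite is actually performed
>     }
> }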
>
> It would be cool to offer writing to external sources someday - but -
> there are a
> lot of questions that would have to be answered first.  (Producing results
> in various
> file formats would be a great first / non-transaction-requiring step.)
>
> Cheers,
> Mike
>
>
> On 2/13/16 11:44 PM, abdullah alamoudi wrote:
>
>> Hi Sandeep,
>> Here are the answers as per my understanding of the questions:
>>
>> 1) Schema catalog: One would have to implement IMetadataProvider,
>> IDataSource, IDataSourceIndex and other related classes.  Is there any
>> functionality missing from the current schema implementation for external
>> data sets ?
>> Schema information for external data already exists and we use the
>> AqlMetadataProvider for both external and internal datasets.
>>
>> One of the papers says that one should add comparators and hash functions
>> for any new data types introduced by the external data set.  Which
>> interface does one have to implement for that ?
>> I am not sure which paper you're referring to, but for adding new data
>> types (regardless of whether they are used with internal or external
>> datasets; there is really no distinction), here is what needs to be done:
>> 1. For complex types, one can simply define a type using the create type
>> statement.
>> 2. For completely new types, one needs to implement at least {IAType,
>> IBinaryComparatorFactory, and IBinaryComparator}. I am not sure if that is
>> enough but that is a starting point.
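>>
>> As a rough sketch of what the second case can look like (the type and
>> its byte-wise ordering here are made up for illustration, and the
>> Hyracks interface signatures are approximated from memory; check
>> org.apache.hyracks.api.dataflow.value for the exact contracts):
>>
>> import org.apache.hyracks.api.dataflow.value.IBinaryComparator;
>> import org.apache.hyracks.api.dataflow.value.IBinaryComparatorFactory;
>>
>> // Hypothetical comparator factory for a new serialized type.
>> public class MyTypeBinaryComparatorFactory implements IBinaryComparatorFactory {
>>
>>     private static final long serialVersionUID = 1L;
>>     public static final MyTypeBinaryComparatorFactory INSTANCE =
>>             new MyTypeBinaryComparatorFactory();
>>
>>     private MyTypeBinaryComparatorFactory() {
>>     }
>>
>>     @Override
>>     public IBinaryComparator createBinaryComparator() {
>>         return new IBinaryComparator() {
>>             @Override
>>             public int compare(byte[] b1, int s1, int l1,
>>                     byte[] b2, int s2, int l2) {
>>                 // Placeholder ordering: compare the serialized bytes
>>                 // lexicographically, shorter value first on a tie.
>>                 int len = Math.min(l1, l2);
>>                 for (int i = 0; i < len; i++) {
>>                     int c = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
>>                     if (c != 0) {
>>                         return c;
>>                     }
>>                 }
>>                 return l1 - l2;
>>             }
>>         };
>>     }
>> }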
>>
>> 2) Query optimization : There is no cost-based optimizer yet within
>> Algebricks, therefore there is no API to support retrieval and use of
>> table
>> statistics from an external data source.
>>
>> Is something planned in this regard ?
>> A cost-based optimizer for internal datasets is being worked on (@Ildar
>> might add more here). As for external data, unfortunately right now we
>> don't even employ some easy rule-based optimizations. For example, we
>> could use the structure of RC files to push projections into the data
>> source operator, but we don't do that yet. Another optimization that
>> could be done is lazy deserialization of records, but again we don't do
>> that. There are plans to do all of these, but we have a manpower
>> shortage. You are welcome to give them a shot and we can assist.
>>
>>
>> 3) Data fetch and update : The VLDB'14 paper states that external data
>> sets
>> are read-only, static and without indices, but the current codebase has
>> support for IExternalIndex and IIndexibleExternalDataSource, so presumably
>> I can fetch records from an external data source (base table scan as well
>> as index).
>> Yes, we can access external data through indexes. Probably, by the time
>> the VLDB'14 paper was published, we didn't have this feature yet. You can
>> check http://dl.acm.org/citation.cfm?id=2806428, which is about external
>> data access and indexing.
>>
>> Can I write to an external data source ?
>> Right now, this is not supported because we can't provide the same
>> transactional guarantees that we can with internal datasets. This point
>> probably needs to be discussed with Mike before doing anything about it.
>> I believe we offer something else that can be utilized, which is writing
>> query results into files, but I am not sure.
>>
>>
>> 4) Hyracks runtime : For data retrieval, is it sufficient to implement the
>> interfaces within asterix.external.api or does one also have to add some
>> Hyracks operators which are constructed via contributeRuntimeOperator ?
>>
>> For data retrieval, one only needs to implement IExternalDataSourceFactory
>> along with IRecordReader<? extends T> or IInputStreamProvider (depending
>> on whether the source produces a set of records or a stream).
>>
>> For data parsing, one only needs to implement IDataParserFactory along
>> with IRecordDataParser<T> or IStreamDataParser (depending on whether the
>> parsed source produces a set of records or a stream).
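>>
>> To make the record-based case concrete, here is a tiny line-oriented
>> reader sketch. It deliberately does not implement the real interfaces
>> (their exact method sets live in asterix.external.api); it only shows
>> the hasNext/next/close pattern such a reader follows:
>>
>> import java.io.BufferedReader;
>> import java.io.FileReader;
>> import java.io.IOException;
>>
>> // Illustrative only: a real reader would implement IRecordReader and be
>> // created by an IExternalDataSourceFactory; records here are just lines.
>> public class LineRecordReaderSketch implements AutoCloseable {
>>
>>     private final BufferedReader reader;
>>     private String nextLine;
>>
>>     public LineRecordReaderSketch(String path) throws IOException {
>>         this.reader = new BufferedReader(new FileReader(path));
>>         this.nextLine = reader.readLine(); // pre-fetch the first record
>>     }
>>
>>     public boolean hasNext() {
>>         return nextLine != null;
>>     }
>>
>>     public String next() throws IOException {
>>         String current = nextLine;
>>         nextLine = reader.readLine(); // pre-fetch the following record
>>         return current;
>>     }
>>
>>     @Override
>>     public void close() throws IOException {
>>         reader.close();
>>     }
>> }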
>>
>> Let me know if I can provide more information.
>> Cheers,
>> Abdullah.
>>
>> P.S.
>> Thanks for doing your work before asking. This is a great sign :)
>>
>> Amoudi, Abdullah.
>>
>> On Sun, Feb 14, 2016 at 10:17 AM, Sandeep Joshi <sanjos100@gmail.com>
>> wrote:
>>
>>> Can someone describe the level of support for external data sets and the
>>> future roadmap ?
>>>
>>> Let me divide the question into four broad issues:
>>>
>>> 1) Schema catalog: One would have to implement IMetadataProvider,
>>> IDataSource, IDataSourceIndex and other related classes.  Is there any
>>> functionality missing from the current schema implementation for external
>>> data sets ?
>>>
>>> One of the papers says that one should add comparators and hash functions
>>> for any new data types introduced by the external data set.  Which
>>> interface does one have to implement for that ?
>>>
>>> 2) Query optimization : There is no cost-based optimizer yet within
>>> Algebricks, therefore there is no API to support retrieval and use of
>>> table
>>> statistics from an external data source.
>>>
>>> Is something planned in this regard ?
>>>
>>> 3) Data fetch and update : The VLDB'14 paper states that external data
>>> sets
>>> are read-only, static and without indices, but the current codebase has
>>> support for IExternalIndex and IIndexibleExternalDataSource, so
>>> presumably
>>> I can fetch records from an external data source (base table scan as well
>>> as index).
>>>
>>> Can I write to an external data source ?
>>>
>>> 4) Hyracks runtime : For data retrieval, is it sufficient to implement
>>> the
>>> interfaces within asterix.external.api or does one also have to add some
>>> Hyracks operators which are constructed via contributeRuntimeOperator ?
>>>
>>> -Sandeep
>>>
>>>
>


-- 

Regards,
Wail Alkowaileet
