asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Do we have a method to append local files to existed dataset?
Date Sun, 06 Mar 2016 00:23:43 GMT
:-)  Thx!!

On 3/5/16 2:12 AM, abdullah alamoudi wrote:
> Not hard at all. (about 5 minutes of work).
>
> Will create a change for it.
>
> On Sat, Mar 5, 2016 at 1:31 AM, Mike Carey <dtabass@gmail.com> wrote:
>
>> It would be nice to have the parallelism of loading be
>> dataset-property-determined rather than number-of-input-files determined
>> (e.g., min(number of partitions, number of input files)) and then have the
>> leaves of the load job each handle a delegated list of files.  How hard
>> would that be?  :-)
>>
>> On 3/4/16 2:04 PM, Young-Seok Kim wrote:
>>
>>> That makes sense.
>>>
>>> Cheers,
>>> Young-Seok
>>>
>>> On Fri, Mar 4, 2016 at 1:48 PM, Yingyi Bu <buyingyi@gmail.com> wrote:
>>>
>>>> Young-Seok,
>>>>
>>>> That works when the number of local files is relatively small.
>>>> However, when the number of localfs files is 1000, the 1000 files will be
>>>> loaded in parallel simultaneously, which will exhaust all system resources.
>>>> Loading from HDFS doesn't have the problem because the 1000 (or more) file
>>>> splits will be queued into each parallel loader.
>>>>
>>>> Best,
>>>> Yingyi
>>>>
>>>>
>>>> On Fri, Mar 4, 2016 at 1:42 PM, Young-Seok Kim <kisskys@gmail.com>
>>>> wrote:
>>>>
>>>>> You can also load multiple adm files into the same dataset with a single
>>>>> AQL statement, as follows:
>>>>>
>>>>> load dataset Tweets
>>>>> using "org.apache.asterix.external.dataset.adapter.NCFileSystemAdapter"
>>>>> (("path"=
>>>>> "130.149.249.60:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi27-pid0.adm,
>>>>> 130.149.249.53:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi26-pid1.adm,
>>>>> 130.149.249.54:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi25-pid2.adm,
>>>>> 130.149.249.55:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi24-pid3.adm,
>>>>> 130.149.249.56:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi23-pid4.adm,
>>>>> 130.149.249.57:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi22-pid5.adm,
>>>>> 130.149.249.58:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi21-pid6.adm,
>>>>> 130.149.249.59:///data/seok.kim/spatial-index-experiment/files/SyntheticTweetsRectangleHouse200M-psi20-pid7.adm"),
>>>>> ("format"="adm"));
>>>>>
>>>>> The above AQL loads 8 adm files into a single dataset named Tweets.
>>>>>
>>>>> Cheers,
>>>>> Young-Seok
>>>>>
>>>>> On Fri, Mar 4, 2016 at 12:19 PM, Xikui Wang <xikuiw@uci.edu> wrote:
>>>>>
>>>>>> Hi Yingyi,
>>>>>>
>>>>>> Thanks for your reply. I think the external dataset with scan query is a
>>>>>> good solution. I will try that. Thank you.
>>>>>>
>>>>>> Best,
>>>>>> Xikui
>>>>>>
>>>>>> On Fri, Mar 4, 2016 at 11:53 AM, Yingyi Bu <buyingyi@gmail.com> wrote:
>>>>>>
>>>>>>> Xikui,
>>>>>>>
>>>>>>> If the number of localfs files is too large, a solution could be to put
>>>>>>> your files on HDFS and then load them. Loading from HDFS always has a
>>>>>>> fixed degree of parallelism regardless of the number of files.
>>>>>>>
>>>>>>>> I am wondering whether there is a way to append an adm file to an
>>>>>>>> existing dataset.
>>>>>>>
>>>>>>> You can create an external dataset and then write an insert statement
>>>>>>> where the body is a scan query. AsterixDB doesn't load any data into its
>>>>>>> own storage for an external dataset but just keeps file paths.
>>>>>>> Here is a manual for external datasets:
>>>>>>> https://ci.apache.org/projects/asterixdb/aql/externaldata.html
>>>>>>>
>>>>>>> Best,
>>>>>>> Yingyi
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 4, 2016 at 11:47 AM, Xikui Wang <xikuiw@uci.edu> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I want to import data from multiple adm files into the same dataset.
>>>>>>>> Merging them together and then loading from localfs can be a viable
>>>>>>>> solution, but this may become a problem when the number of files becomes
>>>>>>>> too large. I am wondering whether there is a way to append an adm file
>>>>>>>> to an existing dataset.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Xikui
>>>>>>>>
>>>>>>>>
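
The scheduling idea raised in the thread above (cap the load parallelism at
min(number of partitions, number of input files) and let each leaf of the load
job handle a delegated list of files) can be sketched roughly as follows. This
is a minimal illustration in Python, not AsterixDB code: the function name
assign_files, the num_partitions parameter, and the round-robin split are all
assumptions made for the example. The thread does not say how files would
actually be divided among the leaves.

```python
from typing import List


def assign_files(files: List[str], num_partitions: int) -> List[List[str]]:
    """Split input files across at most num_partitions loader leaves.

    Hypothetical helper for illustration only; not part of AsterixDB.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not files:
        return []
    # Degree of parallelism as proposed in the thread:
    # min(number of partitions, number of input files).
    num_loaders = min(num_partitions, len(files))
    assignments: List[List[str]] = [[] for _ in range(num_loaders)]
    # Round-robin distribution is an assumption of this sketch.
    for i, path in enumerate(files):
        assignments[i % num_loaders].append(path)
    return assignments


if __name__ == "__main__":
    # 1000 placeholder file names spread over 8 partitions.
    files = [f"/data/part-{i}.adm" for i in range(1000)]
    plan = assign_files(files, num_partitions=8)
    print(len(plan), [len(p) for p in plan])
```

With this kind of assignment, 1000 local files would be read by 8 loaders, each
working through its own list, rather than by 1000 loaders at once.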

