asterixdb-dev mailing list archives

From sandraskarshaug@gmail.com <sandraskarsh...@gmail.com>
Subject Re: Build UDF project (Maven) with large model to deploy to AsterixDB [Error assembling jar]
Date Mon, 26 Nov 2018 21:31:29 GMT

Hi Xikui!

I tried passing the resource as a UDF parameter. However, I get an error (the log from
cc.log is in the gist [1]) when I execute the query below:

USE feeds;
CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
APPLY FUNCTION testlib#detectRelevance;
START FEED TestSocketFeed;

To provide some context, this query works as it should when I don't include the model. 

[1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb

Best regards,
Sandra



On 2018/11/26 05:45:03, Xikui Wang <xikuiw@uci.edu> wrote: 
> Hi Sandra,
> 
> Here is an example for adding parameters to a UDF [1]. As you can see, the
> function "KeywordsDetectorFactory" reads a given list path from a UDF
> parameter. You can use this to reuse a Java function with different
> resource files. This function is contained in the AsterixDB release as
> well. Please make sure the path to the resource file is correct when you
> use it. That's a tricky part where I always make mistakes.
> 
> The initialize() method, i.e. the model loading, is executed when the
> "start feed" statement is executed. It doesn't require any tweets to
> arrive. Is that the case you are referring to?
> 
> As for your use case, here is an interesting thing that you can try. Data
> feeds have a feature, currently not in our documentation, that lets you
> filter incoming data with query predicates. If you want to filter out
> Tweets using the model file that you trained, you can attach a Java UDF to
> your ingestion pipeline with the following query:
> 
> use test;
> create type InputRecordType as closed {
> id:int64,
> fname:string,
> lname:string,
> age:int64,
> dept:string
> };
> create dataset EmpDataset(InputRecordType) primary key id;
> create feed UserFeed with {
>     "adapter-name" : "socket_adapter",
>     "sockets" : "127.0.0.1:10001",
>     "address-type" : "IP",
>     "type-name" : "InputRecordType",
>     "format" : "delimited-text",
>     "delimiter" : "|",
>     "upsert-feed" : "true"
> };
> connect feed UserFeed to dataset EmpDataset WHERE
> testlib#wordDetector(fname) = TRUE;
> start feed UserFeed;
> 
> The Java UDF used here is in [2]. This can help you filter out unwanted
> incoming data on the pipeline. :)
> 
> [1]
> https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> 
> [2]
> https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> 
> Best,
> Xikui
> 
> On Sun, Nov 25, 2018 at 1:05 PM sandraskarshaug@gmail.com <
> sandraskarshaug@gmail.com> wrote:
> 
> > Hi Xikui,
> >
> > Thanks for your response!
> > We managed to cope with the problem by using the compressed version of the
> > model instead, but it is still 1.6 GB. However, the project is able to
> > build now :-) Yes, this is being packed into the UDF jar at the moment. Do
> > you have any examples that illustrate how to use the resource file path as
> > a UDF parameter? That would be very helpful!
> >
> > In addition, I believe that the model loading – which is now being
> > executed during initialize() – prevents the incoming tweets from being
> > processed. This is evident because none of the streaming elements are
> > stored in AsterixDB when the model loading is included in the code, whilst
> > the elements are stored when I exclude the model loading from the code. Is
> > it possible to make the model load, i.e., make initialize() run, prior to
> > the arrival of the tweets at the socket feed?
> >
> > Regarding our project, we are trying to detect tweets which are relevant
> > for a given "user query", where the goal is crisis detection. So we are
> > trying to filter out (i.e., _not_ store or keep in the pipeline) tweets which
> > do not contain the relevant location etc. The model I've talked about is
> > being used for word embeddings (word2vec) :-)
> >
> > Best regards,
> > Sandra Skarshaug
> >
> >
> > On 2018/11/24 17:55:27, Xikui Wang <xikuiw@uci.edu> wrote:
> > > Hi Sandra,
> > >
> > > How big is the model file that you are using? I guess you are trying to
> > > pack this model file into the UDF jar? I personally haven't seen this
> > > error before. It looks like a Maven issue with building large files. I
> > > found this thread on StackOverflow which describes a similar situation.
> > > Could you try the resolutions there?
> > >
> > > As a side note, if you need to use a big model file in a UDF, I wouldn't
> > > suggest packing it into your UDF jar file. Doing so will significantly
> > > slow down your UDF installation, and you will spend a lot of time
> > > redeploying the resource file to the cluster even if you only need to
> > > update the UDF code. Alternatively, you could make the resource file path
> > > a UDF parameter and let the UDF load that file when it initializes. This
> > > could make the installation much faster and avoid deploying the resource
> > > file multiple times, and the packing issue should be gone as well. :)
> > >
> > > PS If it's ok, could you tell us which use case you are working on? We
> > > would like to know how our customers use AsterixDB in different
> > > scenarios, so we can help them (you) better!
> > >
> > > Best,
> > > Xikui
> > >
> > >
> > >
> > > On Sat, Nov 24, 2018 at 6:05 AM sandraskarshaug@gmail.com <
> > > sandraskarshaug@gmail.com> wrote:
> > >
> > > > Hi!
> > > >
> > > > My master's thesis partner and I have added a quite large model for
> > > > word embeddings (word2vec) to our project. It is supposed to be loaded
> > > > in the initialize phase of the UDF and used for evaluating the incoming
> > > > records.
> > > >
> > > > However, when trying to build the Maven project before deploying it to
> > > > AsterixDB, we get the error "Error assembling JAR, invalid entry size".
> > > > Is this a problem anyone else has faced, for instance when using machine
> > > > learning models in AsterixDB?
> > > >
> > > > If so, we appreciate any help!
> > > >
> > > > Best regards,
> > > > Sandra
> > > >
> > >
> >
> 
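
The suggestion made in the thread, to keep the large resource out of the UDF jar and pass its path as a parameter that is read once during initialization, can be sketched roughly as follows. This is only an illustration in plain Java and deliberately does not use the AsterixDB UDF interfaces: the class name ResourceBackedFilter, the initialize(List<String>) and evaluate(String) signatures, and the one-keyword-per-line file format are all assumptions made for the example. The actual function template is the one linked in the thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a UDF: not an AsterixDB class or interface.
class ResourceBackedFilter {
    private Set<String> keywords = new HashSet<>();

    // Called once before any record is evaluated. The first parameter is
    // assumed to be the path of the resource file on the local file system.
    void initialize(List<String> parameters) throws IOException {
        if (parameters == null || parameters.isEmpty()) {
            throw new IllegalArgumentException("expected the resource path as first parameter");
        }
        Path resource = Paths.get(parameters.get(0));
        // Fail early with the absolute path, since a wrong path is an easy mistake.
        if (!Files.isReadable(resource)) {
            throw new IOException("cannot read resource file: " + resource.toAbsolutePath());
        }
        Set<String> loaded = new HashSet<>();
        for (String line : Files.readAllLines(resource, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                loaded.add(trimmed);
            }
        }
        keywords = loaded;
    }

    // Called per record; returns true when the word is in the loaded list.
    boolean evaluate(String word) {
        return word != null && keywords.contains(word);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary keyword file so the sketch runs on its own.
        Path tmp = Files.createTempFile("keywords", ".txt");
        Files.write(tmp, Arrays.asList("flood", "earthquake"), StandardCharsets.UTF_8);

        ResourceBackedFilter filter = new ResourceBackedFilter();
        filter.initialize(Collections.singletonList(tmp.toString()));
        System.out.println(filter.evaluate("flood"));
        System.out.println(filter.evaluate("sunshine"));
        Files.delete(tmp);
    }
}
```

With this split, changing the filter code only requires redeploying a small jar, while the resource file stays where it is on each node. Reporting the absolute path in the error message is a small design choice aimed at the path mistakes mentioned in the thread.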
