asterixdb-dev mailing list archives

From sandraskarshaug@gmail.com <sandraskarsh...@gmail.com>
Subject Re: Build UDF project (Maven) with large model to deploy to AsterixDB [Error assembling jar]
Date Sun, 25 Nov 2018 21:02:31 GMT
Hi Xikui,

Thanks for your response!
We worked around the problem by using the compressed version of the model instead. It is
still 1.6 GB, but the project builds now :-) Yes, the model is being packed into the UDF jar
at the moment. Do you have any examples that illustrate how to use the resource file path
as a UDF parameter? That would be very helpful!
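
For illustration only, here is a minimal plain-Java sketch of the idea in the reply quoted below: read the model from a file path on disk instead of from a resource packed into the jar. It uses no AsterixDB API. The class name ExternalModelLoader, the way the path reaches the UDF, and the file format (one "word v1 v2 ... vn" entry per line, no header line, optionally gzip-compressed) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: reads word vectors from a file path, not from the jar. */
final class ExternalModelLoader {

    private ExternalModelLoader() {}

    /**
     * Assumed format: one "word v1 v2 ... vn" entry per line, without a header
     * line. Files whose name ends in ".gz" are decompressed on the fly.
     */
    static Map<String, float[]> load(Path modelPath) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (InputStream raw = Files.newInputStream(modelPath);
             InputStream in = modelPath.toString().endsWith(".gz")
                     ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) {
                    continue; // skip blank or malformed lines
                }
                float[] vector = new float[parts.length - 1];
                for (int i = 1; i < parts.length; i++) {
                    vector[i - 1] = Float.parseFloat(parts[i]);
                }
                vectors.put(parts[0], vector);
            }
        }
        return vectors;
    }
}
```

With something like this, the UDF's initialize() would only need the path string (for example taken from a UDF parameter or a small config file), so the jar itself stays small.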

In addition, I believe that the model loading, which now runs during initialize(), prevents
the incoming tweets from being processed. The evidence is that none of the streaming
elements are stored in AsterixDB when the model loading is included in the code, whereas
they are stored when I exclude it. Is it possible to make the model load, i.e., make
initialize() run, before the tweets arrive at the socket feed?
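
One possible direction, sketched in plain Java without any AsterixDB API: keep the model in a process-wide holder that starts loading once, in the background, and is shared by every caller. The class name SharedModel and its methods are hypothetical. Whether this helps with the stalled feed depends on when AsterixDB actually calls initialize(), which this sketch does not address.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical holder: loads a value once and shares it with all callers. */
final class SharedModel<T> {
    private final Supplier<T> loader;
    private CompletableFuture<T> future; // guarded by "this"

    SharedModel(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Starts the load on a background thread on first call; later calls reuse it. */
    synchronized CompletableFuture<T> warmUp() {
        if (future == null) {
            future = CompletableFuture.supplyAsync(loader);
        }
        return future;
    }

    /** Blocks until the model is available, starting the load if needed. */
    T get() {
        return warmUp().join();
    }
}
```

In a UDF, a static SharedModel instance could be warmed up from initialize() and read with get() when a record is evaluated, so that initialize() itself returns quickly and the model is loaded only once per JVM.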

Regarding our project, we are trying to detect tweets that are relevant to a given "user
query", where the goal is crisis detection. So we are trying to filter out (i.e., _not_ store
or keep in the pipeline) tweets that do not contain the relevant location, etc. The model
I've talked about is used for word embeddings (word2vec) :-)
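
The message does not say how relevance is scored. As one common possibility with word embeddings, the sketch below averages the word vectors of a tweet and of the query and keeps the tweet only if their cosine similarity reaches a threshold. The class name RelevanceFilter, the whitespace tokenization, and the threshold are assumptions for illustration, not the project's actual method.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical relevance check based on averaged word vectors. */
final class RelevanceFilter {

    private RelevanceFilter() {}

    /** Averages the vectors of the known words; returns null if none is known. */
    static float[] average(String[] words, Map<String, float[]> vectors, int dim) {
        float[] sum = new float[dim];
        int known = 0;
        for (String word : words) {
            float[] v = vectors.get(word);
            if (v == null) {
                continue; // out-of-vocabulary word
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
            known++;
        }
        if (known == 0) {
            return null;
        }
        for (int i = 0; i < dim; i++) {
            sum[i] /= known;
        }
        return sum;
    }

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** True if the tweet's averaged vector is close enough to the query's. */
    static boolean isRelevant(String tweet, String query,
                              Map<String, float[]> vectors, int dim, double threshold) {
        float[] t = average(tweet.toLowerCase(Locale.ROOT).split("\\s+"), vectors, dim);
        float[] q = average(query.toLowerCase(Locale.ROOT).split("\\s+"), vectors, dim);
        return t != null && q != null && cosine(t, q) >= threshold;
    }
}
```

A tweet for which isRelevant returns false would be dropped rather than stored.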

Best regards,
Sandra Skarshaug
 

On 2018/11/24 17:55:27, Xikui Wang <xikuiw@uci.edu> wrote: 
> Hi Sandra,
> 
> How big is the model file that you are using? I guess you are trying to
> pack this model file into the UDF jar? I personally haven't seen this error
> before. It looks like a Maven issue with packaging large files. I found this
> thread on StackOverflow which describes a similar situation. Could you
> try the resolutions there?
> 
> As a side note, if you need to use a big model file in a UDF, I wouldn't
> suggest packing it into your UDF jar file. Doing so will
> significantly slow down your UDF installation, and you will spend a lot of
> time redeploying the resource file to the cluster even if you only need to
> update the UDF code. Alternatively, you could pass the resource file path
> as a UDF parameter, and let the UDF load that file when it initializes.
> This could make the installation much faster and avoid deploying the
> resource file multiple times, and the packing issue should be gone as well.
> :)
> 
> PS If it's ok, could you tell us which use case you are working on? We
> would like to know how our customers use AsterixDB in different scenarios,
> so we can help them (you) better!
> 
> Best,
> Xikui
> 
> 
> 
> On Sat, Nov 24, 2018 at 6:05 AM sandraskarshaug@gmail.com <
> sandraskarshaug@gmail.com> wrote:
> 
> > Hi!
> >
> > My master thesis partner and I have added a quite large model for word
> > embeddings (word2vec) to our project. It is supposed to be
> > loaded in the initialize phase of the UDF and used for evaluating the
> > incoming records.
> >
> > However, when trying to build the Maven project before deploying it to
> > AsterixDB, we get the error "Error assembling JAR, invalid entry size". Is
> > this a problem anyone else has faced, for instance when using machine
> > learning models in AsterixDB?
> >
> > If so, we appreciate any help!
> >
> > Best regards,
> > Sandra
> >
> 
