asterixdb-dev mailing list archives

From sandraskarshaug@gmail.com <sandraskarsh...@gmail.com>
Subject Re: Build UDF project (Maven) with large model to deploy to AsterixDB [Error assembling jar]
Date Wed, 28 Nov 2018 09:50:53 GMT
Hi, thanks again Xikui!

I am trying the latter option now – dropping the dependency jars into the /repo folder.
Does it matter where I copy the dependency jars from?

In addition, I think I should provide some context about my locally running instance of AsterixDB:
- I have cloned the asterixdb repo from GitHub, so I have it locally on my MacBook Pro.
- Inside the cloned folder, under asterixdb/asterixdb/asterix-server/target/asterix-server-0.9.5-SNAPSHOT-binary-assembly,
there is a folder called apache-asterixdb-0.9.5-SNAPSHOT, which in turn contains
the folders bin, etc, lib, opt and repo.
- It is inside _this_ repo folder that I am putting the dependency jars.
- It is from this /opt/local/bin folder that I am running sh start-sample-cluster.sh

So, when following the option 2 example provided in your link [1], it says to repack this
folder into a zip again. I don't quite get this, as this is the folder I am using to run AsterixDB?
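
One way to check whether the jars dropped into the repo folder actually contain the class named in a java.lang.ClassNotFoundException is to scan the folder directly. The following is a minimal sketch, not part of AsterixDB or the UDF template: the class name FindClassInJars and the helper jarsContaining are hypothetical names introduced here, and the sketch assumes the dependencies are plain (non-nested) jar files sitting directly in one directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper: lists which jars in a directory contain a given class.
public class FindClassInJars {

    /** Returns the jars directly inside dir that contain the fully qualified className. */
    public static List<Path> jarsContaining(Path dir, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entryName) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: FindClassInJars <jar-directory> <fully.qualified.ClassName>");
            return;
        }
        List<Path> hits = jarsContaining(Paths.get(args[0]), args[1]);
        if (hits.isEmpty()) {
            System.out.println("No jar in " + args[0] + " contains " + args[1]);
        } else {
            hits.forEach(jar -> System.out.println("Found in " + jar));
        }
    }
}
```

Running it with the repo folder as the first argument and the class from the exception message as the second shows whether that class is present there at all. It does not show whether AsterixDB puts that folder on the UDF's classpath.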


Thanks in advance!

Best regards,
Sandra
 

On 2018/11/27 16:38:23, Xikui Wang <xikuiw@uci.edu> wrote: 
> The configuration seems alright, but it's very hard to say where the
> problem is since I haven't had the chance to see what is exactly in your
> lib directory. If this packaging doesn't work for you, you can try to pack
> the dependencies into the UDF jar as a single fat jar, or you can drop the
> dependency jars into the "asterix-server-0.9.*-binary-assembly/repo" directory,
> so they can be distributed with the AsterixDB instance. I would recommend
> the latter method, as you don't have to redeploy the dependency jars every
> time when a UDF changes. These two methods are described in the
> documentation of the UDF template repo [1]. :)
> 
> [1] https://github.com/idleft/asterix-udf-template
> 
> Best,
> Xikui
> 
> On Tue, Nov 27, 2018 at 6:04 AM sandraskarshaug@gmail.com <
> sandraskarshaug@gmail.com> wrote:
> 
> > > Thank you for making sense of the log file for me, I managed to get the
> > > parameters to work!
> > >
> > > However, a new challenge became evident, of course. The new error that I
> > > am seeing is a java.lang.ClassNotFoundException in the cc.log when trying
> > > to use one of the dependencies in my code. I think this may be happening
> > > due to the external dependency, and whether it is reachable or not from my
> > > UDF when running locally on AsterixDB. Could you explain whether my
> > > approach for including external dependencies is right or not
> > > (approach/steps listed below)?
> >
> > 1. The binary-assembly-libzip.xml looks like this, where the dependencies
> > are included at the bottom:
> >
> > <assembly>
> >   <id>testlib</id>
> >   <formats>
> >     <format>zip</format>
> >   </formats>
> >   <includeBaseDirectory>false</includeBaseDirectory>
> >   <fileSets>
> >     <fileSet>
> >       <directory>target</directory>
> >       <outputDirectory/>
> >       <includes>
> >         <include>*.jar</include>
> >       </includes>
> >     </fileSet>
> >     <fileSet>
> >       <directory>src/main/resources</directory>
> >       <outputDirectory/>
> >       <includes>
> >         <include>library_descriptor.xml</include>
> >       </includes>
> >     </fileSet>
> >   </fileSets>
> >   <dependencySets>
> >     <dependencySet>
> >       <includes>
> >         <include>commons-io:commons-io</include>
> >         <include>ch.qos.logback:logback-core</include>
> >         <include>org.slf4j:slf4j-api</include>
> >         <include>ch.qos.logback:logback-classic</include>
> >         <include>org.deeplearning4j:deeplearning4j-core</include>
> >         <include>org.deeplearning4j:deeplearning4j-modelimport</include>
> >         <include>org.deeplearning4j:deeplearning4j-nlp</include>
> >         <include>org.nd4j:nd4j-api</include>
> >         <include>org.nd4j:nd4j-native</include>
> >       </includes>
> >       <unpack>false</unpack>
> >       <outputDirectory>lib</outputDirectory>
> >     </dependencySet>
> >   </dependencySets>
> > </assembly>
> >
> > 2. When the Maven project is built (mvn clean install), it generates files
> > in /target:
> > - asterix-udf-template-0.1-SNAPSHOT-testlib.zip
> > - asterix-udf-template-0.1-SNAPSHOT.jar
> > - archive-tmp
> > - classes
> > - generated-sources
> > - maven-archiver
> > - maven-status
> >
> > 3. When unzipping the uppermost file (testlib), it contains:
> > > - lib (directory containing .jars for my dependencies listed above in
> > > binary-assembly-libzip.xml)
> > - library_descriptor.xml
> > - asterix-udf-template-0.1-SNAPSHOT.jar
> >
> > 4. And when unzipping the bottommost .jar inside the testlib here, it
> > contains:
> > - my model (model.bin.gz)
> > - library_descriptor.xml
> > - META-INF
> > - org.apache.asterix.external
> > ----> contains my classes
> >
> > Does this look right?
> >
> > I appreciate your help!
> >
> > Best regards,
> > Sandra
> >
> > On 2018/11/27 06:38:58, Xikui Wang <xikuiw@uci.edu> wrote:
> > > Hi Sandra,
> > >
> > > > Based on the log, it seems you have an IndexOutOfBoundsException in your
> > > > UDF code. Can you double check your UDF at
> > > > org.apache.asterix.external.library.RelevanceDetecterFunction.initialize(RelevanceDetecterFunction.java:33)
> > > > and your UDF configuration file? You will have to make sure the parameters
> > > > are specified properly in the config file, and they are properly accessed
> > > > in the initialize method.
> > >
> > > Best,
> > > Xikui
> > >
> > > On Mon, Nov 26, 2018 at 1:33 PM sandraskarshaug@gmail.com <
> > > sandraskarshaug@gmail.com> wrote:
> > >
> > > > Hi Xikui!
> > > >
> > > > > So I tried to add the resource as a parameter. However, I get this error
> > > > > (gist with log from cc.log) [1] when the query below is executed:
> > > >
> > > > USE feeds;
> > > > CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
> > > > APPLY function testlib#detectRelevance; start feed TestSocketFeed
> > > >
> > > > To provide some context, this query works as it should when I don't
> > > > include the model.
> > > >
> > > > [1]
> > https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb
> > > >
> > > > Best regards,
> > > > Sandra
> > > >
> > > >
> > > >
> > > > On 2018/11/26 05:45:03, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > Hi Sandra,
> > > > >
> > > > > > Here is an example for adding parameters to a UDF [1]. As you can see,
> > > > > > the function "KeywordsDetectorFactory" reads a given list path from a
> > > > > > UDF parameter. You can use this to reuse a Java function with different
> > > > > > resource files. This function is contained in the AsterixDB release as
> > > > > > well. Please make sure the path to the resource file is correct when you
> > > > > > use it. That's a tricky part where I always make mistakes.
> > > > > >
> > > > > > The initialize(), i.e. the model loading, is executed when the "start
> > > > > > feed" statement is executed. This doesn't require Tweets to come. Is
> > > > > > that the case you are referring to?
> > > > > >
> > > > > > As for your use case, here is an interesting thing that you can try.
> > > > > > There is a feature in the data feeds which is currently not in our
> > > > > > documentation, which is to allow you to filter out incoming data by
> > > > > > query predicates. If you want to filter out Tweets with the model file
> > > > > > that you trained, you can attach a Java UDF on your ingestion pipeline
> > > > > > with the following query:
> > > > >
> > > > > use test;
> > > > > create type InputRecordType as closed {
> > > > > id:int64,
> > > > > fname:string,
> > > > > lname:string,
> > > > > age:int64,
> > > > > dept:string
> > > > > };
> > > > > create dataset EmpDataset(InputRecordType) primary key id;
> > > > > create feed UserFeed with {
> > > > >     "adapter-name" : "socket_adapter",
> > > > >     "sockets" : "127.0.0.1:10001",
> > > > >     "address-type" : "IP",
> > > > >     "type-name" : "InputRecordType",
> > > > >     "format" : "delimited-text",
> > > > >     "delimiter" : "|",
> > > > >     "upsert-feed" : "true"
> > > > > };
> > > > > > connect feed UserFeed to dataset EmpDataset WHERE
> > > > > > testlib#wordDetector(fname) = TRUE;
> > > > > start feed UserFeed;
> > > > >
> > > > > > The Java UDF used here is in [2]. This can help you filter out unwanted
> > > > > > incoming data on the pipeline. :)
> > > > > >
> > > > > > [1] https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> > > > > >
> > > > > > [2] https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> > > > >
> > > > > Best,
> > > > > Xikui
> > > > >
> > > > > On Sun, Nov 25, 2018 at 1:05 PM sandraskarshaug@gmail.com <
> > > > > sandraskarshaug@gmail.com> wrote:
> > > > >
> > > > > > Hi Xikui,
> > > > > >
> > > > > > Thanks for your response!
> > > > > > > We managed to cope with the problem by using the compressed version
> > > > > > > of the model instead, but it is still 1.6 GB. However, the project is
> > > > > > > able to build now :-) Yes, this is being packed into the UDF jar at
> > > > > > > the moment. Do you have any examples that illustrate how to use the
> > > > > > > resource file path as a UDF parameter? That would be very helpful!
> > > > > > >
> > > > > > > In addition, I believe that the model loading – which is now being
> > > > > > > executed during initialize() – prevents the incoming tweets from being
> > > > > > > processed. This is evident because none of the streaming elements are
> > > > > > > stored in AsterixDB when the model loading is included in the code,
> > > > > > > whilst the elements are stored when I exclude the model loading from
> > > > > > > the code. Is it possible to make the model load, i.e. making
> > > > > > > initialize() run, prior to the arrival of the tweets at the socket
> > > > > > > feed?
> > > > > > >
> > > > > > > Regarding our project, we are trying to detect tweets which are
> > > > > > > relevant for a given "user query", where the goal is crisis detection.
> > > > > > > So we are trying to filter out (i.e. _not_ store or keep in the
> > > > > > > pipeline) tweets which do not contain the relevant location etc. The
> > > > > > > model I've talked about is being used for word embeddings (word2vec)
> > > > > > > :-)
> > > > > >
> > > > > > Best regards,
> > > > > > Sandra Skarshaug
> > > > > >
> > > > > >
> > > > > > On 2018/11/24 17:55:27, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > > > Hi Sandra,
> > > > > > >
> > > > > > > > How big is the model file that you are using? I guess you are trying
> > > > > > > > to pack this model file into the UDF jar? I personally haven't seen
> > > > > > > > this error before. It feels like an issue with Maven building big
> > > > > > > > files. I found this thread on StackOverflow which describes a similar
> > > > > > > > situation. Could you try the resolutions there?
> > > > > > > >
> > > > > > > > As a side note, if you need to use a big model file in a UDF, I
> > > > > > > > wouldn't suggest you pack that into your UDF jar file. This will
> > > > > > > > significantly slow down your UDF installation, and you will spend a
> > > > > > > > lot of time redeploying the resource file to the cluster if you only
> > > > > > > > need to update the UDF code. Alternatively, you could make the
> > > > > > > > resource file path a UDF parameter, and let the UDF load that file
> > > > > > > > when it initializes. This could make the installation much faster and
> > > > > > > > avoid deploying the resource file multiple times, and the packing
> > > > > > > > issue should be gone as well. :)
> > > > > > > >
> > > > > > > > PS If it's ok, could you tell us which use case you are working on?
> > > > > > > > We would like to know how our customers use AsterixDB in different
> > > > > > > > scenarios, so we can help them (you) better!
> > > > > > >
> > > > > > > Best,
> > > > > > > Xikui
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Sat, Nov 24, 2018 at 6:05 AM sandraskarshaug@gmail.com <
> > > > > > > > sandraskarshaug@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > > My master's thesis partner and I have added a model for word
> > > > > > > > > embeddings (word2vec) in our project which is quite large. This is
> > > > > > > > > supposed to be loaded in the initialize phase of the UDF and be
> > > > > > > > > used for evaluating the incoming records.
> > > > > > > > >
> > > > > > > > > However, when trying to build the Maven project before deploying it
> > > > > > > > > to AsterixDB, we get the error "Error assembling JAR, invalid entry
> > > > > > > > > size". Is this a problem anyone else has faced when for instance
> > > > > > > > > using machine learning models in AsterixDB?
> > > > > > > >
> > > > > > > > If so, we appreciate any help!
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Sandra
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 
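
The IndexOutOfBoundsException in initialize() discussed earlier in the thread is the kind of failure that a small guard around parameter access can turn into a readable message. The following is a minimal sketch under stated assumptions: it assumes the UDF parameters arrive as a List<String> (as the template's example function appears to use them), and the class UdfParameters and method requireReadablePath are hypothetical names introduced here, not part of the AsterixDB API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for validating a resource-path parameter before loading a model.
public class UdfParameters {

    /**
     * Returns the parameter at the given index as a path to a readable file,
     * or throws IllegalArgumentException with a message that names the problem.
     */
    public static Path requireReadablePath(List<String> parameters, int index) {
        if (parameters == null || index < 0 || index >= parameters.size()) {
            int count = (parameters == null) ? 0 : parameters.size();
            throw new IllegalArgumentException(
                "Expected a parameter at index " + index + " but only " + count + " were configured");
        }
        // Trim stray whitespace that can come from the surrounding XML formatting.
        Path path = Paths.get(parameters.get(index).trim());
        if (!Files.isReadable(path)) {
            throw new IllegalArgumentException("Resource file is not readable: " + path.toAbsolutePath());
        }
        return path;
    }

    public static void main(String[] args) {
        // Demonstration with command-line arguments standing in for UDF parameters.
        try {
            Path model = requireReadablePath(Arrays.asList(args), 0);
            System.out.println("Would load model from " + model);
        } catch (IllegalArgumentException e) {
            System.out.println("Configuration problem: " + e.getMessage());
        }
    }
}
```

Inside a UDF, the same check could run at the start of initialize(), so that a missing or mistyped path shows up in the log as a configuration problem rather than as an index error.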
