asterixdb-dev mailing list archives

From Xikui Wang <xik...@uci.edu>
Subject Re: Build UDF project (Maven) with large model to deploy to AsterixDB [Error assembling jar]
Date Wed, 28 Nov 2018 17:03:31 GMT
Hi Sandra,

If you are following the binary-assembly-libzip.xml that you showed to me
earlier, the specified dependency jars should be under the lib directory in
your compiled UDF package, i.e., "- lib (directory containing .jars for my
dependencies listed above in binary-assembly-libzip.xml)". You can copy all
the jar files in this directory to the repo directory in AsterixDB. That
would work. As for the repacking part, that was for those who want to
distribute their patched AsterixDB to their users. In your case, you can
ignore that.
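
If you would rather script the copy step than do it by hand, here is a minimal
sketch in plain Java. It is not part of AsterixDB or the UDF template; the
class name RepoJarCopier and the method copyJars are made up for illustration,
and both directory paths are whatever you pass in. It assumes the jars sit
directly under the unzipped lib directory, and it leaves any jar that already
exists in repo untouched, which is a cautious choice of this sketch rather
than a documented requirement.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not an AsterixDB class.
final class RepoJarCopier {

    // Copies every *.jar found directly under libDir into repoDir.
    // Jars that already exist in repoDir are skipped, not overwritten.
    // Returns the sorted names of the jars that were actually copied.
    static List<String> copyJars(Path libDir, Path repoDir) throws IOException {
        if (!Files.isDirectory(repoDir)) {
            // Fail loudly on a wrong repo path instead of creating a new folder.
            throw new NotDirectoryException(repoDir.toString());
        }
        List<String> copied = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                Path target = repoDir.resolve(jar.getFileName());
                if (Files.exists(target)) {
                    continue; // keep the jar that is already in repo
                }
                Files.copy(jar, target);
                copied.add(jar.getFileName().toString());
            }
        }
        Collections.sort(copied);
        return copied;
    }
}
```

Calling RepoJarCopier.copyJars with the unzipped lib directory and the repo
directory of your instance returns the names of the jars it added, so you can
see which ones were skipped because they were already there.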

Best,
Xikui

On Wed, Nov 28, 2018 at 1:51 AM sandraskarshaug@gmail.com <
sandraskarshaug@gmail.com> wrote:

> Hi, thanks again Xikui!
>
> I am trying the latter option now – dropping the dependency jars into the
> /repo folder. Does it matter where I copy the dependency jars from?
>
> In addition, I think I should provide some context of my locally run
> instance of AsterixDB:
> - I have cloned the asterixdb repo from GitHub, so I have it locally on my
> MacBook Pro.
> - Inside the cloned folder,
> asterixdb/asterixdb/asterix-server/target/asterix-server-0.9.5-SNAPSHOT-binary-assembly
> folder, there lies a folder called apache-asterixdb-0.9.5-SNAPSHOT, which
> in turn contains the folders bin, etc, lib, opt and repo.
> - It is inside _this_ repo folder I am putting the dependency jars.
> - It is from this /opt/local/bin folder I am running sh
> start-sample-cluster.sh
>
> So, when following the option 2 example provided in your link [1], it says
> to repack this folder into a zip again. I don't quite get this, as this is
> the folder I am using to run AsterixDB?
>
> Thanks in advance!
>
> Best regards,
> Sandra
>
>
> On 2018/11/27 16:38:23, Xikui Wang <xikuiw@uci.edu> wrote:
> > The configuration seems alright, but it's very hard to say where the
> > problem is since I haven't had the chance to see what exactly is in your
> > lib directory. If this packaging doesn't work for you, you can try to pack
> > the dependencies into the UDF jar as a single fat jar, or you can drop the
> > dependency jars into the "asterix-server-0.9.*-binary-assembly/repo"
> > directory, so they can be distributed with the AsterixDB instance. I would
> > recommend the latter method, as you don't have to redeploy the dependency
> > jars every time a UDF changes. These two methods are described in the
> > documentation of the UDF template repo [1]. :)
> >
> > [1] https://github.com/idleft/asterix-udf-template
> >
> > Best,
> > Xikui
> >
> > On Tue, Nov 27, 2018 at 6:04 AM sandraskarshaug@gmail.com <
> > sandraskarshaug@gmail.com> wrote:
> >
> > > Thank you for making sense of the log file for me, I managed to get the
> > > parameters to work!
> > >
> > > However, a new challenge became evident, of course. I am now seeing a
> > > java.lang.ClassNotFoundException in the cc.log when trying to use one of
> > > the dependencies in my code. I think this may be related to the external
> > > dependency, and whether or not it is reachable from my UDF when running
> > > locally on AsterixDB. Could you explain whether my approach for including
> > > external dependencies is right or not (approach/steps listed below)?
> > >
> > > 1. The binary-assembly-libzip.xml looks like this, where the dependencies
> > > are included at the bottom:
> > >
> > > <assembly>
> > >   <id>testlib</id>
> > >   <formats>
> > >     <format>zip</format>
> > >   </formats>
> > >   <includeBaseDirectory>false</includeBaseDirectory>
> > >   <fileSets>
> > >     <fileSet>
> > >       <directory>target</directory>
> > >       <outputDirectory/>
> > >       <includes>
> > >         <include>*.jar</include>
> > >       </includes>
> > >     </fileSet>
> > >     <fileSet>
> > >       <directory>src/main/resources</directory>
> > >       <outputDirectory/>
> > >       <includes>
> > >         <include>library_descriptor.xml</include>
> > >       </includes>
> > >     </fileSet>
> > >   </fileSets>
> > >   <dependencySets>
> > >     <dependencySet>
> > >       <includes>
> > >         <include>commons-io:commons-io</include>
> > >         <include>ch.qos.logback:logback-core</include>
> > >         <include>org.slf4j:slf4j-api</include>
> > >         <include>ch.qos.logback:logback-classic</include>
> > >         <include>org.deeplearning4j:deeplearning4j-core</include>
> > >         <include>org.deeplearning4j:deeplearning4j-modelimport</include>
> > >         <include>org.deeplearning4j:deeplearning4j-nlp</include>
> > >         <include>org.nd4j:nd4j-api</include>
> > >         <include>org.nd4j:nd4j-native</include>
> > >       </includes>
> > >       <unpack>false</unpack>
> > >       <outputDirectory>lib</outputDirectory>
> > >     </dependencySet>
> > >   </dependencySets>
> > > </assembly>
> > >
> > > 2. When the Maven project is built (mvn clean install), it generates
> > > these files in /target:
> > > - asterix-udf-template-0.1-SNAPSHOT-testlib.zip
> > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > - archive-tmp
> > > - classes
> > > - generated-sources
> > > - maven-archiver
> > > - maven-status
> > >
> > > 3. When unzipping the uppermost file (testlib), it contains:
> > > - lib (directory containing .jars for my dependencies listed above in
> > > binary-assembly-libzip.xml)
> > > - library_descriptor.xml
> > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > >
> > > 4. And when unzipping the bottommost .jar inside the testlib here, it
> > > contains:
> > > - my model (model.bin.gz)
> > > - library_descriptor.xml
> > > - META-INF
> > > - org.apache.asterix.external
> > > ----> contains my classes
> > >
> > > Does this look right?
> > >
> > > I appreciate your help!
> > >
> > > Best regards,
> > > Sandra
> > >
> > > On 2018/11/27 06:38:58, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > Hi Sandra,
> > > >
> > > > Based on the log, it seems you have an IndexOutOfBoundsException in
> > > > your UDF code. Can you double check your UDF at
> > > > org.apache.asterix.external.library.RelevanceDetecterFunction.initialize(RelevanceDetecterFunction.java:33)
> > > > and your UDF configuration file? You will have to make sure the
> > > > parameters are specified properly in the config file, and that they are
> > > > properly accessed in the initialize method.
> > > >
> > > > Best,
> > > > Xikui
> > > >
> > > > On Mon, Nov 26, 2018 at 1:33 PM sandraskarshaug@gmail.com <
> > > > sandraskarshaug@gmail.com> wrote:
> > > >
> > > > > Hi Xikui!
> > > > >
> > > > > So I tried to add the resource as a parameter. However, I get this
> > > > > error (gist with log from cc.log) [1] when the query below is executed:
> > > > >
> > > > > USE feeds;
> > > > > CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
> > > > > APPLY function testlib#detectRelevance;
> > > > > start feed TestSocketFeed;
> > > > >
> > > > > To provide some context, this query works as it should when I don't
> > > > > include the model.
> > > > >
> > > > > [1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb
> > > > >
> > > > > Best regards,
> > > > > Sandra
> > > > >
> > > > >
> > > > >
> > > > > On 2018/11/26 05:45:03, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > > Hi Sandra,
> > > > > >
> > > > > > Here is an example for adding parameters to a UDF [1]. As you can
> > > > > > see, the function "KeywordsDetectorFactory" reads a given list path
> > > > > > from a UDF parameter. You can use this to reuse a Java function with
> > > > > > different resource files. This function is contained in the AsterixDB
> > > > > > release as well. Please make sure the path to the resource file is
> > > > > > correct when you use it. That's a tricky part where I always make
> > > > > > mistakes.
> > > > > >
> > > > > > The initialize(), i.e. the model loading, is executed when the "start
> > > > > > feed" statement is executed. This doesn't require Tweets to come. Is
> > > > > > that the case you are referring to?
> > > > > >
> > > > > > As for your use case, here is an interesting thing that you can try.
> > > > > > There is a feature in the data feeds, currently not in our
> > > > > > documentation, which allows you to filter out incoming data by query
> > > > > > predicates. If you want to filter out Tweets with the model file that
> > > > > > you trained, you can attach a Java UDF on your ingestion pipeline with
> > > > > > the following query:
> > > > > >
> > > > > > use test;
> > > > > > create type InputRecordType as closed {
> > > > > > id:int64,
> > > > > > fname:string,
> > > > > > lname:string,
> > > > > > age:int64,
> > > > > > dept:string
> > > > > > };
> > > > > > create dataset EmpDataset(InputRecordType) primary key id;
> > > > > > create feed UserFeed with {
> > > > > >     "adapter-name" : "socket_adapter",
> > > > > >     "sockets" : "127.0.0.1:10001",
> > > > > >     "address-type" : "IP",
> > > > > >     "type-name" : "InputRecordType",
> > > > > >     "format" : "delimited-text",
> > > > > >     "delimiter" : "|",
> > > > > >     "upsert-feed" : "true"
> > > > > > };
> > > > > > connect feed UserFeed to dataset EmpDataset WHERE
> > > > > > testlib#wordDetector(fname) = TRUE;
> > > > > > start feed UserFeed;
> > > > > >
> > > > > > The Java UDF used here is in [2]. This can help you filter out
> > > > > > unwanted incoming data on the pipeline. :)
> > > > > >
> > > > > > [1] https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> > > > > >
> > > > > > [2] https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> > > > > >
> > > > > > Best,
> > > > > > Xikui
> > > > > >
> > > > > > On Sun, Nov 25, 2018 at 1:05 PM sandraskarshaug@gmail.com <
> > > > > > sandraskarshaug@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Xikui,
> > > > > > >
> > > > > > > Thanks for your response!
> > > > > > > We managed to cope with the problem by using the compressed version
> > > > > > > of the model instead, but it is still 1.6 GB. However, the project
> > > > > > > is able to build now :-) Yes, this is being packed into the UDF jar
> > > > > > > at the moment. Do you have any examples that illustrate how to use
> > > > > > > the resource file path as a UDF parameter? That would be very
> > > > > > > helpful!
> > > > > > >
> > > > > > > In addition, I believe that the model loading – which is now being
> > > > > > > executed during initialize() – prevents the incoming tweets from
> > > > > > > being processed. This is evident because none of the streaming
> > > > > > > elements are stored in AsterixDB when the model loading is included
> > > > > > > in the code, whilst the elements are stored when I exclude the model
> > > > > > > loading from the code. Is it possible to make the model load, i.e.
> > > > > > > make initialize() run, prior to the arrival of the tweets at the
> > > > > > > socket feed?
> > > > > > >
> > > > > > > Regarding our project, we are trying to detect tweets which are
> > > > > > > relevant for a given "user query", where the goal is crisis
> > > > > > > detection. So we are trying to filter out (i.e. _not_ store or keep
> > > > > > > in the pipeline) tweets which do not contain the relevant location
> > > > > > > etc. The model I've talked about is being used for word embeddings
> > > > > > > (word2vec) :-)
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Sandra Skarshaug
> > > > > > >
> > > > > > >
> > > > > > > On 2018/11/24 17:55:27, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > > > > Hi Sandra,
> > > > > > > >
> > > > > > > > How big is the model file that you are using? I guess you are
> > > > > > > > trying to pack this model file into the UDF jar? I personally
> > > > > > > > haven't seen this error before. It feels like an issue with Maven
> > > > > > > > building big files. I found this thread on StackOverflow which
> > > > > > > > describes a similar situation. Could you try the resolutions
> > > > > > > > there?
> > > > > > > >
> > > > > > > > As a side note, if you need to use a big model file in a UDF, I
> > > > > > > > wouldn't suggest you pack that into your UDF jar file. This will
> > > > > > > > significantly slow down your UDF installation, and you will spend
> > > > > > > > a lot of time redeploying the resource file to the cluster even if
> > > > > > > > you only need to update the UDF code. Alternatively, you could
> > > > > > > > make the resource file path a UDF parameter, and let the UDF load
> > > > > > > > that file when it initializes. This could make the installation
> > > > > > > > much faster and avoid deploying the resource file multiple times,
> > > > > > > > and the packing issue should be gone as well. :)
> > > > > > > >
> > > > > > > > PS If it's ok, could you tell us which use case you are working
> > > > > > > > on? We would like to know how our customers use AsterixDB in
> > > > > > > > different scenarios, so we can help them (you) better!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Xikui
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Nov 24, 2018 at 6:05 AM sandraskarshaug@gmail.com <
> > > > > > > > sandraskarshaug@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi!
> > > > > > > > >
> > > > > > > > > My master thesis partner and I have added a model for word
> > > > > > > > > embeddings (word2vec) in our project which is quite large. This
> > > > > > > > > is supposed to be loaded in the initialize phase of the UDF and
> > > > > > > > > be used for evaluating the incoming records.
> > > > > > > > >
> > > > > > > > > However, when trying to build the Maven project before deploying
> > > > > > > > > it to AsterixDB, we get the error "Error assembling JAR, invalid
> > > > > > > > > entry size". Is this a problem anyone else has faced when, for
> > > > > > > > > instance, using machine learning models in AsterixDB?
> > > > > > > > >
> > > > > > > > > If so, we appreciate any help!
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Sandra
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
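
On the IndexOutOfBoundsException in initialize() and the advice in the thread
to double check the resource path: one way to make both failures easier to
read in cc.log is to validate the parameter list before touching the model
file. The following is only a sketch in plain Java. UdfParams and
requirePathParam are hypothetical names, not AsterixDB APIs, and it assumes
the UDF parameters from library_descriptor.xml reach your code as a
List<String>; how you obtain that list inside initialize() is not shown here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not an AsterixDB class.
final class UdfParams {

    // Returns the parameter at the given index as an absolute, readable path.
    // Throws IllegalArgumentException with a descriptive message when the
    // parameter is missing or the file cannot be read, instead of letting a
    // bare IndexOutOfBoundsException surface from params.get(index).
    static Path requirePathParam(List<String> params, int index, String what) {
        int count = (params == null) ? 0 : params.size();
        if (index < 0 || index >= count) {
            throw new IllegalArgumentException(
                    "Missing UDF parameter #" + index + " (" + what + "); got "
                            + count + " parameter(s)");
        }
        // Trim stray whitespace that can come from the descriptor file.
        Path path = Paths.get(params.get(index).trim()).toAbsolutePath();
        if (!Files.isReadable(path)) {
            throw new IllegalArgumentException(what + " is not readable at " + path);
        }
        return path;
    }
}
```

The returned absolute path shows exactly which file the UDF would try to load,
which helps when a relative path resolves against an unexpected working
directory.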
