asterixdb-dev mailing list archives

From Xikui Wang <xik...@uci.edu>
Subject Re: Build UDF project (Maven) with large model to deploy to AsterixDB [Error assembling jar]
Date Thu, 29 Nov 2018 07:03:48 GMT
Yes. You can delete the lib folder since these dependencies are picked up
by AsterixDB from repo/. Having dependency jars in one of the two places
should be sufficient. :)
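
A minimal sketch of that copy step, assuming the unzipped UDF package keeps its dependency jars in a lib directory and the AsterixDB installation has a repo directory, as described in this thread. The class name, method name, and both default paths are hypothetical placeholders, not part of AsterixDB. Jars that already exist in repo/ are left untouched, so nothing shipped with AsterixDB is overwritten.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: copies dependency jars from a UDF package's lib/ into AsterixDB's repo/. */
public class CopyUdfDependencies {

    /** Copies every *.jar directly under libDir into repoDir, skipping jars already present. */
    public static int copyJars(Path libDir, Path repoDir) throws IOException {
        Files.createDirectories(repoDir);
        int copied = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                Path target = repoDir.resolve(jar.getFileName());
                if (Files.notExists(target)) {
                    Files.copy(jar, target);
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Both default paths are assumptions; pass the real locations as arguments.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "testlib/lib");
        Path repoDir = Paths.get(args.length > 1 ? args[1] : "apache-asterixdb-0.9.5-SNAPSHOT/repo");
        System.out.println("Copied " + copyJars(libDir, repoDir) + " jar(s) to " + repoDir);
    }
}
```

Once the jars are in repo/, the lib folder inside the deployed UDF package is redundant and can be removed.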

Best,
Xikui

On Wed, Nov 28, 2018 at 10:16 AM sandraskarshaug@gmail.com <
sandraskarshaug@gmail.com> wrote:

> Hi Xikui,
>
> So when deploying my UDF to AsterixDB, I've put the content of the
> unzipped testlib folder into this folder:
> apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/
>
> The resulting testlib content then looks like this:
> - library_descriptor.xml
> - asterix-udf-template-0.1-SNAPSHOT.jar
> - lib (folder with external dependencies)
>
> However, since the dependencies from this /lib folder ought to be copied
> into apache-asterixdb-0.9.5-SNAPSHOT/repo instead, should I delete the
> apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/lib folder, which is
> created when dropping the unzipped UDF package inside testlib, or keep the
> dependencies both there and in /repo?
>
> Thanks!
>
>
> On 2018/11/28 17:03:31, Xikui Wang <xikuiw@uci.edu> wrote:
> > Hi Sandra,
> >
> > If you are following the binary-assembly-libzip.xml that you showed me
> > earlier, the specified dependency jars should be under the lib directory
> > in your compiled UDF package, i.e., "- lib (directory containing .jars
> > for my dependencies listed above in binary-assembly-libzip.xml)". You can
> > copy all the jar files in this directory to the repo directory in
> > AsterixDB. That would work. As for the repacking part, that was for those
> > who want to distribute their patched AsterixDB to their users. In your
> > case, you can ignore that.
> >
> > Best,
> > Xikui
> >
> > On Wed, Nov 28, 2018 at 1:51 AM sandraskarshaug@gmail.com <
> > sandraskarshaug@gmail.com> wrote:
> >
> > > Hi, thanks again Xikui!
> > >
> > > I am trying the latter option now: dropping the dependency jars into
> > > the /repo folder. Does it matter where I copy the dependency jars from?
> > >
> > > In addition, I think I should provide some context about my locally run
> > > instance of AsterixDB:
> > > - I have cloned the asterixdb repo from GitHub, so I have it locally on
> > > my MacBook Pro.
> > > - Inside the cloned folder, under
> > > asterixdb/asterixdb/asterix-server/target/asterix-server-0.9.5-SNAPSHOT-binary-assembly,
> > > there is a folder called apache-asterixdb-0.9.5-SNAPSHOT, which in turn
> > > contains the folders bin, etc, lib, opt and repo.
> > > - It is inside _this_ repo folder that I am putting the dependency jars.
> > > - It is from this /opt/local/bin folder that I am running sh
> > > start-sample-cluster.sh
> > >
> > > So, when following the option 2 example provided in your link [1], it
> > > says to repack this folder into a zip again. I don't quite get this, as
> > > this is the folder I am using to run AsterixDB?
> > >
> > > Thanks in advance!
> > >
> > > Best regards,
> > > Sandra
> > >
> > >
> > > On 2018/11/27 16:38:23, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > The configuration seems alright, but it's very hard to say where the
> > > > > problem is since I haven't had the chance to see exactly what is in
> > > > > your lib directory. If this packaging doesn't work for you, you can
> > > > > try to pack the dependencies into the UDF jar as a single fat jar,
> > > > > or you can drop the dependency jars into the
> > > > > "asterix-server-0.9.*-binary-assembly/repo" directory, so they can
> > > > > be distributed with the AsterixDB instance. I would recommend the
> > > > > latter method, as you don't have to redeploy the dependency jars
> > > > > every time a UDF changes. These two methods are described in the
> > > > > documentation of the UDF template repo [1]. :)
> > > >
> > > > [1] https://github.com/idleft/asterix-udf-template
> > > >
> > > > Best,
> > > > Xikui
> > > >
> > > > On Tue, Nov 27, 2018 at 6:04 AM sandraskarshaug@gmail.com <
> > > > sandraskarshaug@gmail.com> wrote:
> > > >
> > > > > > Thank you for making sense of the log file for me, I managed to get
> > > > > > the parameters to work!
> > > > > >
> > > > > > However, a new challenge became evident, of course. I am now seeing a
> > > > > > java.lang.ClassNotFoundException in the cc.log when trying to use one
> > > > > > of the dependencies in my code. I think this may be related to the
> > > > > > external dependency, and whether or not it is reachable from my UDF
> > > > > > when running locally on AsterixDB. Could you tell me whether my
> > > > > > approach for including external dependencies is right (steps listed
> > > > > > below)?
> > > > > >
> > > > > > 1. The binary-assembly-libzip.xml looks like this, where the
> > > > > > dependencies are included at the bottom:
> > > > >
> > > > > <assembly>
> > > > >   <id>testlib</id>
> > > > >   <formats>
> > > > >     <format>zip</format>
> > > > >   </formats>
> > > > >   <includeBaseDirectory>false</includeBaseDirectory>
> > > > >   <fileSets>
> > > > >     <fileSet>
> > > > >       <directory>target</directory>
> > > > >       <outputDirectory/>
> > > > >       <includes>
> > > > >         <include>*.jar</include>
> > > > >       </includes>
> > > > >     </fileSet>
> > > > >     <fileSet>
> > > > >       <directory>src/main/resources</directory>
> > > > >       <outputDirectory/>
> > > > >       <includes>
> > > > >         <include>library_descriptor.xml</include>
> > > > >       </includes>
> > > > >     </fileSet>
> > > > >   </fileSets>
> > > > >   <dependencySets>
> > > > >     <dependencySet>
> > > > >       <includes>
> > > > >         <include>commons-io:commons-io</include>
> > > > >         <include>ch.qos.logback:logback-core</include>
> > > > >         <include>org.slf4j:slf4j-api</include>
> > > > >         <include>ch.qos.logback:logback-classic</include>
> > > > >         <include>org.deeplearning4j:deeplearning4j-core</include>
> > > > >
> > >  <include>org.deeplearning4j:deeplearning4j-modelimport</include>
> > > > >         <include>org.deeplearning4j:deeplearning4j-nlp</include>
> > > > >         <include>org.nd4j:nd4j-api</include>
> > > > >         <include>org.nd4j:nd4j-native</include>
> > > > >       </includes>
> > > > >       <unpack>false</unpack>
> > > > >       <outputDirectory>lib</outputDirectory>
> > > > >     </dependencySet>
> > > > >   </dependencySets>
> > > > > </assembly>
> > > > >
> > > > > > 2. When the Maven project is built (mvn clean install), it generates
> > > > > > files in /target:
> > > > > - asterix-udf-template-0.1-SNAPSHOT-testlib.zip
> > > > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > > > - archive-tmp
> > > > > - classes
> > > > > - generated-sources
> > > > > - maven-archiver
> > > > > - maven-status
> > > > >
> > > > > > 3. When unzipping the uppermost file (testlib), it contains:
> > > > > > - lib (directory containing .jars for my dependencies listed above in
> > > > > > binary-assembly-libzip.xml)
> > > > > - library_descriptor.xml
> > > > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > > >
> > > > > > 4. And when unzipping the bottommost .jar inside the testlib here, it
> > > > > > contains:
> > > > > - my model (model.bin.gz)
> > > > > - library_descriptor.xml
> > > > > - META-INF
> > > > > - org.apache.asterix.external
> > > > > ----> contains my classes
> > > > >
> > > > > Does this look right?
> > > > >
> > > > > I appreciate your help!
> > > > >
> > > > > Best regards,
> > > > > Sandra
> > > > >
> > > > > On 2018/11/27 06:38:58, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > > Hi Sandra,
> > > > > >
> > > > > > > Based on the log, it seems you have an IndexOutOfBoundsException in
> > > > > > > your UDF code. Can you double check your UDF at
> > > > > > > org.apache.asterix.external.library.RelevanceDetecterFunction.initialize(RelevanceDetecterFunction.java:33)
> > > > > > > and your UDF configuration file? You will have to make sure the
> > > > > > > parameters are specified properly in the config file, and that they
> > > > > > > are properly accessed in the initialize method.
> > > > > >
> > > > > > Best,
> > > > > > Xikui
> > > > > >
> > > > > > On Mon, Nov 26, 2018 at 1:33 PM sandraskarshaug@gmail.com <
> > > > > > sandraskarshaug@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Xikui!
> > > > > > >
> > > > > > > > So I tried to add the resource as a parameter. However, I get
> > > > > > > > this error (gist with log from cc.log) [1] when the query below
> > > > > > > > is executed:
> > > > > > > >
> > > > > > > > USE feeds;
> > > > > > > > CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
> > > > > > > > APPLY function testlib#detectRelevance;
> > > > > > > > start feed TestSocketFeed
> > > > > > > >
> > > > > > > > To provide some context, this query works as it should when I
> > > > > > > > don't include the model.
> > > > > > > >
> > > > > > > > [1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Sandra
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On 2018/11/26 05:45:03, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > > > > Hi Sandra,
> > > > > > > >
> > > > > > > > > Here is an example for adding parameters to a UDF [1]. As you
> > > > > > > > > can see, the function "KeywordsDetectorFactory" reads a given list
> > > > > > > > > path from a UDF parameter. You can use this to reuse a Java
> > > > > > > > > function with different resource files. This function is contained
> > > > > > > > > in the AsterixDB release as well. Please make sure the path to the
> > > > > > > > > resource file is correct when you use it. That's a tricky part
> > > > > > > > > where I always make mistakes.
> > > > > > > > >
> > > > > > > > > The initialize() method, i.e. the model loading, is executed when
> > > > > > > > > the "start feed" statement is executed. It doesn't require any
> > > > > > > > > Tweets to arrive first. Is that the case you are referring to?
> > > > > > > > >
> > > > > > > > > As for your use case, here is an interesting thing that you can
> > > > > > > > > try. There is a feature in the data feeds, currently not in our
> > > > > > > > > documentation, that allows you to filter out incoming data by
> > > > > > > > > query predicates. If you want to filter out Tweets with the model
> > > > > > > > > file that you trained, you can attach a Java UDF on your ingestion
> > > > > > > > > pipeline with the following query:
> > > > > > > >
> > > > > > > > use test;
> > > > > > > > create type InputRecordType as closed {
> > > > > > > > id:int64,
> > > > > > > > fname:string,
> > > > > > > > lname:string,
> > > > > > > > age:int64,
> > > > > > > > dept:string
> > > > > > > > };
> > > > > > > > > create dataset EmpDataset(InputRecordType) primary key id;
> > > > > > > > create feed UserFeed with {
> > > > > > > >     "adapter-name" : "socket_adapter",
> > > > > > > >     "sockets" : "127.0.0.1:10001",
> > > > > > > >     "address-type" : "IP",
> > > > > > > >     "type-name" : "InputRecordType",
> > > > > > > >     "format" : "delimited-text",
> > > > > > > >     "delimiter" : "|",
> > > > > > > >     "upsert-feed" : "true"
> > > > > > > > };
> > > > > > > > *connect feed UserFeed to dataset EmpDataset WHERE
> > > > > > > > testlib#wordDetector(fname) = TRUE;*
> > > > > > > > start feed UserFeed;
> > > > > > > >
> > > > > > > > > The Java UDF used here is in [2]. This can help you filter out
> > > > > > > > > unwanted incoming data on the pipeline. :)
> > > > > > > > >
> > > > > > > > > [1] https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> > > > > > > > >
> > > > > > > > > [2] https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Xikui
> > > > > > > >
> > > > > > > > > On Sun, Nov 25, 2018 at 1:05 PM sandraskarshaug@gmail.com <
> > > > > > > > > sandraskarshaug@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Xikui,
> > > > > > > > >
> > > > > > > > > Thanks for your response!
> > > > > > > > > > We managed to cope with the problem by using the compressed
> > > > > > > > > > version of the model instead, but it is still 1.6 GB. However,
> > > > > > > > > > the project is able to build now :-) Yes, this is being packed
> > > > > > > > > > into the UDF jar at the moment. Do you have any examples that
> > > > > > > > > > illustrate how to use the resource file path as a UDF
> > > > > > > > > > parameter? That would be very helpful!
> > > > > > > > > >
> > > > > > > > > > In addition, I believe that the model loading, which is now
> > > > > > > > > > being executed during initialize(), prevents the incoming
> > > > > > > > > > tweets from being processed. This is evident because none of
> > > > > > > > > > the streaming elements are stored in AsterixDB when the model
> > > > > > > > > > loading is included in the code, whilst the elements are stored
> > > > > > > > > > when I exclude the model loading from the code. Is it possible
> > > > > > > > > > to make the model load, i.e. make initialize() run, prior to
> > > > > > > > > > the arrival of the tweets at the socket feed?
> > > > > > > > > >
> > > > > > > > > > Regarding our project, we are trying to detect tweets which are
> > > > > > > > > > relevant for a given "user query", where the goal is crisis
> > > > > > > > > > detection. So we are trying to filter out (i.e. _not_ store or
> > > > > > > > > > keep in the pipeline) tweets which do not contain the relevant
> > > > > > > > > > location etc. The model I've talked about is being used for
> > > > > > > > > > word embeddings (word2vec) :-)
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Sandra Skarshaug
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On 2018/11/24 17:55:27, Xikui Wang <xikuiw@uci.edu> wrote:
> > > > > > > > > > Hi Sandra,
> > > > > > > > > >
> > > > > > > > > > > How big is the model file that you are using? I guess you are
> > > > > > > > > > > trying to pack this model file into the UDF jar? I personally
> > > > > > > > > > > haven't seen this error before. It looks like an issue with
> > > > > > > > > > > Maven building big files. I found this thread on StackOverflow
> > > > > > > > > > > which describes a similar situation. Could you try the
> > > > > > > > > > > resolutions there?
> > > > > > > > > > >
> > > > > > > > > > > As a side note, if you need to use a big model file in a UDF,
> > > > > > > > > > > I wouldn't suggest packing it into your UDF jar file. This
> > > > > > > > > > > will significantly slow down your UDF installation, and you
> > > > > > > > > > > will spend a lot of time redeploying the resource file to the
> > > > > > > > > > > cluster even if you only need to update the UDF code.
> > > > > > > > > > > Alternatively, you could make the resource file path a UDF
> > > > > > > > > > > parameter, and let the UDF load that file when it initializes.
> > > > > > > > > > > This could make the installation much faster and avoid
> > > > > > > > > > > deploying the resource file multiple times, and the packing
> > > > > > > > > > > issue should be gone as well. :)
> > > > > > > > > > >
> > > > > > > > > > > PS If it's ok, could you tell us which use case you are
> > > > > > > > > > > working on? We would like to know how our customers use
> > > > > > > > > > > AsterixDB in different scenarios, so we can help them (you)
> > > > > > > > > > > better!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Xikui
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Sat, Nov 24, 2018 at 6:05 AM sandraskarshaug@gmail.com <
> > > > > > > > > > > sandraskarshaug@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi!
> > > > > > > > > > >
> > > > > > > > > > > > My master's thesis partner and I have added a quite large
> > > > > > > > > > > > model for word embeddings (word2vec) to our project. This is
> > > > > > > > > > > > supposed to be loaded in the initialize phase of the UDF and
> > > > > > > > > > > > be used for evaluating the incoming records.
> > > > > > > > > > > >
> > > > > > > > > > > > However, when trying to build the Maven project before
> > > > > > > > > > > > deploying it to AsterixDB, we get the error "Error assembling
> > > > > > > > > > > > JAR, invalid entry size". Is this a problem anyone else has
> > > > > > > > > > > > faced, for instance when using machine learning models in
> > > > > > > > > > > > AsterixDB?
> > > > > > > > > > >
> > > > > > > > > > > If so, we appreciate any help!
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Sandra
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
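
The advice earlier in this thread, to make sure the UDF parameters are specified in the config file and accessed safely in initialize(), can be sketched as follows. This is a minimal, self-contained illustration rather than AsterixDB code: it assumes the parameters declared in library_descriptor.xml reach the UDF as a list of strings, and the class and method names are hypothetical. Checking the index and the file up front turns a bare IndexOutOfBoundsException into a message that names the missing parameter or unreadable path.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for validating a resource-path parameter before loading a model. */
public class UdfParameterCheck {

    /** Returns the resource path at the given parameter index, or fails with a descriptive message. */
    public static Path resolveResourcePath(List<String> parameters, int index) {
        if (parameters == null || index < 0 || index >= parameters.size()) {
            int count = parameters == null ? 0 : parameters.size();
            throw new IllegalArgumentException(
                    "Expected a UDF parameter at index " + index + " but " + count + " were configured");
        }
        Path path = Paths.get(parameters.get(index).trim());
        if (!Files.isReadable(path)) {
            throw new IllegalArgumentException("Resource file is not readable: " + path.toAbsolutePath());
        }
        return path;
    }

    public static void main(String[] args) throws Exception {
        // A temporary file stands in for the real model file.
        Path model = Files.createTempFile("model", ".bin.gz");
        System.out.println(resolveResourcePath(Arrays.asList(model.toString()), 0));
        try {
            resolveResourcePath(Collections.<String>emptyList(), 0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```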
