hama-dev mailing list archives

From 顾荣 <gurongwal...@gmail.com>
Subject Re: Does Hama Graph provide any file reader interface at runtime?
Date Thu, 20 Sep 2012 15:40:01 GMT
Hi Thomas,

I have not uploaded cNeural to the web yet, because I wrote the code tightly
coupled to the configuration of the Hadoop and HBase setup running on our lab
cluster. It was originally an experiment for scientific research purposes. As
you suggested, I will organize the code and upload it to the web. I am now
also planning to implement my algorithm in Hama BSP; it seems a good fit.

I will send you a message in this mail thread if there is more progress
later :)

Nice to talk to you.

Walker

2012/9/20 Thomas Jungblut <thomas.jungblut@gmail.com>

> Hey Walker,
>
> cool thing, can you share a link to your cNeural library?
> Martin is working on GPU support for Hama in relation to Hama pipes (
> https://issues.apache.org/jira/browse/HAMA-619).
> He wants to go the native way, but personally I have had pretty good
> experiences with JCUDA; I used it in my neural net implementation for
> large numbers of input neurons in image recognition tasks.
> However, that only covers making the matrix multiplications faster,
> which is usually where most of the time goes.
>
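> For what it's worth, the JCublas part is only a few lines. Here is a rough
> sketch of a single-precision matrix multiply written from memory, so treat
> the exact method signatures as an assumption rather than gospel:
>
>   import jcuda.Pointer;
>   import jcuda.jcublas.JCublas;
>
>   // Multiplies two n-by-n matrices (column-major float[] a and b) on the
>   // GPU; in backprop this is the hot spot worth offloading.
>   public static float[] multiply(int n, float[] a, float[] b) {
>     float[] c = new float[n * n];
>     JCublas.cublasInit();
>     Pointer dA = new Pointer(), dB = new Pointer(), dC = new Pointer();
>     JCublas.cublasAlloc(n * n, 4, dA);   // 4 bytes per float
>     JCublas.cublasAlloc(n * n, 4, dB);
>     JCublas.cublasAlloc(n * n, 4, dC);
>     JCublas.cublasSetVector(n * n, 4, Pointer.to(a), 1, dA, 1);
>     JCublas.cublasSetVector(n * n, 4, Pointer.to(b), 1, dB, 1);
>     // C = 1.0 * A * B + 0.0 * C
>     JCublas.cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
>     JCublas.cublasGetVector(n * n, 4, dC, 1, Pointer.to(c), 1);
>     JCublas.cublasFree(dA);
>     JCublas.cublasFree(dB);
>     JCublas.cublasFree(dC);
>     JCublas.cublasShutdown();
>     return c;
>   }
>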
> The HBase storage is really interesting, thanks for sharing!
>
> 2012/9/20 顾荣 <gurongwalker@gmail.com>
>
> > Hi, Thomas.
> >
> > I read your blog and GitHub posts about training NNs on Hama a few days
> > ago. I agree with you on this topic, based on my own experience
> > implementing NNs in a distributed way.
> > That was before I knew about the Hama project, so I implemented my own
> > distributed system for training NNs on large-scale training data; the
> > system is called cNeural.
> > It basically follows a master/slave architecture. I adopted Hadoop RPC
> > for communication and HBase for storing the large-scale training dataset,
> > and I used a batch-mode BP training algorithm.
> >
> > BTW, HBase is very well suited to storing training data sets for machine
> > learning. No matter how large a training data set is, an HTable can
> > easily store it across many region servers.
> > Each training sample can be stored as a record in the HTable, even if it
> > is sparsely coded. Furthermore, HBase provides random access to
> > individual training samples. In my experience, it is much better to store
> > this kind of structured data in HBase than directly in HDFS.
> >
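> > Just to make the layout concrete, here is a rough sketch using the plain
> > HTable/Put/Get client API; the table name, column family and the
> > comma-separated encoding are only made up for illustration:
> >
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >   import org.apache.hadoop.hbase.client.Get;
> >   import org.apache.hadoop.hbase.client.HTable;
> >   import org.apache.hadoop.hbase.client.Put;
> >   import org.apache.hadoop.hbase.client.Result;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   public class TrainingSampleStore {
> >     public static void main(String[] args) throws Exception {
> >       Configuration conf = HBaseConfiguration.create();
> >       HTable table = new HTable(conf, "training_samples");
> >
> >       // one row per sample: features and label in one column family
> >       Put put = new Put(Bytes.toBytes("sample-000001"));
> >       put.add(Bytes.toBytes("d"), Bytes.toBytes("features"),
> >           Bytes.toBytes("0.1,0.7,0.3"));
> >       put.add(Bytes.toBytes("d"), Bytes.toBytes("label"),
> >           Bytes.toBytes("1"));
> >       table.put(put);
> >
> >       // random access to any sample by its row key
> >       Result r = table.get(new Get(Bytes.toBytes("sample-000001")));
> >       String features = Bytes.toString(
> >           r.getValue(Bytes.toBytes("d"), Bytes.toBytes("features")));
> >       System.out.println(features);
> >       table.close();
> >     }
> >   }
> >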
> > Back to this topic: as you mentioned, I can read training data directly
> > from HDFS through the HDFS API during the setup stage of the vertex.
> > I had also considered this, and I have known how to use the HDFS API for
> > a long time, but thanks for the hint anyway :)
> > However, I am afraid it may cost quite a lot of time: for a large-scale
> > NN with thousands of neurons, every neuron vertex reading the same
> > training sample almost simultaneously would cause a lot of network
> > traffic and put too much stress on HDFS. What's more, it seems
> > unnecessary. I planned to select a master vertex responsible for reading
> > samples from HDFS and to initialize each input neuron by sending it the
> > corresponding feature value.
> > However, even if I can do this, there are a lot more tough problems to
> > solve, such as partitioning. As you said, controlling this training
> > workflow in a distributed way is too complex, and with so much network
> > communication and distributed synchronization it would be much slower
> > than the sequential program executed on a single machine. In short, this
> > kind of distribution will probably lead to no improvement, only lower
> > speed and higher complexity. As for the high dimensionalities you
> > mentioned, I suggest using a GPU to handle them; distribution may not be
> > a good solution in that case. Of course, we can combine GPUs with Hama,
> > and I believe that will be necessary in the near future.
> >
> > As I mentioned at the beginning of this mail, I implemented cNeural, and
> > I also compared cNeural with Hadoop for solving this problem.
> > The experimental results can be found in the attachment of this mail. In
> > general, cNeural adopts a parallel strategy much like the BSP model, so I
> > am about to reimplement cNeural on Hama BSP. I learned Hama Graph this
> > week, came across the idea of implementing NNs on Hama Graph, thought
> > about this case, and asked this question. I agree with your analysis.
> >
> > Regards,
> > Walker.
> >
> >
> >
> > 2012/9/20 Thomas Jungblut <thomas.jungblut@gmail.com>
> >
> >> Hi,
> >>
> >> nice idea, but I'm honestly unsure whether the graph module really fits
> >> your needs.
> >> In backprop you need to set the inputs of the different neurons in your
> >> input layer and forward-propagate them until you reach the output layer.
> >> Calculating the error for this single step would, in your architecture,
> >> consume many supersteps. In my opinion this is totally inefficient, but
> >> let's set that thought aside.
> >>
> >> Assume you have an n-by-m matrix which contains your whole training set,
> >> and the m-th column holds the outcome for the preceding feature columns.
> >> An input vertex should be able to read a row of its corresponding column
> >> vector from the training set, and the output neurons need to do the
> >> same.
> >> The good news is that you can do this by reading a file within the setup
> >> function of a vertex, or by reading it line by line when compute is
> >> called. You can access filesystems with the Hadoop DFS API pretty
> >> easily. Just type it into your favourite search engine; the class is
> >> simply called FileSystem and you can get an instance via
> >> FileSystem.get(Configuration conf).
> >>
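> >> Something like this rough sketch; the path and the idea of hooking it
> >> into the vertex's setup are just assumptions for illustration:
> >>
> >>   import java.io.BufferedReader;
> >>   import java.io.IOException;
> >>   import java.io.InputStreamReader;
> >>   import org.apache.hadoop.conf.Configuration;
> >>   import org.apache.hadoop.fs.FileSystem;
> >>   import org.apache.hadoop.fs.Path;
> >>
> >>   // Reads the training set line by line from HDFS, e.g. from setup().
> >>   public static void readTrainingSet(Configuration conf)
> >>       throws IOException {
> >>     FileSystem fs = FileSystem.get(conf);
> >>     Path path = new Path("/user/walker/trainingset.txt"); // made-up path
> >>     BufferedReader reader =
> >>         new BufferedReader(new InputStreamReader(fs.open(path)));
> >>     try {
> >>       String line;
> >>       while ((line = reader.readLine()) != null) {
> >>         // parse one training sample, e.g. comma-separated features + label
> >>       }
> >>     } finally {
> >>       reader.close();
> >>     }
> >>   }
> >>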
> >> Now here is my experience with raw BSP and neural networks, if you weigh
> >> it against the graph module:
> >> - partition the neurons horizontally (across the layers), not by layer
> >> - weights must be averaged across multiple tasks (see the sketch below)
> >>
> >> I came to the conclusion that it is far better to implement a function
> >> optimizer with raw BSP to train the weights (simple stochastic gradient
> >> descent totally works out for almost every normal use case if your
> >> network has a convex cost function).
> >> Of course this doesn't work out well for higher dimensionalities, but
> >> more data usually wins, even with simpler models. In the end you can
> >> always boost it anyway.
> >>
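> >> The weight averaging maps onto a single superstep. A rough sketch,
> >> assuming the send/sync/getCurrentMessage calls of Hama's BSPPeer as used
> >> in the bundled examples; one DoubleWritable stands in for a whole weight
> >> vector:
> >>
> >>   import java.io.IOException;
> >>   import org.apache.hadoop.io.DoubleWritable;
> >>   import org.apache.hadoop.io.NullWritable;
> >>   import org.apache.hama.bsp.BSP;
> >>   import org.apache.hama.bsp.BSPPeer;
> >>   import org.apache.hama.bsp.sync.SyncException;
> >>
> >>   public class WeightAveragingBSP extends
> >>       BSP<NullWritable, NullWritable, NullWritable, NullWritable,
> >>           DoubleWritable> {
> >>
> >>     private double weight; // local copy of one model weight
> >>
> >>     @Override
> >>     public void bsp(BSPPeer<NullWritable, NullWritable, NullWritable,
> >>         NullWritable, DoubleWritable> peer)
> >>         throws IOException, SyncException, InterruptedException {
> >>       // ... run a local SGD pass over this task's slice of the training
> >>       // data, updating 'weight' in place ...
> >>
> >>       // broadcast the locally trained weight to every peer
> >>       for (String other : peer.getAllPeerNames()) {
> >>         peer.send(other, new DoubleWritable(weight));
> >>       }
> >>       peer.sync(); // superstep barrier, messages arrive after this
> >>
> >>       // replace the local weight with the average over all tasks
> >>       double sum = 0;
> >>       int count = 0;
> >>       DoubleWritable msg;
> >>       while ((msg = peer.getCurrentMessage()) != null) {
> >>         sum += msg.get();
> >>         count++;
> >>       }
> >>       if (count > 0) {
> >>         weight = sum / count;
> >>       }
> >>     }
> >>   }
> >>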
> >> I will of course support you on this if you like; I'm fairly certain
> >> that your approach can work, but it will be slow as hell.
> >> Just my usual two cents on various topics ;)
> >>
> >> 2012/9/20 顾荣 <gurongwalker@gmail.com>
> >>
> >> > Hi, guys.
> >> >
> >> > As you are calling for application programs on Hama in the *Future
> >> > Plans* section of the Hama programming wiki here (
> >> > https://issues.apache.org/jira/secure/attachment/12528218/ApacheHamaBSPProgrammingmodel.pdf
> >> > ), and as I am very interested in machine learning, I plan to implement
> >> > neural networks (e.g. a multilayer perceptron trained with BP) on Hama.
> >> > Hama seems to be a nice tool for training large-scale neural networks.
> >> > Especially for those with a large structure (many hidden layers and
> >> > many neurons), Hama Graph looks like a good solution. We can regard
> >> > each neuron in the NN (neural network) as a vertex in Hama Graph, and
> >> > the links between neurons as edges in the graph. Then the training
> >> > process can be regarded as updating the weights of the edges between
> >> > vertices. However, I have run into a problem with the current Hama
> >> > Graph implementation.
> >> >
> >> > Let me explain. As you may know, during the training process of many
> >> > machine learning algorithms we need to feed many training samples into
> >> > the model one by one. Usually, more training samples lead to a more
> >> > precise model. However, as far as I know, the only input file interface
> >> > provided by Hama Graph is the one for the graph structure. Sadly, it is
> >> > hard to read and distribute the training samples at runtime, since
> >> > users can only express their computing logic by overriding a few key
> >> > functions such as compute() in the Vertex class. So, does Hama Graph
> >> > provide any flexible file reading interface for users at runtime?
> >> >
> >> > Thanks in advance.
> >> >
> >> > Walker.
> >> >
> >>
> >
> >
>
