mxnet-dev mailing list archives

From Sheng Zha <zhash...@apache.org>
Subject Re: MXNet - Gluon - Audio
Date Tue, 20 Nov 2018 20:14:37 GMT
Hi Gaurav,

The performance concern is not just librosa itself, but also the way it is integrated. As a
Python library, librosa requires holding the GIL when it is called, which makes asynchronous
data preprocessing during training difficult. Also, the API design hasn't been verified on the
more full-fledged use cases that you outlined. Given that, and the lack of audio-processing
expertise among those reviewing the design doc, my suggestion is to continue the work as a
Gluon example until other use cases are adopted, which is what you started in
https://github.com/apache/incubator-mxnet/pull/13325.
Once you make more progress and become more familiar with the Gluon design, please report
back to this thread and I'd be happy to help more with the review.
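
To make the GIL point concrete: the transforms librosa provides are, at bottom, array math,
and only an operator-level implementation of that math could run without holding the GIL.
Below is a minimal NumPy sketch of an STFT magnitude spectrogram, the kind of feature
involved (all names here are illustrative, not part of any proposed API):

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Frame the signal, apply a Hann window, and take the magnitude
    of the real FFT of each frame. This is the kind of computation
    librosa performs in Python; expressed as pure array operators it
    could in principle become a native (GIL-free) op."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft]
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1))

# 1 second of a 440 Hz tone sampled at 16 kHz
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = stft_magnitude(wave)
print(spec.shape)  # (61, 257): 61 frames, n_fft // 2 + 1 frequency bins
```

An NDArray version of the same arithmetic, wrapped in a hybrid block, is what would make
asynchronous preprocessing feasible.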

-sz

On 2018/11/20 19:20:18, Gaurav Gireesh <gaurav.gireesh@gmail.com> wrote: 
> Hi All!
> Following up on this PR:
> https://github.com/apache/incubator-mxnet/pull/13241
> I would need some comments or feedback regarding the API design :
> https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio
> 
> The comments on the PR were mostly around librosa and its performance
> being a blocker if and when the designed API is tested with bigger ASR
> models such as DeepSpeech 2 and DeepSpeech 3.
> I would appreciate it if the community could share their expertise/knowledge on
> loading audio data and the feature extraction currently used with bigger ASR
> models.
> If there is anything in the design which could be changed/improved to
> improve the performance, I'll be happy to look into it.
> 
> Thanks and regards,
> Gaurav Gireesh
> 
> On Thu, Nov 15, 2018 at 10:47 AM Gaurav Gireesh <gaurav.gireesh@gmail.com>
> wrote:
> 
> > Hi Lai!
> > Thank you for your comments!
> > Below are the answers to your comments/queries:
> > 1) That's a good suggestion. I have added a related example in the pull
> > request:
> > https://github.com/apache/incubator-mxnet/pull/13241/commits/eabb68256d8fd603a0075eafcd8947d92e7df27f
> > I would be happy to include a dataset similar to MNIST to support that. I
> > have come across an example dataset used in the TensorFlow speech
> > recognition example here
> > <https://www.tensorflow.org/tutorials/sequences/audio_recognition>. This
> > could be included.
> >
> > 2) Thank you for the suggestion, I shall look into the FFT operator that
> > you have pointed out. However, there are other kinds of features, like MFCC
> > and mel spectrograms, which are popular for audio feature extraction and
> > would find utility if implemented. I am not sure if we have operators for
> > these.
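
As a sketch of what a native mel feature could build on (illustrative NumPy only; none of
this corresponds to an existing MXNet operator): a mel filterbank is just a fixed matrix of
triangular filters applied to the FFT magnitude spectrum.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Build triangular mel filters mapping an rfft magnitude spectrum
    (n_fft // 2 + 1 bins) down to n_mels mel bands."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        # rising and falling edges of the triangular filter
        fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (40, 257): multiply with a spectrogram to get mel bands
```

A mel spectrogram is then a single matrix multiply against the STFT output, and MFCCs add a
log and a DCT on top, so all of these reduce to array operators.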
> >
> > 3) The references look good too. I shall look into them. Thank you for
> > bringing them to my attention.
> >
> > Regards,
> > Gaurav
> >
> > On Tue, Nov 13, 2018 at 11:22 AM Lai Wei <royweilai@gmail.com> wrote:
> >
> >> Hi Gaurav,
> >>
> >> Thanks for starting this. I see the PR is out
> >> <https://github.com/apache/incubator-mxnet/pull/13241> and left some
> >> initial reviews. Good work!
> >>
> >> In addition to Sandeep's queries, I have the following:
> >> 1. Can we include a simple classic audio dataset for users to directly
> >> import and try out, like MNIST in vision? (e.g.
> >> http://pytorch.org/audio/datasets.html#yesno)
> >> 2. Librosa provides some good audio feature extraction and we can use it
> >> for now, but it is slow because you have to convert back and forth
> >> between NDArray and NumPy. In the long term, can we make the transforms
> >> use MXNet operators and turn them into hybrid blocks? For example, the
> >> MXNet FFT operator
> >> <https://mxnet.apache.org/api/python/ndarray/contrib.html?highlight=fft#mxnet.ndarray.contrib.fft>
> >> can be used in a hybrid block transform, which will be a lot faster.
> >>
> >> Here are some additional references of users already applying MXNet to
> >> audio; we should aim to make it easier and automate the file
> >> load/preprocess/transform process:
> >> 1. https://github.com/chen0040/mxnet-audio
> >> 2. https://github.com/shuokay/mxnet-wavenet
> >>
> >> Looking forward to seeing this feature out.
> >> Thanks!
> >>
> >> Best Regards
> >>
> >> Lai
> >>
> >>
> >> On Tue, Nov 13, 2018 at 9:09 AM sandeep krishnamurthy <
> >> sandeep.krishna98@gmail.com> wrote:
> >>
> >> > Thanks, Gaurav, for starting this initiative. The design document is
> >> > detailed and gives all the information.
> >> > Starting to add this in "contrib" is a good idea, while we expect a few
> >> > rough edges and cleanups to follow.
> >> >
> >> > I had the following queries:
> >> > 1. Is there any analysis comparing LibROSA with other libraries w.r.t.
> >> > features, performance, and community usage in the audio domain?
> >> > 2. What is the recommendation for the LibROSA dependency? Part of the
> >> > MXNet PyPI package, or ask the user to install it if required? I prefer
> >> > the latter, similar to protobuf in ONNX-MXNet.
> >> > 3. I see LibROSA is a fully Python-based library. Are we getting blocked
> >> > by this dependency for future use cases, when we want to implement
> >> > transformations as operators and allow cross-language support?
> >> > 4. In the performance design considerations, the performance difference
> >> > with lazy=True / False is too scary (8 minutes to 4 hours!). This
> >> > requires some more analysis. If we know that turning a flag on/off
> >> > causes a 24x performance degradation, should we provide that control to
> >> > the user? What is the impact of this on memory usage?
> >> > 5. I see LibROSA has an ISC license (
> >> > https://github.com/librosa/librosa/blob/master/LICENSE.md), which says
> >> > it is free to use with the same license notification. I am not sure if
> >> > this is OK. I request other committers/mentors to advise.
> >> >
> >> > Best,
> >> > Sandeep
> >> >
> >> > On Fri, Nov 9, 2018 at 5:45 PM Gaurav Gireesh <gaurav.gireesh@gmail.com>
> >> > wrote:
> >> >
> >> > > Dear MXNet Community,
> >> > >
> >> > > I recently started looking into performing some simple multi-class
> >> > > sound classification tasks with audio data, and realized that, as a
> >> > > user, I would like MXNet to have an out-of-the-box feature which
> >> > > allows us to load audio data (at least one file format), extract
> >> > > features (or apply some common transforms/feature extraction), and
> >> > > train a model using the audio dataset.
> >> > > This could be a first step towards building and supporting APIs
> >> > > similar to what we have for "vision"-related use cases in MXNet.
> >> > >
> >> > > Below is the design proposal :
> >> > >
> >> > > Gluon - Audio Design Proposal
> >> > > <https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio>
> >> > >
> >> > > I would highly appreciate you taking the time to review and provide
> >> > > feedback, comments, and suggestions on this.
> >> > > Looking forward to your support.
> >> > >
> >> > >
> >> > > Best Regards,
> >> > >
> >> > > Gaurav Gireesh
> >> > >
> >> >
> >> >
> >> > --
> >> > Sandeep Krishnamurthy
> >> >
> >>
> >
> 
