mxnet-dev mailing list archives

From Gaurav Gireesh <gaurav.gire...@gmail.com>
Subject Re: MXNet - Gluon - Audio
Date Tue, 20 Nov 2018 19:20:18 GMT
Hi All!
Following up on this PR:
https://github.com/apache/incubator-mxnet/pull/13241
I would need some comments or feedback regarding the API design:
https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio

The comments on the PR were mostly around *librosa* and its performance
being a potential blocker if and when the designed API is tested with
bigger ASR models such as DeepSpeech 2 and DeepSpeech 3.
I would appreciate it if the community could share their expertise/knowledge
on loading audio data and on the feature extraction currently used with
bigger ASR models.
If there is anything in the design that could be changed/improved to improve
the performance, I'll be happy to look into it.
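
For context, the current design uses librosa for decoding and feature
extraction, so every sample takes a numpy round trip before it becomes an
MXNet NDArray. Below is a minimal sketch of that path (not the exact code in
the PR, and the file name is just a placeholder) to illustrate where the
conversion cost comes from:

import librosa
import mxnet as mx

# Decode the audio file and extract MFCC features with librosa
# (pure Python/numpy).
samples, sample_rate = librosa.load("speech.wav", sr=22050)
mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=20)

# The numpy -> NDArray copy below happens for every sample in the dataset,
# which is where much of the overhead comes from.
features = mx.nd.array(mfcc)
print(features.shape)  # (20, number_of_frames)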

Thanks and regards,
Gaurav Gireesh

On Thu, Nov 15, 2018 at 10:47 AM Gaurav Gireesh <gaurav.gireesh@gmail.com>
wrote:

> Hi Lai!
> Thank you for your comments!
> Below are the answers to your comments/queries:
> 1) That's a good suggestion. I have added an example related to this in
> the pull request:
> https://github.com/apache/incubator-mxnet/pull/13241/commits/eabb68256d8fd603a0075eafcd8947d92e7df27f
> I would be happy to include a dataset similar to MNIST to support that. I
> have come across an example dataset used in the TensorFlow speech
> recognition tutorial here
> <https://www.tensorflow.org/tutorials/sequences/audio_recognition>. This
> could be included.
>
> 2) Thank you for the suggestion, I shall look into the FFT operator that
> you have pointed out. However, there are other kinds of features, such as
> MFCCs and mel spectrograms, which are popular for feature extraction from
> audio data and would be useful to have as well. I am not sure whether we
> have operators for these; a rough sketch of what I have in mind follows
> after point 3.
>
> 3) The references look good too. I shall look into them. Thank you for
> bringing them to my notice.
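>
> On point 2, here is a rough sketch of what I mean (this is not part of the
> PR; the shapes and parameters are only illustrative): the mel filterbank
> could be precomputed once with librosa, and the per-sample projection could
> then be done with native MXNet operators:
>
> import librosa
> import mxnet as mx
>
> # Precompute the mel filterbank once on the CPU with librosa.
> sr, n_fft, n_mels = 22050, 2048, 128
> mel_basis = mx.nd.array(
>     librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels))
>
> # Given a magnitude/power spectrogram already in an MXNet NDArray with
> # shape (1 + n_fft // 2, n_frames), the mel projection is just a matrix
> # multiply, which MXNet can do natively:
> def mel_spectrogram(spec):
>     return mx.nd.dot(mel_basis, spec)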
>
> Regards,
> Gaurav
>
> On Tue, Nov 13, 2018 at 11:22 AM Lai Wei <royweilai@gmail.com> wrote:
>
>> Hi Gaurav,
>>
>> Thanks for starting this. I see the PR is out
>> <https://github.com/apache/incubator-mxnet/pull/13241> and have left some
>> initial reviews. Good work!
>>
>> In addition to Sandeep's queries, I have the following:
>> 1. Can we include a simple, classic audio dataset for users to directly
>> import and try out, like MNIST in vision? (e.g.
>> http://pytorch.org/audio/datasets.html#yesno)
>> 2. Librosa provides some good audio feature extraction, so we can use it
>> for now, but it is slow because you have to convert between NDArray and
>> numpy. In the long term, can we make the transforms use MXNet operators
>> and turn your transforms into hybrid blocks? For example, the MXNet FFT
>> operator
>> <https://mxnet.apache.org/api/python/ndarray/contrib.html?highlight=fft#mxnet.ndarray.contrib.fft>
>> can be used in a hybrid block transform, which would be a lot faster.
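>>
>> A rough, untested sketch of the kind of hybrid transform I mean (the class
>> name is just a placeholder, not an existing API):
>>
>> import mxnet as mx
>> from mxnet.gluon import HybridBlock
>>
>> class FFTTransform(HybridBlock):
>>     """Placeholder transform wrapping the contrib FFT operator so the
>>     feature extraction can be hybridized and stay in NDArray."""
>>     def hybrid_forward(self, F, x):
>>         # contrib.fft applies a 1D FFT along the last axis of a (batch, d)
>>         # input and returns (batch, 2*d), real and imaginary interleaved.
>>         return F.contrib.fft(x)
>>
>> # Usage (as far as I know, the contrib FFT op currently needs a GPU):
>> # out = FFTTransform()(mx.nd.ones((2, 64), ctx=mx.gpu(0)))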
>>
>> Some additional references on users already using MXNet for audio; we
>> should aim to make it easier for them and automate the file
>> load/preprocess/transform process:
>> 1. https://github.com/chen0040/mxnet-audio
>> 2. https://github.com/shuokay/mxnet-wavenet
>>
>> Looking forward to seeing this feature out.
>> Thanks!
>>
>> Best Regards
>>
>> Lai
>>
>>
>> On Tue, Nov 13, 2018 at 9:09 AM sandeep krishnamurthy <
>> sandeep.krishna98@gmail.com> wrote:
>>
>> > Thanks, Gaurav, for starting this initiative. The design document is
>> > detailed and gives all the necessary information.
>> > Starting by adding this in "Contrib" is a good idea, since we expect a
>> > few rough edges and cleanups to follow.
>> >
>> > I had the following queries:
>> > 1. Is there any analysis comparing LibROSA with other libraries w.r.t.
>> > features, performance, and community usage in the audio domain?
>> > 2. What is the recommendation for the LibROSA dependency? Should it be
>> > part of the MXNet PyPI package, or should the user be asked to install
>> > it if required? I prefer the latter, similar to protobuf in ONNX-MXNet.
>> > 3. I see LibROSA is a fully Python-based library. Will we get blocked
>> > by this dependency in future use cases, when we want to implement the
>> > transformations as operators and allow for cross-language support?
>> > 4. In the performance design considerations, the performance difference
>> > with lazy=True vs. lazy=False is too scary (8 minutes to 4 hours!).
>> > This requires some more analysis. If we know that turning a flag on/off
>> > causes a 24x performance degradation, should we provide that control to
>> > the user at all? What is the impact of this on memory usage?
>> > 5. I see LibROSA has an ISC license
>> > (https://github.com/librosa/librosa/blob/master/LICENSE.md), which says
>> > it is free to use as long as the same license notice is kept. I am not
>> > sure if this is OK; I request other committers/mentors to advise.
>> >
>> > Best,
>> > Sandeep
>> >
>> > On Fri, Nov 9, 2018 at 5:45 PM Gaurav Gireesh <gaurav.gireesh@gmail.com>
>> > wrote:
>> >
>> > > Dear MXNet Community,
>> > >
>> > > I recently started looking into performing some simple multi-class
>> > > sound classification tasks with audio data and realized that, as a
>> > > user, I would like MXNet to have an out-of-the-box feature which
>> > > allows us to load audio data (at least one file format), extract
>> > > features (or apply some common transforms/feature extraction), and
>> > > train a model on an audio dataset.
>> > > This could be a first step towards building and supporting APIs
>> > > similar to what we have for "vision" related use cases in MXNet.
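>> > >
>> > > To make the desired workflow concrete, here is a rough sketch (the
>> > > class below is only an illustrative placeholder, not the proposed
>> > > API) of loading .wav files from class-named subfolders and extracting
>> > > MFCC features so they can be fed to a Gluon DataLoader:
>> > >
>> > > import os
>> > > import librosa
>> > > import mxnet as mx
>> > > from mxnet import gluon
>> > >
>> > > class SimpleAudioDataset(gluon.data.Dataset):
>> > >     """Illustrative placeholder: (MFCC features, label) per file."""
>> > >     def __init__(self, root):
>> > >         self._items = []
>> > >         for label, name in enumerate(sorted(os.listdir(root))):
>> > >             folder = os.path.join(root, name)
>> > >             for f in os.listdir(folder):
>> > >                 if f.endswith('.wav'):
>> > >                     self._items.append((os.path.join(folder, f), label))
>> > >
>> > >     def __len__(self):
>> > >         return len(self._items)
>> > >
>> > >     def __getitem__(self, idx):
>> > >         path, label = self._items[idx]
>> > >         samples, sr = librosa.load(path, sr=22050)
>> > >         return mx.nd.array(librosa.feature.mfcc(y=samples, sr=sr)), label
>> > >
>> > > # Batching real data would also need padding/trimming to a fixed length:
>> > > # loader = gluon.data.DataLoader(SimpleAudioDataset('./data'),
>> > > #                                batch_size=32, shuffle=True)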
>> > >
>> > > Below is the design proposal:
>> > >
>> > > Gluon - Audio Design Proposal
>> > > <https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio>
>> > >
>> > > I would highly appreciate your taking the time to review it and
>> > > provide feedback, comments, and suggestions.
>> > > Looking forward to your support.
>> > >
>> > >
>> > > Best Regards,
>> > >
>> > > Gaurav Gireesh
>> > >
>> >
>> >
>> > --
>> > Sandeep Krishnamurthy
>> >
>>
>
