CS 682 Speech Processing

2018-12-02T14:22:08+00:00

Assignments

Due dates for assignments that require work to be turned in are posted on the calendar.  Use the ICAL address to add this to a personal calendar if you wish.

Graded assignments

  • A1 – Statistics, DFT
  • A2 – DFT, PCA, GMM
  • Read and write a 2-page summary of Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J Mach Learn Res 15, 1929-1958.
  • Lab 1
  • A3 – Speech production, perception
  • Lab 2 and Reading Summary 2

Ungraded assignments

  • U1 – Learn the basics of Python 3.6.  You may wish to start with a quick video overview if you have never used Python.  From there, use one of the suggested books in the Materials section such as Martelli et al. (2017) or Reitz (2016), both of which are available online from the library.  The materials section also has information on setting up your development environment on a home machine and using it in departmental labs.

Calendar

Please note that all dates except the final exam are tentative.  My primary concern is that you master the material and the schedule may be adjusted in either direction to optimize comprehension and scope of material.

Week of:

  1. Aug 28 – Introduction and elementary statistics, framing, RMS.
    Readings: statistics 3-3.9, framing (ch. 9 – 9.3.2, Jurafsky and Martin, 2009)
  2. Sept 4 – Pressure, Intensity, dB, Fourier transforms and spectrograms, features, and dimension reduction
    Readings: pressure, intensity, and spectrograms (Univ. of Rhode Island Graduate School of Oceanography and Marine Acoustics, 2017); 9.3.3 of Jurafsky and Martin (2009); principal components analysis, 2-2.2.2 of Dillon and Goldstein (1984)
  3. Sept 11 – Machine learning concepts (5-5.4); we will also introduce elements from gradients and optimization (4)
  4. Sept 18 – contd (5.5-onwards)
  5. Sept 25* – Deep feedforward networks (6-6.5.3)
  6. Oct 2 – contd., Regularization (7-7.5), dropout (Srivastava et al., 2014)
  7. Oct 9 – Speech perception (2.4, Rabiner and Juang, 1993)
  8. Oct 16 – contd., EXAM I – Tuesday, Oct 16
  9. Oct 23 – Optimization (8-8.3.2)
  10. Oct 30 – Sequence modeling (10-10.2.2, 10.10)
  11. Nov 6* – Sequence modeling and Practical Methodology (11.1-5)
  12. Nov 13 – Use of deep nets in automatic speech recognition (12.3)
  13. Nov 20 – Language models (Jurafsky and Martin 12.4-12.4.4, Gale and Sampson, 1995)
    Nov 22: Thanksgiving – no class
  14. Nov 27 – contd.
  15. Dec 4* – End to end large vocabulary speech recognition (Amodei et al., 2016)
  16. Dec 11 – contd. (Tuesday December 11 is the last day of our class.)

Fall semester last day of classes:  Wednesday, December 12, 2018.

Dr. Kaitlin Palmer will be the instructor during weeks marked with an asterisk.

Final exam:  Tuesday, December 13th, 10:30 – 12:30 in our usual classroom.  No makeup exams will be given for students leaving town early without an excused absence.

Python quick intro:  https://www.youtube.com/watch?v=N4mEzFDjqtA

Unless otherwise specified, all readings are from Goodfellow et al. (2016) and list section numbers.  Remember that the text is available online in addition to print.

References

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., and Zhu, Z. (2016). “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,” in Proceedings of The 33rd International Conference on Machine Learning, edited by M. F. Balcan and K. Q. Weinberger (PMLR, Proceedings of Machine Learning Research), pp. 173–182.

Dillon, W. R., and Goldstein, M. (1984). Multivariate Analysis: Methods and Applications (John Wiley & Sons, New York), pp. xii, 587.

Gale, W. A., and Sampson, G. (1995). “Good-Turing frequency estimation without tears,” Journal of Quantitative Linguistics 2(3), 217-237.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning (The MIT Press, Cambridge, Massachusetts), pp. xxii, 775.

Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing (Pearson Prentice Hall, Upper Saddle River, NJ)

Rabiner, L. R., and Juang, B.-H. (1993). Fundamentals of speech recognition (Prentice-Hall, Englewood Cliffs, NJ 07632)

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J Mach Learn Res 15, 1929-1958.

Univ. of Rhode Island Graduate School of Oceanography and Marine Acoustics, Inc. (2017). “Discovery of Sound in the Sea,” Accessed August 1, 2017. http://dosits.org.

Textbooks  

Since the introduction of deep learning into speech processing by Dahl et al. (2010) and Deng et al. (2010), the field has changed rapidly, and deep learning is now the dominant method in speech processing applications.  As textbooks on deep learning are just starting to come out, we will use a deep learning book supplemented with readings on speech.

Required:

  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning (The MIT Press, Cambridge, Massachusetts), pp. xxii, 775.  Available in print at the bookstore or freely available online.

Programming exercises will be implemented using Python 3.6 and TensorFlow.  While we will briefly introduce Python in class, we will not devote much time to Python programming itself, as computer science graduate students (and advanced undergraduates) should be able to pick up new languages fairly easily.  Optional books to help you with this are freely available from the SDSU Library through Safari Technical Books:

  • Martelli, A., Ravenscroft, A., and Holden, S. (2017). Python in a Nutshell, 3rd Edition (O’Reilly Media, Inc., Sebastopol, CA) or
  • Reitz, K. (2016). The Hitchhiker’s Guide to Python: Best Practices for Development (O’Reilly Media, Inc., Sebastopol, CA)

In addition, the python.org tutorial is quite good.

Programming environment 

Python is rapidly becoming one of the most popular languages for machine learning.  This is primarily due to a large number of scientific libraries such as NumPy and SciPy, coupled with popular machine learning libraries such as Theano, TensorFlow, and PyTorch, as well as higher-level interfaces such as Keras.  In this class, we will use NumPy, SciPy, Keras, and TensorFlow.  Anaconda (Austin, TX) is a company that offers a distribution that makes installing this large collection of libraries easier.  Anaconda and the appropriate libraries have been installed on the Windows machines in the Department lab (GMCS 425).  If you wish to install Anaconda on your own machine, please follow these instructions.  The instructions also show you how to run a program using Eclipse (preferred) or the Spyder IDE once Anaconda is installed.
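
Once an environment is set up, a quick check such as the following sketch may help confirm that the libraries named above can be found.  It is not part of the official installation instructions; the import names (numpy, scipy, keras, tensorflow) and the helper missing_libraries are assumptions made for illustration, and the check only tests whether each package can be located, not whether its version is suitable.

```python
import importlib.util

# Import names assumed for the libraries used in this class.
COURSE_LIBRARIES = ["numpy", "scipy", "keras", "tensorflow"]


def missing_libraries(names):
    """Return the names that cannot be located in the current environment.

    find_spec looks a top-level package up without importing it,
    so the check stays fast even for large libraries.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_libraries(COURSE_LIBRARIES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All course libraries were found.")
```

Run the script with the same Python interpreter that your IDE uses; if it reports a missing library, the interpreter is probably not the one provided by Anaconda.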

About the course:

You will master machine learning and signal processing skills.  We will apply this to recognizing speech and speaker identity, but many of the skills that you will acquire are useful in many contexts such as finance, bioinformatics, control systems, etc.

Upon successful completion of this class, students should be able to:

  • Understand feature extraction, including automatic discovery of features.
  • Understand human speech production and perception.
  • Apply machine learning techniques to a variety of problems, including those that require recognizing sequences.
  • Write a scientific paper.
  • Read and understand the speech technologies literature.

The prerequisites for this course are Computer Science 310, Mathematics 254, and Statistics 551A.  As many CS students will not have taken Statistics 551A or Mathematics 254 (linear algebra), these prerequisites will be waived for any student who is willing to spend a bit of time learning the statistics, the basics of which will be covered briefly in class.

Please see syllabus for detailed course policies.