CS 682 Speech Processing

CS 682 Speech Processing 2017-12-16T16:59:34+00:00

Graded assignments

Due dates for assignments that require work to be turned in are posted on the calendar.  Use the iCal address to add it to a personal calendar if you wish.


Ungraded assignments

  • U1 – Learn the basics of Python 3.5.  You may wish to start with a quick video overview if you have never used Python.  From there, use one of the suggested books in the Materials section, such as Martelli et al. (2017) or Reitz (2016), both of which are available online from the library.  The Materials section also has information on setting up your development environment on a home machine and using it in departmental labs.


Please note that all dates except the final exam are tentative.  My primary concern is that you master the material, so the schedule may be adjusted in either direction to optimize comprehension and scope of material.

Week of:

  1. Aug 29 – Introduction and elementary statistics, framing, RMS.
    Readings: statistics 3-3.9, framing (ch. 9 – 9.3.2, Jurafsky and Martin, 2009)
  2. Sept 5 – Pressure, Intensity, dB, Fourier transforms and spectrograms, features, and dimension reduction
    Readings: pressure, intensity, and spectrograms from Univ. of Rhode Island Graduate School of Oceanography and Marine Acoustics (2017); 9.3.3 of Jurafsky and Martin (2009); principal components analysis, 2-2.2.2 of Dillon and Goldstein (1984)
  3. Sept 12 – Machine learning concepts (5-5.4); we will also introduce elements from Gradients and optimization (4).  Thu Sept 14, no class
  4. Sept 19 – contd. (5.5 onwards)
  5. Sept 26 – Deep feedforward networks (6-6.5.3)
  6. Oct 3 – contd., Speech perception (2.4, Rabiner and Juang, 1993)
  7. Oct 10 – Oct 17 – contd., EXAM I – Thursday Oct 19
  8. Oct 24 – Regularization (7-7.1, 7.4-7.5, 7.8-7.13), Oct 26 no class
  9. Oct 31 – Regularization contd., Sequence modeling (10 except 10.6, 10.8, 10.9.2-3)
  10. Nov 7 – Sequence modeling contd.
  11. Nov 14 – K Means, GMMs, Practical Methodology (11) and dolphins
  12. Nov 21 – contd., Nov. 23 Thanksgiving – no class
  13. Nov 28 – Sequence modeling contd.
  14. Dec 5 – Optimization (8-8.3.2), Use of deep nets in automatic speech recognition (12.3), Language models (12.4-12.4.4)
  15. Dec 12 – End to end large vocabulary speech recognition (Amodei et al., 2016)

Last day of classes:  Thursday, December 14, 2017.

Final exam:  Tuesday, December 19th, 10:30 – 12:30 in our usual classroom.  No makeup exams will be given for students leaving town early without an excused absence.

Python quick intro:  https://www.youtube.com/watch?v=N4mEzFDjqtA

Unless otherwise specified, all readings are from Goodfellow et al. (2016) and list section numbers.  Remember that the text is available online in addition to print.


Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., and Zhu, Z. (2016). “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,” in Proceedings of The 33rd International Conference on Machine Learning, edited by M. F. Balcan and K. Q. Weinberger (PMLR, Proceedings of Machine Learning Research), pp. 173–182.

Dillon, W. R., and Goldstein, M. (1984). Multivariate analysis, methods and applications (John Wiley & Sons, New York), pp. xii, 587

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning (The MIT Press, Cambridge, Massachusetts), pp. xxii, 775

Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing (Pearson Prentice Hall, Upper Saddle River, NJ)

Rabiner, L. R., and Juang, B.-H. (1993). Fundamentals of speech recognition (Prentice-Hall, Englewood Cliffs, NJ)

Univ. of Rhode Island Graduate School of Oceanography and Marine Acoustics, Inc. (2017). “Discovery of Sound in the Sea,” Accessed August 1, 2017. http://dosits.org.


Since the introduction of deep learning into speech processing by Dahl et al. (2010) and Deng et al. (2010), the field has changed rapidly, and deep learning is now the dominant method used in speech processing applications.  As textbooks on deep learning are just starting to come out, we will be using a deep learning book supplemented with readings on speech.


  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning (The MIT Press, Cambridge, Massachusetts), pp. xxii, 775.  Available in print at the bookstore or freely available online.

Programming exercises will be implemented using Python 3.5 and TensorFlow.  While we will briefly introduce Python in class, we will not devote much time to how to program in Python, as computer science graduate students (and advanced undergraduates) should be able to pick up new languages fairly easily.  Optional books to help you with this are freely available from the SDSU Library through Safari Technical Books:

  • Martelli, A., Ravenscroft, A., and Holden, S. (2017). Python in a Nutshell, 3rd Edition (O’Reilly Media, Inc, Sebastopol, CA) or
  • Reitz, K. (2016). The Hitchhiker’s Guide to Python: Best Practices for Development (O’Reilly Media, Sebastopol, CA)

In addition, the python.org tutorial is quite good.

Programming environment 

Python is rapidly becoming one of the most popular languages for machine learning.  This is primarily due to a large number of scientific libraries such as NumPy and SciPy, coupled with popular machine learning libraries such as Theano, TensorFlow, and PyTorch, as well as higher-level interfaces such as Keras.  In this class, we will use NumPy, SciPy, Keras, and TensorFlow.  Anaconda (Austin, TX) is a company that offers a distribution that makes installing this large collection of libraries easier.  Anaconda and the appropriate libraries have been installed on the Windows machines in the Department lab (GMCS 425).  If you wish to install Anaconda on your own, please follow these instructions.  The instructions also show you how to run a program using the Spyder IDE once Anaconda is installed.
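If you set up your own machine, a quick check along the following lines may help confirm that the libraries are visible to Python.  This is only a sketch and not part of the official installation instructions: the helper names `missing_packages` and `python_at_least` are made up for illustration, and the import names in `COURSE_PACKAGES` are assumed to be the conventional ones for the libraries listed above.

```python
import importlib.util
import sys

# Libraries named in the course description. The import names below are
# assumptions; they are the conventional ones but are not given in the text.
COURSE_PACKAGES = ["numpy", "scipy", "keras", "tensorflow"]


def missing_packages(names):
    """Return the subset of names that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_at_least(major, minor):
    """True when the running interpreter is at least version major.minor."""
    return sys.version_info[:2] >= (major, minor)


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Python 3.5 or newer:", python_at_least(3, 5))
    missing = missing_packages(COURSE_PACKAGES)
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All course packages were found.")
```

The script only looks the packages up without importing them, so it runs quickly; it can be started from Spyder or from a command prompt.  A package reported as not found usually means it was installed into a different Python environment than the one running the script.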

About the course:

You will master machine learning and signal processing skills.  We will apply these to recognizing speech and speaker identity, but many of the skills that you will acquire are useful in other contexts such as finance, bioinformatics, and control systems.

Upon successful completion of this class, students should be able to:

  • Understand feature extraction including automatic discovery of features.
  • Understand human speech production and perception.
  • Apply machine learning techniques to a variety of problems including those that require recognizing sequences.
  • Write a scientific paper.
  • Understand readings in the speech technologies literature.

The prerequisites for this course are: Computer Science 310, Mathematics 254, and Statistics 551A.  As many CS students will not have taken Statistics 551A or Mathematics 254 (linear algebra), these prerequisites will be waived for any student who is willing to spend a bit of time learning the statistics, the basics of which will be covered briefly in class.

Please see the syllabus for detailed course policies.