(Computer) Vision for Intelligent Robotics, Fall 2016

Course number: Info I590 / CS B659

Meets: Tuesday/Thursday 4:00-5:15pm

Location: Info 107

Website: http://homes.soic.indiana.edu/classes/fall2016/csci/b659-mryoo/

Instructor: Prof. Michael S. Ryoo

Email: mryoo "at" indiana.edu

Office: Informatics E259

Office hours: by appointment (send email)

***The 11/3 class will be replaced by the HRI seminar in the Walnut Room, IMU***

Course description:

In this graduate seminar course, we will review and discuss state-of-the-art computer vision methodologies while also examining their applications to robots (i.e., robot perception). Specific topics will include object recognition, activity recognition, deep learning for both images and videos, and first-person vision for wearable devices and robots. The objective of the course is to understand important problems in computer vision and intelligent robotics, discuss the advantages and disadvantages of existing approaches, and identify open questions and future research directions.


Prerequisites:

Interest in computer vision; basic programming skills; and the ability to read and understand conference papers. This course will focus on video-based techniques and their robotics applications, extending the topics covered in other computer vision courses including B490/B659. Any previous experience in computer vision, machine learning, or robot vision will be a plus.

Please talk to me if you are unsure if the course is a good match for your background.

(tentative) Schedule:

Course introduction
- Research overview and general background

1. Object recognition and activity recognition

Image features, matching, and basic classification
- Topics: invariant local features, bag-of-visual-words, spatial pyramid, ...
- Presentations: Doosti [1]; Elli [2], Zaman [3]

Object detection/segmentation
- Topics: histograms of oriented gradients, deformable part models, graph-based segmentation, ...
- Presentations: Ryoo [4], Iyer [5]; Boggaram [7]

Action recognition from videos
- Topics: hidden Markov models (HMMs), space-time volumes, local XYT features, ...
- Presentations: Lee [12], Jiang [15]; Kathawate [17], Wu [47]

Hierarchical activity recognition
- Topics: multi-layer HMMs, stochastic context-free grammars, logic-based methods, ...
- Presentations: Khodadadi [21]; Ryoo [19], Varamesh [53]

2. Deep learning

Deep learning for images and objects
- Topics: convolutional neural networks (CNNs), CNN-based segmentation, ...
- Presentations: Devadiga [8]; Meda [10], Kotak [48]

Deep learning for videos and events
- Topics: CNNs for videos, recurrent neural networks (RNNs), ...
- Presentations: Zhang [24]; Shou [49], Maity [27]

More deep learning architectures/methods
- Topics: Siamese neural networks, attention filters, region proposals, LSTMs, ...
- Presentations: Naha [50]; Tosi [51], Schlegel [52]; Kotak [55], Spears [56]

3. Visual perception for robots

First-person object, action, and activity recognition
- Topics: object-detection-based first-person video understanding; ego-action recognition and video summarization; first-person interaction recognition
- Presentations: Wu [29], Tosi [30]; Naha [31], Devadiga [32], Boggaram [33]

Learning "actionable" activity representations
- Topics: robot "learning from imitation", syntactic approaches, ...
- Presentations: Zhang [35], Meda [36], Schlegel [37]; Lee [57, 58]

Social cues and affordances
- Topics: detecting human gaze orientations from first-person videos; action possibilities with objects and scenes
- Presentations: Shou [38], Elli [54]; Kathawate [46]

4. Understanding surrounding environments

3-D scene understanding
- Topics: estimating 3-D scene geometry from images
- Presentations: Zaman [39], Doosti [40], Varamesh [41]

No class - Thanksgiving

Object and activity recognition using contextual information
- Presentations: Iyer [42], Maity [43], Shahivand [44]

Final project presentations

Course requirements and grading:

Paper/experiment presentations (30%): each student is expected to give approximately two presentations over the semester. Each may be either (1) a paper presentation or (2) an experiment presentation (i.e., presenting the results obtained by testing the method's code on existing datasets).

Paper review and class participation (20%): students are required to choose one paper per class and submit a short review of it before that class.

Final project (50%): each student will choose an individual research topic and conduct research on it. This can be as simple as implementing several previous methods and comparing them, or as ambitious as proposing new concepts and algorithms, implementing them, and evaluating them on public datasets to advance the state of the art.


Reading list:

  1. D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints. IJCV 2004.
  2. J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
  3. S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.
  4. Y. Jia, C. Huang, and T. Darrell, Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features. CVPR 2012.
  5. N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection. CVPR 2005.
  6. P. Gehler and S. Nowozin, On Feature Combination for Multiclass Object Classification. ICCV 2009.
  7. P. Felzenszwalb, D. McAllester, and D. Ramanan, A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR 2008.
  8. A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
  9. C. Szegedy et al., Going Deeper with Convolutions. CVPR 2015.
  10. R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
  11. J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
  12. J. Yamato, J. Ohya, and K. Ishii, Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model. CVPR 1992.
  13. N. Oliver, B. Rosario, and A. Pentland, A Bayesian Computer Vision System for Modeling Human Interactions. T PAMI 2000.
  14. A. Bobick and J. Davis, The Recognition of Human Movement Using Temporal Templates. T PAMI 2001.
  15. M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as Space-Time Shapes. ICCV 2005.
  16. I. Laptev, On Space-Time Interest Points. IJCV 2005.
  17. P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features. VS-PETS 2005.
  18. I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning Realistic Human Actions from Movies. CVPR 2008.
  19. M. S. Ryoo and J. K. Aggarwal, Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. ICCV 2009.
  20. H. Wang and C. Schmid, Action Recognition with Improved Trajectories. ICCV 2013.
  21. Y. Ivanov and A. Bobick, Recognition of Visual Activities and Interactions by Stochastic Parsing. T PAMI 2000.
  22. J. M. Siskind, Grounding the Lexical Semantics of Verbs in Visual Perception Using Force Dynamics and Event Logic. JAIR 2001.
  23. M. S. Ryoo and J. K. Aggarwal, Stochastic Representation and Recognition of High-level Group Activities. IJCV 2011.
  24. D. Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv:1412.0767.
  25. A. Karpathy et al., Large-scale Video Classification with Convolutional Neural Networks. CVPR 2014.
  26. A. Graves, A. Mohamed, and G. Hinton, Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013.
  27. J. Ng et al., Beyond Short Snippets: Deep Networks for Video Classification. CVPR 2015.
  28. J. Donahue et al., Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
  29. A. Fathi, A. Farhadi, and J. M. Rehg, Understanding Egocentric Activities. ICCV 2011.
  30. H. Pirsiavash and D. Ramanan, Detecting Activities of Daily Living in First-Person Camera Views. CVPR 2012.
  31. Y. J. Lee, J. Ghosh, and K. Grauman, Discovering Important People and Objects for Egocentric Video Summarization. CVPR 2012.
  32. K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, Fast Unsupervised Ego-action Learning for First-Person Sports Videos. CVPR 2011.
  33. M. S. Ryoo and L. Matthies, First-Person Activity Recognition: What Are They Doing to Me? CVPR 2013.
  34. M. S. Ryoo et al., Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me? HRI 2015.
  35. P. Das et al., A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. CVPR 2013.
  36. K. Lee et al., A Syntactic Approach to Robot Imitation Learning Using Probabilistic Activity Grammars. RAS 2013.
  37. Y. Yang et al., Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web. AAAI 2015.
  38. A. Fathi, J. Hodgins, and J. Rehg, Social Interactions: A First-Person Perspective. CVPR 2012.
  39. A. Saxena, M. Sun, and A. Y. Ng, Make3D: Learning 3D Scene Structure from a Single Still Image. T PAMI 2009.
  40. A. Gupta, A. Efros, and M. Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics. ECCV 2010.
  41. D. Lin, S. Fidler, and R. Urtasun, Holistic Scene Understanding for 3D Object Detection with RGBD Cameras. ICCV 2013.
  42. D. Hoiem, A. A. Efros, and M. Hebert, Putting Objects in Perspective. CVPR 2006, IJCV 2008.
  43. Y. J. Lee and K. Grauman, Object-Graphs for Context-Aware Category Discovery. CVPR 2010.
  44. A. Gupta and L. Davis, Objects in Action: An Approach for Combining Action Understanding and Object Perception. CVPR 2007.
  45. M. Marszalek, I. Laptev, and C. Schmid, Actions in Context. CVPR 2009.
  46. H. S. Koppula and A. Saxena, Physically Grounded Spatio-Temporal Object Affordances. ECCV 2014.
  47. J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. IJCV 2008.
  48. J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
  49. K. Simonyan and A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos. NIPS 2014.
  50. S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
  51. S. Bell and K. Bala, Learning Visual Similarity for Product Design with Convolutional Neural Networks. SIGGRAPH 2015.
  52. K. Gregor, I. Danihelka, A. Graves, D. Jimenez Rezende, and D. Wierstra, DRAW: A Recurrent Neural Network for Image Generation. arXiv:1502.04623.
  53. T. Lan, L. Sigal, and G. Mori, Social Roles in Hierarchical Models for Human Activity Recognition. CVPR 2012.
  54. Y. Li, A. Fathi, and J. Rehg, Learning to Predict Gaze in Egocentric Video. ICCV 2013.
  55. M. Ibrahim et al., A Hierarchical Deep Temporal Model for Group Activity Recognition. CVPR 2016.
  56. S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
  57. S. Levine, C. Finn, T. Darrell, and P. Abbeel, End-to-End Training of Deep Visuomotor Policies. JMLR 2016.
  58. S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv:1603.02199.