APSIPA Distinguished Lecturers ( 1 January 2012 - 31 December 2013 )
Abeer Alwan, UCLA, USA

Abeer Alwan received her Ph.D. in EECS from MIT in 1992. Since then, she has been with the Electrical Engineering Department at UCLA as an Assistant Professor (1992-1996), Associate Professor (1996-2000), Professor (2000-present), Vice Chair of the BME program (1999-2001), Vice Chair of EE Graduate Affairs (2003-2006), and Area Director of Signals and Systems (2006-2010). She established and directs the Speech Processing and Auditory Perception Laboratory at UCLA (http://www.ee.ucla.edu/~spapl). Her research interests include modeling human speech production and perception mechanisms and applying these models to improve speech-processing applications such as noise-robust automatic speech recognition. She is the recipient of the NSF Research Initiation Award (1993), the NIH FIRST Career Development Award (1994), the UCLA-TRW Excellence in Teaching Award (1994), the NSF Career Development Award (1995), and the Okawa Foundation Award in Telecommunications (1997). Dr. Alwan is an elected member of Eta Kappa Nu, Sigma Xi, Tau Beta Pi, and the New York Academy of Sciences. She served, as an elected member, on the Acoustical Society of America Technical Committee on Speech Communication (1993-1999 and 2005-2008), on the IEEE Signal Processing Technical Committees on Audio and Electroacoustics (1996-2000), and on Speech Processing (1996-2001, 2005-2008, 2011-2013). She is a member of the Editorial Board of Speech Communication and was an Editor-in-Chief of that journal (2000-2003), was an Associate Editor (AE) of the IEEE Transactions on Speech, Audio, and Language Processing (2006-2009), and is an AE for the Journal of the Acoustical Society of America (JASA). Dr. Alwan is a Fellow of the IEEE, the Acoustical Society of America, and the International Speech Communication Association (ISCA). She was a 2006-2007 Fellow of the Radcliffe Institute for Advanced Study at Harvard University, and a Distinguished Lecturer for ISCA.


Lecture 1: Dealing with Noisy and Limited Data: A Hybrid Approach
This talk builds on Dr. Alwan's keynote speech at Interspeech 2008. It surveys the field and presents and compares state-of-the-art techniques. Areas of interest include noise-robust ASR and rapid speaker adaptation for both native and non-native speakers of English. The talk shows how linguistically motivated, auditorily inspired, and speech-production-based models can improve performance and lead to greater insights.

Lecture 2: Models of Speech Production and Perception and Applications in Speech and Audio Coding, TTS, and Hearing Aids
This technical talk discusses the potential value of signal processing algorithms that are based on models of how humans produce and perceive speech with a focus on models of speech perception in noise. It then surveys applications which have benefited tremendously from such models. Applications include speech and audio coding (e.g., CELP-based techniques which have benefited from simplified models of speech production, and MPEG which benefited from modeling aspects of auditory perception), text-to-speech synthesis, and hearing aids as well as cochlear implants.

Lecture 3: Production, Analysis, and Perception of Voice Quality

Voice quality is due in part to the patterns of vibration of a speaker's vocal folds inside the larynx. In some languages, different voice qualities can distinguish word meanings. This lecture covers our studies of voice quality, including production and perception experiments as well as acoustic measurements of voice contrasts in various languages.

Mrityunjoy Chakraborty, Indian Institute of Technology, India

Mrityunjoy Chakraborty obtained his Bachelor of Engineering from Jadavpur University, Calcutta, his Master of Technology from IIT Kanpur, and his Ph.D. from IIT Delhi. He joined IIT Kharagpur as a faculty member in 1994, where he currently holds the position of Professor in Electronics and Electrical Communication Engineering. Prof. Chakraborty's teaching and research interests lie in digital and adaptive signal processing, VLSI signal processing, linear algebra, and DSP applications in wireless communications. In these areas, he has supervised several graduate theses, carried out independent research, and produced several well-cited publications.

Prof. Chakraborty has been an Associate Editor of the IEEE Transactions on Circuits and Systems, Part I (2004-2007, 2010-2011, 2012) and Part II (2008-2009), apart from being an elected member of the DSP TC of the IEEE Circuits and Systems Society, a guest editor of a EURASIP JASP special issue, and a TPC member of ICC (2007-2011) and Globecom (2008-2011). Prof. Chakraborty is a co-founder of the Asia Pacific Signal and Information Processing Association (APSIPA), a member of the APSIPA Steering Committee, and the chair of the APSIPA TC on Signal and Information Processing Theory and Methods (SIPTM). He has also served as both General Chair and TPC Chair of the National Conference on Communications 2012.

Prof. Chakraborty is a fellow of the Indian National Academy of Engineering (INAE) and also a fellow of the IETE.


Lecture 1: An SPT Treatment to the Realization of the Sign-LMS Based Adaptive Filters
The "sum of powers of two" (SPT) is an effective format for representing the coefficients of a digital filter; it reduces each multiplication in the filtering process to just a few shift-and-add operations. The canonic SPT is a special sparse SPT representation that guarantees at least one zero between every two non-zero SPT digits. In adaptive filters, however, the coefficients are updated continuously over time, so conversion to canonic SPT form would be required at each time index, which is impractical and requires additional circuitry. Moreover, because the positions of the non-zero terms in the canonic SPT expression of each coefficient word change with time, multiplications involving the coefficients cannot be carried out via a fixed set of shift-and-add operations. This seminar addresses these issues in the context of an SPT-based realization of adaptive filters belonging to the sign-LMS family. First, it proposes a bit-serial adder that takes as input two numbers, one (the filter weights) in canonic SPT and the other (the data) in 2's complement form, and produces an output also in canonic SPT, which allows weight updating purely in the canonic SPT domain. It is also shown how the canonic SPT property of the input can be used to reduce the complexity of the proposed adder. For multiplication, the canonic SPT word for each coefficient is partitioned into non-overlapping digit pairs and the data word is multiplied by each pair separately. The fact that each pair can have at most one non-zero digit is exploited to further reduce the complexity of the multiplication.
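To make the canonic SPT property concrete, the sketch below converts an integer into a digit string over {-1, 0, +1} with no two adjacent non-zero digits (the representation is also known as canonic signed digit, CSD). This illustrates the representation itself, not the lecture's bit-serial hardware; the function name and the least-significant-first digit ordering are our own choices.

```python
def to_canonic_spt(x):
    """Convert an integer to canonic SPT (CSD) digits in {-1, 0, +1},
    least significant digit first.  The canonic property guarantees
    that no two adjacent digits are both non-zero."""
    digits = []
    while x != 0:
        if x % 2 == 0:
            digits.append(0)
            x //= 2
        else:
            # Pick +1 or -1 so that the remainder is divisible by 4,
            # forcing the next digit to be zero (the canonic property).
            d = 2 - (x % 4)          # +1 if x = 1 (mod 4), -1 if x = 3 (mod 4)
            digits.append(d)
            x = (x - d) // 2
    return digits

# Example: 7 = 8 - 1 needs only two non-zero digits in canonic SPT,
# versus three in plain binary (111).
```

A number with w bits has at most about w/2 non-zero canonic SPT digits, which is what keeps the per-coefficient multiplication cost down to a few shift-and-add operations.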

Lecture 2: Adaptive Identification of Sparse Systems - a Convex Combination Approach
In the context of system identification, the level of sparseness in the system impulse response can vary greatly depending on the time-varying nature of the system. When the response is strongly sparse, convergence of conventional approaches such as the least mean square (LMS) algorithm is poor. The recently proposed, compressive-sensing-based, sparsity-aware zero-attracting LMS (ZA-LMS) algorithm performs satisfactorily in strongly sparse environments, but it is shown to perform worse than conventional LMS when the sparseness of the impulse response decreases. In this lecture, we present an algorithm that works well in both sparse and non-sparse conditions and adapts dynamically to the level of sparseness, using a convex combination based approach. The proposed algorithm is supported by simulation results that show its robustness against variable sparsity.
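As a rough illustration of the idea (a sketch under our own assumptions, not the lecture's exact algorithm), the code below runs a plain LMS branch and a zero-attracting ZA-LMS branch in parallel and mixes their outputs with an adaptively learned convex weight; the step sizes and the sigmoid-parameterized mixing update are illustrative choices.

```python
import numpy as np

def convex_combo_identify(X, d, n_taps, mu=0.05, rho=1e-4, mu_a=1.0):
    """Sparse/non-sparse system identification via a convex combination
    of a plain LMS filter (w1) and a zero-attracting ZA-LMS filter (w2)."""
    w1 = np.zeros(n_taps)          # conventional LMS branch
    w2 = np.zeros(n_taps)          # sparsity-aware ZA-LMS branch
    a = 0.0                        # mixing state: lam = sigmoid(a)
    for x, dn in zip(X, d):
        lam = 1.0 / (1.0 + np.exp(-a))
        y1, y2 = w1 @ x, w2 @ x
        e = dn - (lam * y1 + (1.0 - lam) * y2)
        # Stochastic-gradient update of the mixing weight on e**2.
        a += mu_a * e * (y1 - y2) * lam * (1.0 - lam)
        # Independent branch updates.
        w1 += mu * (dn - y1) * x
        w2 += mu * (dn - y2) * x - rho * np.sign(w2)  # l1 zero attractor
    lam = 1.0 / (1.0 + np.exp(-a))
    return lam * w1 + (1.0 - lam) * w2
```

When the impulse response is strongly sparse, the zero attractor pulls the inactive taps of w2 toward zero and the mixer leans toward that branch; when the response densifies, the combination falls back toward plain LMS, which is the robustness-to-variable-sparsity behavior the abstract describes.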

Lecture 3: A Low Complexity Realization of the Sign-LMS Algorithm using a Constrained, Minimally Redundant, Radix-4 Arithmetic

The sign-LMS algorithm is a popular adaptive filter that requires only addition/subtraction, and no multiplication, in the weight update loop. To reduce the complexity of the multiplications that arise in the filtering part of the sign-LMS algorithm, this lecture presents a special radix-4 format for representing each filter coefficient. The chosen format guarantees sufficient sparsity, which in turn reduces the multiplicative complexity, as no partial product needs to be computed for a zero digit of the multiplicand. Care is taken, however, to ensure that the weight update process generates the updated weight in the same radix-4 format; this is achieved by developing an algorithm for adding a 2's complement number to a number given in the adopted radix-4 format.

Lecture 4: New Algorithms for Multiplication and Addition of CSD and 2's Complement Numbers
CSD (canonic signed digit) is a powerful sparse representation of digital data that helps reduce the complexity of multiplications in a digital filter by evaluating only those partial products that correspond to the non-zero digits of the CSD word. This talk will present new algorithms and architectures for adding and multiplying CSD data with 2's complement words, where the canonic property of the CSD data is used to reduce the implementation complexity effectively.
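As an illustration of why the sparsity pays off (a behavioral sketch, not the lecture's hardware architecture), multiplying by a CSD-coded value costs one shift plus one addition or subtraction per non-zero digit only:

```python
def csd_multiply(x, csd_digits):
    """Multiply a 2's complement integer x by the value encoded by
    csd_digits (digits in {-1, 0, +1}, least significant first),
    using only shifts and additions/subtractions."""
    acc = 0
    for i, d in enumerate(csd_digits):
        if d == 1:
            acc += x << i        # add the i-th shifted partial product
        elif d == -1:
            acc -= x << i        # subtract it for a -1 digit
    return acc

# 7 in CSD is [-1, 0, 0, 1], i.e. 8 - 1: two shift-and-add/subtract
# operations instead of three for plain binary 111.
```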

Lecture 5: Compressed Sensing and Sparse System Identification
Recent emergence of the topic of "Compressed Sensing" has generated a renewed dynamism in the area of sparse adaptive filters and sparse system identification. This talk will provide a review of the recent developments and trends in this area.

Lecture 6: Adaptive Estimation of Delay and Amplitude of Sinusoidal Signals
In this seminar, we present a new adaptive filter for estimating and tracking the delay and the relative amplitude of a sinusoid vis-a-vis a reference sinusoid of the same frequency. By careful choice of the sampling period, a two-tap FIR filter model is constructed for the delayed signal. The delay and the amplitude are estimated by identifying the FIR filter, for which a delay variable and an amplitude variable are updated in an LMS-like manner, deploying, however, separate step sizes. A convergence analysis proving convergence (in the mean) of the delay and amplitude updates to their respective true values will be discussed, and MATLAB-based simulation studies confirming the satisfactory estimation performance of the proposed algorithm will be presented.
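The two-tap idea can be sketched as follows: identify the two taps, then read the amplitude and phase delay off the filter's response at the tone frequency. This is an illustrative reconstruction under our own assumptions; it uses plain tap-domain LMS rather than the lecture's direct delay/amplitude updates with separate step sizes.

```python
import numpy as np

def lms_delay_amp(x, d, omega, mu=0.05):
    """Fit d[n] ~= w0*x[n] + w1*x[n-1] by LMS, where x is a reference
    sinusoid of frequency omega (rad/sample); then recover the relative
    amplitude and the delay (in samples) from the response at omega."""
    w = np.zeros(2)
    for n in range(1, len(x)):
        u = np.array([x[n], x[n - 1]])
        e = d[n] - w @ u
        w += mu * e * u
    H = w[0] + w[1] * np.exp(-1j * omega)   # filter response at the tone
    amp = np.abs(H)
    delay = -np.angle(H) / omega            # phase delay; valid for |omega*delay| < pi
    return amp, delay
```

For example, with x[n] = sin(n) and d[n] = 0.8 sin(n - 1.2), the routine recovers an amplitude near 0.8 and a delay near 1.2 samples; the phase-unwrapping caveat in the comment is why the sampling period must be chosen carefully.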

Lecture 7: APSIPA and its mission and vision
This talk will introduce APSIPA and its present activities, as well as its short- and long-term missions, to the audience.

Jen-Tzung Chien, National Chiao Tung University, Taiwan

Jen-Tzung Chien received his Ph.D. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1997. During 1997-2012, he was with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. Since 2012, he has been with the Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu, where he is currently a Professor. He held the Visiting Researcher positions at the Panasonic Technologies Inc., Santa Barbara, CA, the Tokyo Institute of Technology, Tokyo, Japan, the Georgia Institute of Technology, Atlanta, GA, the Microsoft Research Asia, Beijing, China, and the IBM T. J. Watson Research Center, Yorktown Heights, NY. His research interests include machine learning, speech recognition, blind source separation, face recognition, and information retrieval.

Dr. Chien is a senior member of the IEEE Signal Processing Society. He served as the associate editor of the IEEE Signal Processing Letters, in 2008-2011, and the tutorial speaker of the ICASSP, in 2012. He is appointed as the APSIPA Distinguished Lecturer for 2012-2013. He was a co-recipient of the Best Paper Award of the IEEE Automatic Speech Recognition and Understanding Workshop in 2011. He received the Young Investigator Award (Ta-You Wu Memorial Award) from the National Science Council (NSC), Taiwan, in 2003, the Research Award for Junior Research Investigators from Academia Sinica, Taiwan, in 2004, and the NSC Distinguished Research Awards, in 2006 and 2010.


Lecture 1: Machine Learning for Speech and Language Processing
In this lecture, I will present a series of machine learning approaches to various applications relevant to speech and language processing, including acoustic modelling, language modelling, speech recognition, blind source separation, document summarization, information retrieval, and natural language understanding. In general, speech and language processing involves extensive knowledge of statistical models which are learnt from observation data. In the real world, however, observation data are inevitably acquired from heterogeneous environments in the presence of mislabeled, misaligned, mismatched, and ill-posed conditions. The estimated models suffer from large complexity, ambiguity, and uncertainty. Model regularization becomes a crucial issue when constructing speech and text models for different information systems. In statistical machine learning, uncertainty modeling and sparse coding algorithms provide attractive and effective solutions to model regularization. This lecture will address several recent works on Bayesian and sparse learning. In particular, I will present Bayesian sensing hidden Markov models and Dirichlet class language models for speech recognition, online Gaussian processes for blind source separation, unsupervised structural learning for text representation, and Bayesian nonparametrics for document summarization. In these works, robust models are established against improper model assumptions, over-determined model complexity, ambient noise interference, and nonstationary environment variations. Finally, I will point out some potential topics in machine learning for speech and language processing.

Lecture 2: Independent Component Analysis and Unsupervised Learning
Independent component analysis (ICA) is popular not only for blind source separation (BSS) but also for unsupervised learning of the salient features underlying mixed observations. In speech signals, these features may represent a specific speaker, gender, accent, noise, or environment, and can act as basis functions spanning the vector space of human voices under different conditions. In this lecture, I will present recent works on ICA and BSS and their applications in audio signal separation and speech recognition. These works include independent voices for speaker adaptation, information-theoretic learning based on convex ICA, and nonstationary source separation via online Gaussian processes. Several machine learning algorithms are developed to deal with the issues of model selection, model optimization, model variations, nonstationary processes, online learning, nonparametric modelling, etc. Further research on unsupervised learning and structural learning based on topic modelling will also be addressed.

Li Deng, Microsoft Research, USA

Dr. Li Deng received his Ph.D. from the University of Wisconsin-Madison. He was an Assistant (1989-1992), Associate (1992-1996), and Full Professor (1996-1999) at the University of Waterloo, Ontario, Canada. He then joined Microsoft Research, Redmond, where he is currently a Principal Researcher and where he received Microsoft Research Technology Transfer, Goldstar, and Achievement Awards. Prior to MSR, he also worked or taught at the Massachusetts Institute of Technology, ATR Interpreting Telecom. Research Lab. (Kyoto, Japan), and HKUST. He has published over 300 refereed papers in leading journals/conferences and 3 books covering broad areas of human language technology, machine learning, and audio, speech, and signal processing. He is a Fellow of the Acoustical Society of America, a Fellow of the IEEE, and a Fellow of the International Speech Communication Association. He is an inventor or co-inventor of over 50 granted US, Japanese, or international patents. He served on the Board of Governors of the IEEE Signal Processing Society (2008-2010). More recently, he served as Editor-in-Chief of the IEEE Signal Processing Magazine (2009-2011), which, according to the Thomson Reuters Journal Citation Reports released in 2010 and 2011, ranked first in both years among all 127 IEEE publications and all 247 publications within the Electrical and Electronics Engineering category worldwide in terms of impact factor, and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as Editor-in-Chief of the IEEE Transactions on Audio, Speech and Language Processing. His recent tutorials on deep learning at APSIPA (Oct 2011) and at ICASSP (March 2012) received the highest attendance rates at both conferences.


Lecture 1: Being Deep and Being Dynamic - New-Generation Models and Methodology for Advancing Speech Technology
Semantic information embedded in the speech signal --- not only the phonetic/linguistic content but also a full range of paralinguistic information including speaker characteristics --- manifests itself in a dynamic process rooted in the deep linguistic hierarchy as an intrinsic part of the human cognitive system. Modeling both the dynamic process and the deep structure to advance speech technology has been an active pursuit for more than 20 years, but it was not until recently (only a few years ago) that a noticeable breakthrough was achieved by the new methodology commonly referred to as "deep learning". The Deep Belief Net (DBN) has recently been used to replace the Gaussian Mixture Model (GMM) component in HMM-based speech recognition, producing dramatic error rate reductions in both phone recognition and large-vocabulary speech recognition while keeping the HMM component intact. On the other hand, the (constrained) Dynamic Bayesian Net (referred to as DBN* here) has been developed over many years to improve dynamic models of speech while overcoming the IID assumption, a key weakness of the HMM, with a set of techniques and representations commonly known as hidden dynamic/trajectory models or articulatory-like models. A history of these two largely separate lines of "DBN/DBN*" research will be critically reviewed and analyzed in the context of modeling the deep and dynamic linguistic hierarchy for advancing speech (as well as speaker) recognition technology. Future directions will be discussed for this exciting area of research, which holds promise to build a foundation for next-generation speech technology with human-like cognitive ability.

Lecture 2: Feature-Domain, Model-Domain, and Hybrid Approaches to Noise-Robust Speech Recognition
Noise robustness has long been an active area of research that captures significant interest from speech recognition researchers and developers. In this lecture, we use the Bayesian framework as a common thread to connect, analyze, and categorize a number of popular approaches to noise robust speech recognition pursued in the recent past. The topics covered in this lecture include: 1) Bayesian decision rules with unreliable features and unreliable model parameters; 2) Principled ways of computing feature uncertainty using structured speech distortion models; 3) Use of phase factor in an advanced speech distortion model for feature compensation; 4) A novel perspective on model compensation as a special implementation of the general Bayesian predictive classification rule capitalizing on model parameter uncertainty; 5) Taxonomy of noise compensation techniques using two distinct axes: feature vs. model domain and structured vs. unstructured transformation; and 6) Noise adaptive training as a hybrid feature-model compensation framework and its various forms of extension.

Lecture 3: Machine Learning Paradigms for Speech Recognition
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning techniques, including the ubiquitously used hidden Markov model, discriminative learning, Bayesian learning, and adaptive learning. Moreover, machine learning can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently temporal nature of speech. On the other hand, even though ASR is available commercially for some applications, it is in general a largely unsolved problem - for many applications, the performance of ASR is not yet on par with human performance. New insight from modern machine learning methodology shows great promise to advance the state-of-the-art in ASR technology performance.
This lecture provides the audience with an overview of modern machine learning techniques as utilized in current ASR research and systems. The intent of the lecture is to foster further cross-pollination between the machine learning and speech recognition communities than has occurred in the past. The lecture is organized according to the major machine learning paradigms that are either already popular in, or have potential for making significant contributions to, ASR technology. The paradigms presented and elaborated in this lecture include generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; and adaptive and multitask learning. These learning paradigms are motivated and discussed in the context of ASR applications. I will finally present and analyse recent developments in deep learning, sparse representations, and combinatorial optimization, focusing on their direct relevance to advancing ASR technology.

Hsueh-Ming Hang, National Chiao-Tung University, Taiwan

Hsueh-Ming Hang received the B.S. and M.S. degrees in control engineering and electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1978 and 1980, respectively, and Ph.D. in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1984.

From 1984 to 1991, he was with AT&T Bell Laboratories, Holmdel, NJ, and then he joined the Electronics Engineering Department of National Chiao Tung University (NCTU), Hsinchu, Taiwan, in December 1991. From 2006 to 2009, he took a leave from NCTU and was appointed as Dean of the EECS College at National Taipei University of Technology (NTUT). He is currently a Distinguished Professor of the EE Dept at NCTU and an associate dean of the ECE College, NCTU. He has been actively involved in the international MPEG standards since 1984 and his current research interests include multimedia compression, image/signal processing algorithms and architectures, and multimedia communication systems.

Dr. Hang holds 13 patents (Taiwan, US and Japan) and has published over 180 technical papers related to image compression, signal processing, and video codec architecture. He was an associate editor (AE) of the IEEE Transactions on Image Processing (TIP, 1992-1994), the IEEE Transactions on Circuits and Systems for Video Technology (1997-1999), and currently an AE of the IEEE TIP again. He is co-editor and contributor of the Handbook of Visual Communications published by Academic Press. He is a recipient of the IEEE Third Millennium Medal and is a Fellow of IEEE and IET and a member of Sigma Xi.


Lecture 1: Technology and Trends in Multi-camera Virtual-view Systems
3D video products have been growing rapidly in recent years. Going one step further, virtual-viewpoint (or free-viewpoint) video has become a research focus. It is also an on-going standardization item of the international MPEG committee. Its aim is to define an efficient data representation for multi-view (virtual-view) synthesis at the receiver, which can be a multi-view autostereoscopic display. The December 2011 3DVC contest results indicate that such a system is feasible. In practice, a densely arranged camera array is used to acquire input images, and a virtual view is synthesized using the depth-image-based rendering (DIBR) technique. Two essential tools are needed for a virtual-view synthesis system: depth estimation and view synthesis. We will summarize the recent progress and future trends on this subject. Some of our own work is included in this report.

Lecture 2: What's Next on Video Coding Technologies and Standards?
Following the profound success of the H.264/AVC video coding standard, defined in 2002, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) have in the past few years been actively searching for new or improved technologies that can achieve even higher compression efficiency. After several years of effort, in January 2010, VCEG and MPEG formed a joint team and issued a call for proposals for "High Efficiency Video Coding (HEVC)". This standardization item has attracted a lot of attention and has progressed very well over the past 14 months. We thus expect a "new" video standard to be specified in 2012. In addition, 3D video products have been growing rapidly in recent years. Going one step further, free-viewpoint video has become the next MPEG standardization item. Its aim is to define an efficient data representation for multi-view (free-view) synthesis at the receiver, which can be a multi-view auto-stereoscopic display.

Kyoung Mu Lee, Seoul National University, Korea

Kyoung Mu Lee received the B.S. and M.S. degrees in Control and Instrumentation Engineering from Seoul National University (SNU), Seoul, Korea, in 1984 and 1986, respectively, and the Ph.D. degree in Electrical Engineering from the University of Southern California (USC), Los Angeles, California, in 1993. He was awarded the Korean Government Overseas Scholarship during his Ph.D. studies. From 1993 to 1994, he was a research associate at the Signal and Image Processing Institute (SIPI) at USC. He was with Samsung Electronics Co. Ltd. in Korea as a senior researcher from 1994 to 1995. In August 1995, he joined the Department of Electronics and Electrical Engineering of Hong-Ik University, where he worked as an assistant and then associate professor. Since September 2003, he has been with the Department of Electrical Engineering and Computer Science at Seoul National University as a professor, where he leads the Computer Vision Laboratory. His primary research is focused on statistical methods in computer vision that can be applied to various applications including object recognition, segmentation, tracking, and 3D reconstruction. Prof. Lee has received several awards, in particular, the Most Influential Paper over the Decade Award from IAPR Machine Vision Applications in 2009, the ACCV Honorable Mention Award in 2007, the Okawa Foundation Research Grant Award in 2006, and the Outstanding Research Award from the College of Engineering of SNU in 2010. He served as an Editorial Board member of the EURASIP Journal on Applied Signal Processing, and is an associate editor of the Machine Vision and Applications journal, the IPSJ Transactions on Computer Vision and Applications, the Journal of Information Hiding and Multimedia Signal Processing, and IEEE Signal Processing Letters. He has (co)authored more than 120 publications in refereed journals and conferences including PAMI, IJCV, CVPR, ICCV, and ECCV.


Lecture 1: Statistical Sampling Approaches for Visual Tracking
Object tracking is one of the important and fundamental problems in computer vision. It is challenging to track a target in real-world environments, where different types of variations, such as illumination, shape, occlusion, or motion changes, occur at the same time. Recently, several attempts have been made to solve the problem, but the results are still far from satisfactory due to their inherently limited modeling. In this talk, to cope with this challenging problem, we present a novel approach based on a statistical sampling framework. The underlying philosophy of our approach is that multiple trackers can be constructed and integrated efficiently in a probabilistic way. With a sampling method, the trackers themselves are sampled, as well as the states of the targets. The trackers are adapted or newly constructed depending on the current situation, so that each specific tracker takes charge of a certain change of the object. They are efficiently sampled using the Markov Chain Monte Carlo method according to the appearance models, motion models, state representation types, and observation types. All trackers are then integrated into one compound tracker through an interactive Markov Chain Monte Carlo (IMCMC) method, in which the basic trackers communicate with one another interactively while running in parallel. Experimental results show that the proposed method tracks objects accurately and reliably in challenging realistic videos, and outperforms state-of-the-art tracking methods.

Lecture 2: Graph matching via Random Walks
Establishing feature correspondences between images lies at the heart of many computer vision problems, and a myriad of feature matching algorithms have been proposed for a wide range of applications such as object recognition, image retrieval, and image registration. However, robust matching under non-rigid deformation and clutter is still a challenging open problem. Furthermore, most conventional methods require supervised settings or restrictive assumptions, such as a reference image without severe clutter, a clean model of the target object, or one-to-one object matching between two images. In this talk, a graph-theoretic approach to robust feature matching will be introduced. Based on a random-walk view of graph matching, image matching under non-rigid deformation and severe clutter is addressed, and is also extended to higher-order image matching with high-level visual cues. Combining the method with a novel graph-based mode-seeking in a progressive framework, the proposed algorithms effectively solve the interconnected problems of robust feature matching, object discovery, and outlier elimination.

Weisi Lin, Nanyang Technological University, Singapore

Weisi Lin graduated from Zhongshan University, China, with B.Sc. and M.Sc. degrees, and from King's College, London University, UK, with a Ph.D. He has carried out research at Zhongshan University, Shantou University (China), Bath University (UK), the National University of Singapore, the Institute of Microelectronics (Singapore), and the Institute for Infocomm Research (Singapore). He served as Lab Head and Acting Department Manager at the Institute for Infocomm Research. He is now an associate professor in the School of Computer Engineering, Nanyang Technological University, Singapore. He is a Chartered Engineer, a senior member of the IEEE, a fellow of the IET, and an Honorary Fellow of the Singapore Institute of Engineering Technologists.

His areas of expertise include perception-inspired signal modeling, perceptual multimedia quality evaluation, video compression, and image processing and analysis; in these areas he has published 190+ refereed journal and conference papers, and been the Principal Investigator of more than 10 major projects (with both academic and industrial funding of over S$4m). He currently serves as the Associate Editor for IEEE Trans on Multimedia, IEEE Signal Processing Letters and Journal of Visual Communication and Image Representation.

He co-chairs the IEEE MMTC interest group on Quality of Experience (QoE), and has organized special sessions in IEEE ICME06, IEEE IMAP07, PCM09, SPIE VCIP10, IEEE ISCAS 10, APSIPA11, MobiMedia 11 and IEEE ICME 12. He gave invited/panelist/keynote/tutorial speeches in VPQM06, SPIE VCIP10, IEEE ICCCN07, PCM07, PCM09, IEEE ISCAS08, IEEE ICME09, APSIPA10, IEEE ICIP10, and IEEE MMTC QoEIG (2011). He is the Lead Guest Editor for the recent special issue on New Subjective and Objective Methodologies for Audio and Visual Signal Processing in IEEE Journal of Selected Topics in Signal Processing. He also maintains long-term partnership with a number of companies that are keen on perception-driven technology for audiovisual signal processing.


Lecture 1: Recent Development in Perceptual Visual Quality Evaluation
Quality (distortion) evaluation of images and video is useful in many applications and is crucial in shaping almost all visual processing algorithms and systems, as well as their implementation, optimization, and testing. Since the human visual system (HVS) is the final receiver and judge of most processed images and videos (whether naturally captured or computer generated), it is beneficial to use a perceptual quality criterion in system design and optimization, rather than a traditional one (e.g., MSE, SNR, PSNR, QoS). Through evolution, the HVS has developed unique characteristics, and significant research effort has gone into modeling its picture quality evaluation mechanism and applying the resulting models to various situations. In this lecture, we will first introduce the major problems associated with perceptual visual quality metrics (PVQMs), which aim to be in line with HVS perception, and the major research and development work so far in the related fields. The two major modules of most current systems, feature detection and feature pooling, will then be highlighted and explored, based on the presenter's substantial project experience. The lecture aims to provide an up-to-date overview and classification of perceptual quality gauging for images and videos, and to compare and comment on current research activities in light of the presenter's understanding and experience in these areas.
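To make the MSE/PSNR-versus-perceptual contrast above concrete, here is a minimal sketch (not taken from the lecture) comparing PSNR with a deliberately simplified single-window SSIM. The distortions, constants, and the `global_ssim` simplification are illustrative assumptions; a production PVQM would use local windows and far richer HVS modeling.

```python
import numpy as np

def mse(ref, dist):
    """Mean squared error between two images (arrays on a 0-255 scale)."""
    return np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    e = mse(ref, dist)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)

def global_ssim(ref, dist, peak=255.0):
    """A simplified, whole-image SSIM (after Wang et al.); real metrics
    average a local version over sliding windows."""
    x, y = ref.astype(np.float64), dist.astype(np.float64)
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Two distortions engineered to have (nearly) identical MSE:
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64)).astype(np.float64)
noisy = img + rng.normal(0, 10, img.shape)   # random noise, MSE ~ 100
shifted = img + 10.0                         # uniform brightness shift, MSE = 100
```

PSNR rates the two distortions almost identically, while the structural term of SSIM scores the brightness shift higher than the structure-destroying noise, illustrating why MSE-family criteria correlate poorly with perception.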

Lecture 2: Human-vision Friendly Processing for Images and Graphics
Enabling machines to perceive as human vision does can yield resource savings (for instance, in bandwidth, memory, and computing power) and performance enhancement (such as better resultant visual quality and new functionalities), for both naturally captured images and computer-generated graphics. Significant research effort has been devoted over the past decade to modeling the human vision mechanism and applying the resulting models to various situations (image and video compression, watermarking, channel coding, signal restoration and enhancement, computer graphics and animation, visual content retrieval, etc.). The characteristics of the human visual system can be turned into advantages in system design and optimization. In this talk, we will first introduce the major problems, difficulties, and research efforts so far in the related fields. The basic engineering models (such as signal decomposition, visual attention, eye movement, visibility threshold determination, and common artefact detection) are then discussed. Afterward, different perceptually driven techniques and applications will be presented for visual signal compression, enhancement, communication, and rendering, with appropriate case studies. The last part of the lecture is devoted to a summary, points for further discussion, and possible future research directions.

Helen Meng, The Chinese University of Hong Kong, Hong Kong SAR, China

Helen Meng received the S.B., S.M., and Ph.D. degrees, all in electrical engineering, from the Massachusetts Institute of Technology, Cambridge. She joined The Chinese University of Hong Kong in 1998, where she is currently Professor in the Department of Systems Engineering and Engineering Management. In 1999, she established the Human-Computer Communications Laboratory and serves as its Director. In 2005, she established the Microsoft-CUHK Joint Laboratory for Human-Centric Computing and Interface Technologies and serves as Co-Director; this laboratory was conferred the national status of Ministry of Education of China (MoE) Key Laboratory in 2008. Helen also served as Associate Dean (Research) of the Faculty of Engineering from 2006 to 2010. She received the MoE Higher Education Outstanding Scientific Research Output Award in Technological Advancements for the area of "Multimodal User Interactions with Multilingual Speech and Language Technologies" in 2009. In previous years, she also received the Exemplary Teaching Award, a Service Award for establishing the worldwide engineering undergraduate exchange program, and the Young Researcher's Award from the CUHK Faculty of Engineering. In 2010, her co-authored paper received the Best Oral Paper Award at the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Her research interests are in human-computer interaction via multimodal and multilingual spoken language systems, computer-aided language learning systems, and translingual speech retrieval technologies. She served as Editor-in-Chief of the IEEE Transactions on Audio, Speech and Language Processing between 2009 and 2011, and has been an elected Board Member of the International Speech Communication Association since 2007. Helen has served on the IEEE Speech Technical Committee for two terms and on the program committees of Interspeech for multiple years.
She will serve as General Chair of ISCA SIG-CSLP's flagship conference, the International Symposium on Chinese Spoken Language Processing (ISCSLP), in 2012, and as Technical Program Committee Chair of Interspeech 2014.


Lecture 1: Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English - the CUHK Experience
This talk presents an ongoing research initiative in the development of speech technologies that strives to raise the efficacy of computer-aided pronunciation training, especially for Chinese learners of English. Our approach is grounded in the theory of language transfer and involves a systematic phonological comparison between the primary language (L1, Chinese) and the secondary language (L2, English) to predict the segmental and suprasegmental realizations that constitute mispronunciations in L2 English. The predictions are validated against a specially designed corpus of several hundred hours of L2 English speech. The speech data supports the development of automatic speech recognition technologies that can detect and diagnose mispronunciations. The diagnoses, in turn, inform the design of pedagogical and remedial instructions, which use text-to-speech synthesis technologies to generate corrective feedback in audiovisual form. This talk offers an overview of the technologies, related experimental results, ongoing work, and future plans.
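The transfer-prediction idea above can be sketched in a few lines. The substitution rules, phone labels, and helper names below are hypothetical illustrations of the general scheme (predict likely L1-influenced substitutions, then compare the recognizer's output against the canonical pronunciation), not the CUHK system itself.

```python
# Hypothetical L1-transfer substitution rules for Chinese learners of English:
# each canonical phone maps to plausible mispronounced realizations.
transfer_rules = {"th": ["s", "f"], "v": ["w"], "r": ["l"]}

def expand_pronunciations(canonical):
    """Generate candidate mispronunciations of a canonical phone sequence
    by applying one transfer rule at a time (single-substitution variants)."""
    variants = []
    for i, phone in enumerate(canonical):
        for sub in transfer_rules.get(phone, []):
            variants.append(canonical[:i] + [sub] + canonical[i + 1:])
    return variants

def diagnose(canonical, recognized):
    """Report (expected, observed, position) wherever the recognizer's
    best path deviates from the canonical pronunciation."""
    return [(c, r, i)
            for i, (c, r) in enumerate(zip(canonical, recognized)) if c != r]

canonical = ["th", "ih", "ng", "k"]   # "think"
recognized = ["s", "ih", "ng", "k"]   # a typical /th/ -> /s/ substitution
errors = diagnose(canonical, recognized)
```

In a real system the variants would be compiled into the recognizer's search network so that the decoder itself selects the best-matching pronunciation; the diagnosis then drives the corrective-feedback generation described above.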

Lecture 2: Multimodal Processing in Speech-based Interactions
Speech constitutes the primary form of human communication, and research in automatic speech processing has largely focused on the audio modality. However, human communication is inherently multimodal, involving not only speech but also expressions, gaze, gestures, posture, movement, and position. Much can be gained in naturalness, performance, and robustness in human-computer interaction by mimicking the human capacity to jointly process information available in multiple modalities. Multimodality in speech is a vastly interdisciplinary research area. This talk presents an overview of related activities, including audiovisual speech recognition and synthesis that incorporates information about the speaker's facial and lip motions, bimodal interfaces that support speech and pen-gesture inputs for mobile computing, multi-biometric user authentication that combines voiceprints, fingerprints, and face recognition, and co-processing of audio and visual information from multiple speakers in social signal processing. We will also present methods for multimodal fusion, i.e., the critical process of integrating information across modalities. The talk concludes with a set of challenges for multimodal speech processing and suggests possible directions for future work.
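As a concrete (and deliberately simple) illustration of score-level fusion, the sketch below combines per-modality verification scores with fixed weights. The scores, weights, and acceptance threshold are hypothetical; real systems learn the fusion function and often normalize scores per modality first.

```python
import numpy as np

def late_fusion(scores, weights):
    """Score-level (late) fusion: weighted sum of per-modality match scores.
    `scores` maps modality name -> score in [0, 1]; weights should sum to 1."""
    w = np.array([weights[m] for m in scores])
    s = np.array(list(scores.values()))
    return float(np.dot(w, s))

# Hypothetical verification scores for one claimed identity from three matchers
scores = {"voice": 0.62, "fingerprint": 0.91, "face": 0.55}
weights = {"voice": 0.3, "fingerprint": 0.5, "face": 0.2}

fused = late_fusion(scores, weights)
accept = fused >= 0.6   # hypothetical decision threshold
```

The attraction of late fusion is that each matcher stays independent; the alternative, early (feature-level) fusion, concatenates modality features before classification and can exploit cross-modal correlations at the cost of a harder joint model.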

Lecture 3: Modeling the Expressivity of Textual Semantics for Text-to-Audiovisual Speech Synthesis in Avatar Animation
This talk describes expressive text-to-speech synthesis techniques for a Chinese spoken dialog system, where the expressivity is driven by the message content. We adapt the three-dimensional pleasure-displeasure, arousal-nonarousal, and dominance-submissiveness (PAD) model to describe expressivity in the semantics of the input text. The context of our study is the response messages generated by a spoken dialog system in the tourist information domain. We use the P (pleasure) and A (arousal) dimensions to describe expressivity at the prosodic word level based on lexical semantics; the D (dominance) dimension describes expressivity at the utterance level based on dialog acts. We analyze contrastive (neutral versus expressive) speech recordings to develop a nonlinear perturbation model that incorporates the PAD values of a response message to transform neutral speech into expressive speech. Two levels of perturbation are implemented: local perturbation at the prosodic word level and global perturbation at the utterance level. Perceptual experiments indicate that the proposed approach can significantly enhance expressivity in response generation for a spoken dialog system. We also demonstrate that the approach generalizes to visual speech prosody, including head motions and facial expressions.
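To illustrate the general flavor of a PAD-driven prosody perturbation (not the actual model from the talk), the following sketch scales a neutral F0 contour by hypothetical pleasure and arousal values: arousal widens the pitch range, pleasure raises the mean. The functional form and the coefficients `alpha` and `beta` are assumptions for illustration only.

```python
import numpy as np

def perturb_f0(f0, pleasure, arousal, alpha=0.15, beta=0.25):
    """Hypothetical nonlinear perturbation of a neutral F0 contour (Hz) by
    P (pleasure) and A (arousal) values in [-1, 1]. Arousal expands the
    contour around its mean; pleasure shifts the mean upward or downward."""
    f0 = np.asarray(f0, dtype=float)
    mean = f0.mean()
    scale = np.exp(beta * arousal)      # arousal > 0 widens the pitch range
    shift = alpha * pleasure * mean     # pleasure > 0 raises the mean pitch
    return mean + (f0 - mean) * scale + shift

neutral = np.array([180.0, 200.0, 220.0, 210.0, 190.0])  # neutral contour, Hz
excited = perturb_f0(neutral, pleasure=0.8, arousal=0.9)  # "happy/excited" speech
```

In the system described above, such perturbations are applied at two levels, locally per prosodic word (from lexical P and A) and globally per utterance (from the dialog act's D), and analogous transformations drive head motion and facial expression for the avatar.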

Xiaokang Yang, Shanghai Jiao Tong University, China

Xiaokang Yang received the B.S. degree from Xiamen University, Xiamen, China, in 1994, the M.S. degree from the Chinese Academy of Sciences, Shanghai, China, in 1997, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2000.

He is currently a Professor and the Deputy Director of the Institute of Image Communication and Information Processing, Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China. From August 2007 to July 2008, he visited the Institute for Computer Science, University of Freiburg, Germany, as an Alexander von Humboldt Research Fellow. From September 2000 to March 2002, he was a Research Fellow in the Centre for Signal Processing, Nanyang Technological University, Singapore, and from April 2002 to October 2004, a Research Scientist at the Institute for Infocomm Research (I2R), Singapore. He has published over 150 refereed papers and has filed 35 patents. His current research interests include visual processing and communication, media analysis and retrieval, and pattern recognition.

He received the National Science Fund for Distinguished Young Scholars in 2010, the Professorship Award of Shanghai Special Appointment (Eastern Scholar) in 2008, the Microsoft Young Professorship Award in 2006, the Best Young Investigator Paper Award at the IS&T/SPIE International Conference on Video Communication and Image Processing (VCIP 2003), and awards from the A*STAR and Tan Kah Kee foundations. He is currently a member of the Editorial Boards of IEEE Signal Processing Letters and Digital Signal Processing (Elsevier), a member of APSIPA, a Senior Member of the IEEE, a member of the Design and Implementation of Signal Processing Systems (DISPS) Technical Committee of the IEEE Signal Processing Society, and a member of the Visual Signal Processing and Communications (VSPC) Technical Committee of the IEEE Circuits and Systems Society. He was the special session chair for Perceptual Visual Processing at IEEE ICME 2006, the technical program co-chair of IEEE SiPS 2007, and the technical program co-chair of the 3DTV workshop held in conjunction with the 2010 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting.


Lecture 1: Visual Quality Assessment Incorporating Knowledge from Physiology, Psychophysics and Neuroscience
Perceptual quality assessment is an important research topic in visual signal processing, both in its own right and for its utility in designing optimal processing and coding algorithms. With the rapid advances of vision-related research across physiology and psychology in recent decades, it is beneficial to incorporate biologically and neurologically inspired theories and models into the study of visual signal processing. In this talk, we will first review some classic biological and psychological models for image and video quality assessment. We will then introduce some newly developed neurological theories of human perception, especially the free energy principle, and show how it can be adapted to the task of image quality assessment. The performance of image quality metrics based on these biological, psychovisual, and neurological models will be briefly analyzed and discussed. The talk emphasizes the importance, effectiveness, and necessity of incorporating knowledge from physiology, psychophysics, and neuroscience into visual signal processing.
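The free-energy intuition, that quality relates to how much of an image the brain's internal generative model can explain, can be caricatured with a toy predictor. The sketch below fits a linear model of each pixel from its left and upper neighbors and uses the residual variance as the "surprise" the model cannot account for. This is an illustrative assumption, far simpler than the models discussed in the lecture.

```python
import numpy as np

def prediction_surprise(img):
    """Toy free-energy-style measure: fit a linear predictor of each pixel
    from its left and upper neighbors (a crude internal generative model)
    and return the residual variance, i.e., what the model cannot explain."""
    x = np.asarray(img, dtype=np.float64)
    center = x[1:, 1:].ravel()
    left = x[1:, :-1].ravel()
    up = x[:-1, 1:].ravel()
    A = np.column_stack([left, up, np.ones_like(left)])
    coef, *_ = np.linalg.lstsq(A, center, rcond=None)   # least-squares fit
    residual = center - A @ coef
    return residual.var()

rng = np.random.default_rng(1)
smooth = np.add.outer(np.arange(64.0), np.arange(64.0))  # perfectly predictable ramp
noisy = smooth + rng.normal(0, 5, smooth.shape)          # much harder to predict
```

A predictable image yields near-zero surprise while a noise-corrupted one does not; full free-energy metrics refine this idea with better generative models and relate the surprise to perceived quality degradation.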

Lecture 2: Smart Video Surveillance Systems in the Context of the Internet of Things
Video surveillance networks are increasingly deployed in public and private facilities, with tremendous potential value for public safety. It is not feasible to monitor thousands (even millions) of video sources manually, so huge volumes of surveillance imagery are often simply directed to mass-storage devices, to be examined only forensically. Online automatic video analysis represents a new trend toward smart video surveillance networks. In this talk, we will first give an overview of the challenges facing large-scale smart video surveillance networks and present a new paradigm for smart video surveillance in the context of the IoT. We will then review the enabling techniques, including multimodal video processors for deeper sensing, high-performance video coding and transmission schemes for ubiquitous connectivity, and video analysis and retrieval techniques for intelligent services.

Thomas Fang Zheng, Tsinghua University, China

Dr. Thomas Fang Zheng is a full research professor and Vice Dean of the Research Institute of Information Technology (RIIT), Tsinghua University (THU), and Director of the Center for Speech and Language Technologies (CSLT), RIIT, THU.

Since 1988, he has worked on speech and language processing. He has led, or participated as a key member in, the R&D of more than 30 national key projects and international cooperation projects, and has received more than 10 awards from the State Ministry (Commission) of Education, the State Ministry (Commission) of Science and Technology, the Beijing municipal government, and others. He has published over 200 journal and conference papers, 11 of which (3 as first author) received Excellent Paper awards, and 11 books (see http://cslt.riit.tsinghua.edu.cn/~fzheng for details). He has served in many conference, journal, and organizational roles.

He is an IEEE Senior Member, a CCF (China Computer Federation) Senior Member, an Oriental COCOSDA (International Committee for the Coordination and Standardization of Speech Databases and Input/Output Assessment Methods) key member, an ISCA member, an APSIPA (Asia-Pacific Signal and Information Processing Association) member, a council member of the Chinese Information Processing Society of China, a council member of the Acoustical Society of China, and a member of the Phonetic Association of China, among others.

He serves as Council Chair of Chinese Corpus Consortium (CCC), a Steering Committee member and a BoG (Board of Governors) member of APSIPA, Chair of the Steering Committee of the National Conference on Man-Machine Speech Communication (NCMMSC) of China, head of the Voiceprint Recognition (VPR) special topic group of the Chinese Speech Interactive Technology Standard Group, Vice Director of Subcommittee 2 on Human Biometrics Application of Technical Committee 100 on Security Protection Alarm Systems of Standardization Administration of China (SAC/TC100/SC2), a member of the Artificial Intelligence and Pattern Recognition Committee of CCF.

He is an associate editor of IEEE Transactions on Audio, Speech, and Language Processing, a member of editorial board of Speech Communication, a member of editorial board of APSIPA Transactions on Signal and Information Processing, an associate editor of International Journal of Asian Language Processing, and a member of editorial committee of the Journal of Chinese Information Processing.

He has previously served as Co-Chair of the Program Committee of the International Symposium on Chinese Spoken Language Processing (ISCSLP) 2000, a member of the Technical Committee of ISCSLP 2000, a member of the Organization Committee of Oriental COCOSDA 2000, a member of the Program Committee of NCMMSC 2001, a member of the Scientific Committee of the ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology 2002, a member of the Organization Committee and international advisor of the Joint International Conference of SNLP-O-COCOSDA 2002, General Chair of Oriental COCOSDA 2003, a member of the Scientific Committee of the International Symposium on Tonal Aspects of Languages (TAL) 2004, a member of the Scientific Committee and Session Chair of ISCSLP 2004, Chair of the Special Session on Speaker Recognition at ISCSLP 2006, Program Committee Chair of NCMMSC 2007, NCMMSC 2009, and NCMMSC 2011, Tutorial Co-Chair of APSIPA ASC 2009, and General Co-Chair of APSIPA ASC 2011.

He has also worked on building a "study-research-product" pipeline, devoting himself to transferring speech and language technologies to industry, including language learning, embedded speech recognition, speaker recognition for public security and telephone banking, location-centered intelligent information retrieval services, and so on. He holds over 10 patents in various aspects of speech and language technologies.

He has supervised tens of doctoral and master's students, several of whom have won awards, earning him the title of Excellent Graduate Supervisor. His honors include the 1997 Beijing City Patriotic and Contributing Model Certificate, the 1999 National College Young Teacher (Teaching) Award issued by the Fok Ying Tung Education Foundation of the Ministry of Education (MOE), the 2000 1st Prize of the Beijing City College Teaching Achievement Award, the 2001 2nd Prize of the Beijing City Scientific and Technical Progress Award, the 2007 3rd Prize of the Science and Technology Award of the Ministry of Public Security, and the 2009 China "Industry-University-Research Institute" Collaboration Innovation Award.


Lecture 1: Speaker Recognition Systems: Paradigms and Challenges
Speaker recognition applications are becoming more and more popular. In practical applications, however, many factors can degrade system performance.

In this talk, a general introduction to speaker recognition will be presented, including its definition, applications, categories, and key issues in both research and application. Robust speaker recognition technologies useful in practical applications will be outlined, covering cross-channel conditions, multiple speakers, background noise, emotion, short utterances, and time-varying (aging) effects. Recent research on time-varying robust speaker recognition will be detailed.

Performance degradation over time is a generally acknowledged phenomenon in speaker recognition, and it is widely assumed that speaker models should be updated from time to time to maintain their representativeness. However, model updating is costly, user-unfriendly, and sometimes unrealistic, which hinders the technology's practical application. From a pattern recognition point of view, the time-varying issue in speaker recognition calls for features that are speaker-specific and as stable as possible across sessions. Therefore, after searching for and analyzing the most stable parts of the feature space, a discrimination-emphasized Mel-frequency-warping method is proposed. In implementation, each frequency band is assigned a discrimination score that takes into account both speaker and session information, and Mel-frequency warping is performed during feature extraction to emphasize bands with higher scores. Experimental results on a time-varying voiceprint database show that this method not only improves speaker recognition performance, with an EER reduction of 19.1%, but also alleviates the performance degradation caused by time variation, with a reduction of 8.9%.
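The band-emphasis idea can be sketched as follows: invert the cumulative discrimination score to place filter centers so that high-scoring bands receive more, narrower filters. The band edges, scores, and function name below are hypothetical; the published method computes the scores from speaker and session statistics and integrates the warping into MFCC extraction.

```python
import numpy as np

def warped_centers(band_edges_hz, scores, n_filters):
    """Hypothetical discrimination-emphasized warping: distribute filter
    center frequencies uniformly in the *cumulative-score* domain, then map
    back to Hz, so bands with higher discrimination scores get denser,
    narrower filters. `scores` holds one score per band in `band_edges_hz`."""
    scores = np.asarray(scores, dtype=float)
    edges = np.asarray(band_edges_hz, dtype=float)
    cum = np.concatenate([[0.0], np.cumsum(scores)])
    cum /= cum[-1]                                    # normalized cumulative score
    targets = np.linspace(0, 1, n_filters + 2)[1:-1]  # uniform in score domain
    return np.interp(targets, cum, edges)             # invert back to Hz

# 4 bands covering 0-4 kHz; suppose band 2 (1-2 kHz) is most speaker-discriminative
edges = [0, 1000, 2000, 3000, 4000]
scores = [1.0, 3.0, 1.0, 1.0]
centers = warped_centers(edges, scores, n_filters=12)
```

With these (made-up) scores, half of the filters land in the 1-2 kHz band, giving the classifier finer spectral resolution exactly where speakers are claimed to differ most stably across sessions.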

Lecture 2: A Domain-Specific Language Understanding Framework and Its Application in an Intelligent Search Engine
Compared with general web search services such as Google, vertical search is becoming increasingly popular in China, especially with the rapid growth of short-message use. Unlike general web search engines, which rely on keyword-based techniques to provide information retrieval services, a vertical search engine narrows its application to a specific domain, such as travel, hotels, or shopping, so that semantic parsing technologies can be used to retrieve more specific and deeper knowledge.

In this talk, a user-friendly, easy-to-use SDK named SDS Studio, a domain-specific language understanding framework, will be introduced, with detailed information on the design of a robust parser based on a topic forest, a powerful dialogue manager, a keyword extractor, and a semi-automatic grammar writer and checker. With this SDK, a developer can set up a vertical search application efficiently and precisely.
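To give a feel for what domain-specific semantic parsing buys over plain keyword matching, here is a toy slot-filling parser for a travel vertical. The slot names, patterns, and example query are invented for illustration and bear no relation to SDS Studio's actual grammar format; a real topic-forest parser is far more robust than regular expressions.

```python
import re

# A toy domain "grammar" for a travel/restaurant vertical (hypothetical)
slot_patterns = {
    "cuisine": r"(sichuan|cantonese|italian)",
    "place": r"near\s+(\w+)",
    "price": r"under\s+(\d+)\s*yuan",
}

def parse_query(text):
    """Fill domain slots from a free-text query with simple patterns.
    The resulting frame can drive a structured database lookup, which is
    what lets vertical search answer more precisely than keyword matching."""
    frame = {}
    for slot, pattern in slot_patterns.items():
        m = re.search(pattern, text.lower())
        if m:
            frame[slot] = m.group(1)
    return frame

frame = parse_query("Find a Sichuan restaurant near Wudaokou under 100 yuan")
```

A keyword engine would treat "100" and "yuan" as unrelated terms; the slot frame instead yields a typed constraint (price under 100 yuan near a named location) that maps directly onto a structured query in the domain's backend.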

A Location-Centered Services (LCS) application example will also be presented. This application was implemented with the above-mentioned SDS Studio, in collaboration with China Mobile, the largest telecommunications company in China. The services in this application cover food, restaurants, hotel/house renting, transportation, sightseeing, entertainment, shopping (digital products), and so on, and all such information is tied to the user's location of interest in the current query. A digital map is used in the background for the location-related services.