
INVITED SESSION

KEYNOTE SPEECHES

Automatic Speech Recognition: Trials, Tribulations and Triumphs

Sadaoki Furui
Department of Computer Science
Tokyo Institute of Technology, Japan
President, APSIPA

Biodata:

SADAOKI FURUI received the B.S., M.S., and Ph.D. degrees from the University of Tokyo, Tokyo, Japan, in 1968, 1970, and 1978, respectively.  Since joining the Nippon Telegraph and Telephone Corporation (NTT) Labs in 1970, he has worked on speech analysis, speech recognition, speaker recognition, speech synthesis, speech perception, and multimodal human-computer interaction.

From 1978 to 1979, he was a visiting researcher at AT&T Bell Laboratories, Murray Hill, New Jersey.  He was a Research Fellow and the Director of Furui Research Laboratory at NTT Labs and is currently a Professor at the Department of Computer Science, Tokyo Institute of Technology.

He has authored or coauthored over 900 published papers and books, including "Digital Speech Processing, Synthesis and Recognition." He has received the Paper Award and the Achievement Award from the Institute of Electronics, Information and Communication Engineers of Japan (IEICE) (1975, 1988, 1993, 2003, 2003, 2008), and the Paper Award from the Acoustical Society of Japan (ASJ) (1985, 1987). He has received the Senior Award and the Society Award from the IEEE Signal Processing Society (1989, 2006), the International Speech Communication Association (ISCA) Medal for Scientific Achievement (2009), and the IEEE James L. Flanagan Speech and Audio Processing Award (2010). He has also received the Achievement Award from the Minister of Science and Technology and the Minister of Education, Japan (1989, 2006), and the Purple Ribbon Medal from the Japanese Emperor (2006).

Abstract:

Automatic speech recognition (ASR) technology has made remarkable progress over the last 20-30 years. ASR represents the state of the art in simulating some aspects of human cognition and, although ASR systems are still imperfect, it is quite impressive that they work as well as they do. There remain, however, many difficult issues and challenges at every level. In most ASR applications, computers still make 5-10 times more errors than human listeners. One of the most significant differences is that humans are far more flexible and adaptive than machines in handling variations in speech, including individuality, speaking style, additive noise, and channel distortions. For this reason, there are as yet only a handful of good applications, and they are generally limited in their domain and in the conditions under which they may be used. How to train and adapt statistical models for ASR using limited amounts of data is one of the most important research issues. Future systems will need an efficient way of representing, storing, and retrieving the various knowledge resources required for natural spoken conversation.

Information Anti-Forensics

K. J. Ray Liu
Department of Electrical and Computer Engineering
University of Maryland, College Park, USA
President-Elect, IEEE Signal Processing Society

Biodata:

Dr. K. J. Ray Liu was named a Distinguished Scholar-Teacher of the University of Maryland in 2007, where he is the Cynthia Kim Eminent Professor of Information Technology in the Department of Electrical and Computer Engineering. He leads the Maryland Signals and Information Group, conducting research encompassing broad aspects of wireless communications and networking, information forensics and security, multimedia signal processing, and biomedical engineering.

Dr. Liu is the recipient of numerous honors and awards, including the 1994 National Science Foundation Presidential Young Investigator Award; best paper awards from the IEEE and EURASIP; the IEEE Signal Processing Society 2004 Distinguished Lecturership; the EURASIP 2004 Meritorious Service Award; and the 2009 IEEE Signal Processing Society Technical Achievement Award. A Fellow of the IEEE and AAAS, he is recognized by Thomson Reuters as an ISI Highly Cited Researcher. Dr. Liu is President-Elect of the IEEE Signal Processing Society. He was the Editor-in-Chief of IEEE Signal Processing Magazine and the founding Editor-in-Chief of the EURASIP Journal on Advances in Signal Processing.

Dr. Liu has also received various research and teaching recognitions from the University of Maryland, including the Poole and Kent Senior Faculty Teaching Award and the Outstanding Faculty Research Award, both from the A. James Clark School of Engineering, and the Invention of the Year Award from the Office of Technology Commercialization. His recent books include Behavior Dynamics in Media-Sharing Social Networks, Cambridge University Press, 2011; Cognitive Radio Networking and Security: A Game Theoretical View, Cambridge University Press, 2010; Handbook on Array Processing and Sensor Networks, IEEE-Wiley, 2009; Cooperative Communications and Networking, Cambridge University Press, 2008; Resource Allocation for Wireless Networks: Basics, Techniques, and Applications, Cambridge University Press, 2008; Ultra-Wideband Communication Systems: The Multiband OFDM Approach, IEEE-Wiley, 2007; Network-Aware Security for Group Communications, Springer, 2007; and Multimedia Fingerprinting Forensics for Traitor Tracing, Hindawi, 2005.

Abstract:

The rise of the Internet coupled with the widespread availability of digital cameras and audio recorders has created an environment in which society relies heavily on digital multimedia content to communicate information.  Because digital content can easily be manipulated using editing software, a number of forensic techniques have been developed to verify the authenticity of digital multimedia content.   By contrast, very little consideration has been given to anti-forensic operations capable of disguising evidence of tampering in digital multimedia content.  The study of anti-forensics is critically important to digital information security and assurance because it makes security researchers aware of vulnerabilities in existing forensic techniques which digital forgers may attempt to secretly exploit.  By investigating and analyzing digital anti-forensic techniques, researchers may be able to develop methods capable of detecting when an anti-forensic operation has been used to modify digital multimedia content. In this talk, we will discuss a variety of anti-forensic operations, in particular several anti-forensic techniques designed to remove evidence of an image’s compression history.  Using these techniques we are able to perform a number of image tampering operations such as double compression, cut-and-paste image forgery, and image origin falsification, and render them forensically undetectable.
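The compression-history example above can be illustrated with a toy simulation (not from the talk; the distribution, quantization step, and dither are all illustrative assumptions). JPEG-style quantization forces DCT coefficients onto multiples of the quantization step, producing a comb-shaped histogram that a forensic detector can spot; an anti-forensic dither spreads values back across each bin so the histogram looks continuous again:

```python
import numpy as np

rng = np.random.default_rng(0)
q = 8  # hypothetical quantization step

# Laplacian-like AC DCT coefficients of an uncompressed image
coeffs = rng.laplace(0.0, 10.0, size=100_000)

# JPEG-style quantization leaves values only at multiples of q
quantized = np.round(coeffs / q) * q

def comb_fraction(x, q, tol=1e-9):
    """Simple forensic statistic: fraction of coefficients lying
    exactly on multiples of q (the tell-tale quantization comb)."""
    return np.mean(np.abs(x - np.round(x / q) * q) < tol)

# Anti-forensic dither: spread each value across its quantization bin
# so the histogram looks continuous again (real anti-forensic methods
# shape this noise to match a coefficient model, not a uniform one).
dithered = quantized + rng.uniform(-q / 2, q / 2, size=quantized.shape)

print(comb_fraction(coeffs, q))     # near 0: original is continuous
print(comb_fraction(quantized, q))  # 1.0: every value on the comb
print(comb_fraction(dithered, q))   # near 0 again: comb erased
```

A detector relying only on the comb statistic would thus be fooled by the dithered image, which is exactly the vulnerability the talk highlights.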

Perspectives on Wireless Broadband Technology

Longming Zhu
CTO of Standardization,
ZTE Corporation,
Beijing

Biodata:

Zhu Longming graduated from the Department of Information Physics, Nanjing University, in 1989. He worked on telecommunications equipment design at Nanjing P&T Corp. from 1989 to 1995, and on CDMA development at the Mobile Telecommunication R&D Center of the MPT from 1996 to 1998. Since 1998 he has worked on system design and standardization at ZTE Corp. Since 2006 he has been a member of the expert working group of the national science and technology major project on new-generation broadband wireless mobile telecommunication networks. Projects he has worked on have received awards from the Shenzhen, Guangdong, and central governments.

Abstract:

This talk will review and analyze hot technologies such as 4G, LTE-LAN, and C-RAN, and briefly survey their standardization progress.

The following six tutorial sessions will be held on October 18, 2011. Tutorials 1-3 will be held in the morning, while Tutorials 4-6 will be held in the afternoon.

Morning session (9.00 am to 12.20 pm)

Tutorial 1: Three Emerging Image Models: Theory, Comparison and Application
Speaker: Prof. C.-C. Jay Kuo, Signal and Image Processing Institute, University of Southern California, USA
Abstract: Three new image models have received a great deal of attention in the last decade. First, in the context-based image model, image pixels are classified into multiple classes depending on the distribution of their neighboring pixels, which is called a context. The well-known non-local means (NLM) denoising algorithm is built upon this model. Second, by exploiting the strong spatial and temporal correlations in images and video, the sparse representation model has been developed and applied to image processing tasks such as denoising, deblurring, inpainting and super-resolution. Finally, the total variation model is effective in categorizing and decomposing signals of different spatial frequencies in a given image region. In this tutorial, I will give an overview of these three models, explain their strengths and weaknesses, and propose ways to integrate them for the best possible performance.
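The NLM algorithm mentioned above can be sketched in a few lines (a toy illustration, not the tutorial's material; patch size, search window, and the filtering parameter h are arbitrary choices):

```python
import numpy as np

def nlm_denoise(img, patch=3, search=7, h=0.1):
    """Minimal non-local means: each pixel is replaced by a weighted
    average of pixels whose surrounding patches (contexts) look similar."""
    pad = patch // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img, dtype=float)
    rows, cols = img.shape
    s = search // 2
    for i in range(rows):
        for j in range(cols):
            p_ref = padded[i:i + patch, j:j + patch]  # patch around (i, j)
            wsum, acc = 0.0, 0.0
            for di in range(max(0, i - s), min(rows, i + s + 1)):
                for dj in range(max(0, j - s), min(cols, j + s + 1)):
                    p = padded[di:di + patch, dj:dj + patch]
                    d2 = np.mean((p_ref - p) ** 2)
                    w = np.exp(-d2 / (h * h))  # similar patches weigh more
                    wsum += w
                    acc += w * img[di, dj]
            out[i, j] = acc / wsum
    return out

# Denoise a tiny synthetic image: a constant region plus Gaussian noise
rng = np.random.default_rng(1)
clean = np.ones((16, 16)) * 0.5
noisy = clean + rng.normal(0, 0.05, clean.shape)
den = nlm_denoise(noisy)
print(np.std(noisy - clean), np.std(den - clean))  # residual noise drops
```

The patch comparison is what makes NLM "context-based": pixels are averaged not with their spatial neighbors, but with pixels whose neighborhoods look alike anywhere in the search window.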

Tutorial 2: A Tutorial on Deep Learning for Signal and Information Processing
Speaker: Prof. Li Deng, Principal Researcher, Microsoft Research, Redmond, WA, and Affiliate Professor, Department of Electrical Engineering, University of Washington, USA
Abstract: Today, signal processing research has a significantly wider scope than just a few years ago, and machine learning has become an important technical area of our signal processing community. Since 2006, deep learning, a new area of machine learning research, has emerged, impacting a wide range of signal and information processing work within both the traditional and the newly widened scope. Various workshops, such as the 2011 ICML Workshop on Learning Architectures, Representations, and Optimization for Speech and Visual Information Processing, the 2009 ICML Workshop on Learning Feature Hierarchies, the 2008 NIPS Deep Learning Workshop, and the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, as well as an upcoming special issue on Deep Learning for Speech and Language Processing in the IEEE Transactions on Audio, Speech, and Language Processing (2011), have been devoted exclusively to deep learning and its applications to classical signal processing areas. We have also seen government-sponsored research on deep learning (e.g., the DARPA deep learning program). The main purpose of this tutorial is to review the basic mathematical and machine learning framework and to introduce the audience to the emerging technologies enabled by deep learning. I will also review the research conducted in this area since the birth of deep learning in 2006, especially work of direct relevance to signal processing, and discuss future research directions from my personal perspective. It is hoped that this tutorial will attract the interest of signal processing researchers, students, and practitioners in this emerging area, advancing signal and information processing technology and applications in the future.

Tutorial 3: Computer-assisted Language Learning (CALL) based on Speech Technologies
Speakers: Prof. Tatsuya Kawahara, Kyoto University, Japan, and Prof. Nobuaki Minematsu, University of Tokyo, Japan
Abstract: The advancement of speech and language technologies has opened new perspectives on computer-assisted language learning (CALL) systems, such as automatic pronunciation assessment and dynamic conversational-style lessons. CALL is also regarded as one of the new and promising applications of speech analysis and recognition. CALL covers a variety of aspects, including segmental, prosodic and lexical features. Modeling non-native speech so as to correctly segment and recognize utterances while detecting the errors they contain poses a number of problems to be solved. Assessing the intelligibility of non-native speech, or the proficiency of non-native speakers, is also an important issue. In this tutorial, we will give an overview of these issues and current solutions, and present practical CALL systems.

Afternoon session (2.00 pm to 5.20 pm)

Tutorial 4: Progress towards Three-dimensional Television
Speaker: Prof. Yo-Sung Ho, Gwangju Institute of Science and Technology, Korea
Abstract: In recent years, various multimedia services have become available and the demand for three-dimensional television (3DTV) is growing rapidly. Since 3DTV is considered the next-generation broadcasting service, able to deliver real and immersive experiences through user-friendly interaction, a number of advanced three-dimensional video technologies have been studied. Among them, multi-view video coding (MVC) is the key technology for applications including free-viewpoint video (FVV), free-viewpoint television (FVT), 3DTV, immersive teleconferencing, and surveillance systems. In this tutorial lecture, we cover the current state-of-the-art technologies for 3DTV. After defining the basic requirements for realistic broadcasting services, we cover various multi-modal immersive media processing technologies. We also compare two different approaches to 3DTV, depth-based and multi-view camera systems, and discuss a hybrid camera system implementation combining both approaches.
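The depth-based approach mentioned above rests on a simple geometric relation: a pixel's horizontal shift (disparity) between two cameras is d = f·b/Z, where f is the focal length, b the camera baseline, and Z the depth. A minimal depth-image-based rendering sketch (illustrative only; function names and the hole-free toy scene are assumptions, and real systems must handle occlusions and disocclusion holes):

```python
import numpy as np

def depth_to_disparity(depth, focal, baseline):
    """Disparity (pixels) between two horizontally shifted cameras:
    d = f * b / Z, so near objects (small Z) shift more than far ones."""
    return focal * baseline / depth

def render_virtual_view(image, depth, focal, baseline):
    """Naive depth-image-based rendering: shift each pixel left by its
    disparity to synthesize the neighboring camera's view."""
    h, w = image.shape
    out = np.zeros_like(image)
    disp = np.round(depth_to_disparity(depth, focal, baseline)).astype(int)
    for y in range(h):
        for x in range(w):
            nx = x - disp[y, x]
            if 0 <= nx < w:
                out[y, nx] = image[y, x]
    return out

# Two depth planes: the near half (Z=1) shifts 4 px, the far half (Z=4) 1 px
img = np.arange(64, dtype=float).reshape(8, 8)
depth = np.where(np.arange(8)[None, :] < 4, 1.0, 4.0) * np.ones((8, 8))
view = render_virtual_view(img, depth, focal=4.0, baseline=1.0)
```

This is why depth-based systems can synthesize arbitrary intermediate viewpoints from one texture-plus-depth stream, at the cost of rendering artifacts that multi-view camera systems avoid by capturing the views directly.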

Tutorial 5: Audio Projection: Directional Sound and Its Applications in Immersive Communication
Speaker: Prof. Woon-Seng Gan, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Abstract: In this tutorial, we present the concepts behind how a directional audio beam can be projected to a listening zone via a directional loudspeaker. We highlight how digital signal processing techniques can enhance the quality and sound pressure level of the audible sound, and explain how the directional sound beam can be digitally steered to different listening zones. In addition, we examine the need for psychoacoustic processing to enhance the 3D spatial perception and audio quality of the directional loudspeaker. We will also review some of the new work currently being carried out by research groups in Asia. Finally, we will describe the significance of parametric loudspeakers in immersive communication, put forward some research challenges in directional loudspeakers, and discuss how signal processing techniques may help to overcome them.
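Digital beam steering of the kind described above is, at its simplest, delay-and-sum: each element of a linear array is delayed by tau_n = n·d·sin(theta)/c so the wavefronts add coherently in the steering direction. A toy sketch of the idea (a generic phased-array illustration, not the tutorial's parametric-loudspeaker processing; array geometry and frequency are arbitrary):

```python
import numpy as np

def steering_delays(n_elems, spacing, angle_deg, c=343.0):
    """Per-element time delays (s) that steer a linear array's beam to
    angle_deg off broadside: tau_n = n * d * sin(theta) / c."""
    theta = np.radians(angle_deg)
    n = np.arange(n_elems)
    tau = n * spacing * np.sin(theta) / c
    return tau - tau.min()   # keep all delays non-negative

def array_gain(n_elems, spacing, steer_deg, look_deg, freq, c=343.0):
    """Normalized response at look_deg when the array is steered to
    steer_deg: coherent (gain 1) on-beam, partially cancelling off-beam."""
    tau = steering_delays(n_elems, spacing, steer_deg, c)
    theta = np.radians(look_deg)
    n = np.arange(n_elems)
    # residual phase of each element as observed from look_deg
    phase = 2 * np.pi * freq * (n * spacing * np.sin(theta) / c - tau)
    return abs(np.exp(1j * phase).sum()) / n_elems

# An 8-element array steered to 20 degrees: gain peaks at 20, not at 0
print(array_gain(8, 0.04, 20, 20, 4000))  # 1.0 on-beam
print(array_gain(8, 0.04, 20, 0, 4000))   # much smaller off-beam
```

Parametric loudspeakers achieve far narrower beams than this at audible frequencies by modulating audio onto an ultrasonic carrier, but the digital steering principle is the same.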

Tutorial 6: Facial Image Analysis for Video Surveillance
Speaker: Prof. Kenneth K.M. Lam, Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong
Abstract: The analysis of facial images is a challenging task, as a human face may vary in appearance due to facial expressions, aging, perspective variations, lighting conditions, etc. The major technologies involved include human face detection, facial-feature detection and extraction, face recognition, 3D face reconstruction, facial-expression analysis and recognition, and facial-image super-resolution. These enabling technologies make possible many applications, such as security and surveillance systems, face animation, human-machine interfaces, and video-conferencing. In this tutorial, we focus on the techniques used for video surveillance. Human faces are the most important objects in surveillance video, and the faces that appear are usually of low resolution. Consequently, a number of key technologies are required to detect and identify people in videos accurately. All the faces in a video must first be detected accurately in real time. Because these detected faces are of poor quality and low resolution, their resolution must be increased, i.e., facial-image super-resolution is performed, and the 3D information about the faces can be recovered, so that accurate face recognition can ultimately be achieved. We will highlight the technologies relating to facial image analysis, address applications and improved systems based on them, and then present techniques for face detection, facial-image super-resolution, and facial-feature extraction for face recognition. From this tutorial, the audience will learn the following techniques:
1. Algorithms for real-time face detection using the AdaBoost algorithm;
2. Techniques for facial-image super-resolution: patch-based, region-based, and holistic approaches using eigen-transformation and kernel methods;
3. Efficient algorithms for reconstructing 3D faces from face images under different poses; and
4. Efficient algorithms for facial-feature extraction and for using these features for face recognition.
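The AdaBoost scheme named in item 1 can be sketched with decision stumps (a generic boosting illustration on toy 2-D data, not the tutorial's detector; in Viola-Jones-style face detection the features would be Haar-like responses on image windows):

```python
import numpy as np

def train_adaboost(X, y, n_rounds=10):
    """Minimal AdaBoost: repeatedly fit the best weighted decision stump
    (threshold on one feature), then up-weight the examples it misses."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)              # example weights
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weak learner's vote
        pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)          # up-weight mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(s * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy "face" (+1) vs "non-face" (-1) feature vectors: separable blobs
rng = np.random.default_rng(0)
pos = rng.normal([2, 2], 0.5, (50, 2))
neg = rng.normal([0, 0], 0.5, (50, 2))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [-1] * 50)
model = train_adaboost(X, y, n_rounds=5)
print(np.mean(predict(model, X) == y))  # training accuracy near 1.0
```

Real-time detectors additionally arrange such boosted classifiers into a cascade, so that the cheap early stages reject most non-face windows before the expensive later stages run.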

Overview Session A

Overview Session B