Over the past two decades, multimedia technology has influenced many aspects of our daily lives. Alongside biotechnology and nanotechnology, it has been regarded as one of the three most promising industries of the twenty-first century. Multimedia research covers a broad range of techniques and rich applications, including work on music, video, images, text, and 3-D animation. In the coming years, we will continue to devote our research efforts to advancing key fields in multimedia, including multi-perspective computer vision, compressive sensing/sparse representation, and video forensics. In what follows, we describe some of these key fields in detail.
A. Video Forensics
Since the September 11 attacks on the United States, counter-terrorism strategies have been given high priority in many countries. Surveillance camcorders are now almost ubiquitous in modern cities. As a result, the amount of recorded data is enormous, and searching digital video content manually is time-consuming. In the next few years, we will devote part of our effort to video forensics, in which a major proportion of the research involves mining videos recorded by a heterogeneous collection of surveillance camcorders for criminal evidence. This is a new interdisciplinary field whose practitioners need video processing skills as well as in-depth knowledge of forensic science; hence the barrier to entry is high. Mining surveillance videos directly for criminal evidence is very different from conventional crime scene investigation. In the latter, detectives must actually visit the crime scene, check all available details, and collect as much physical evidence as possible. By contrast, to conduct crime scene investigations directly from surveillance videos, forensic experts must develop software that facilitates the automatic detection, tracking, and recognition of objects in the videos. Because the videos are captured by heterogeneous camcorders, evidence mining on them is all the more challenging. We will start by addressing the multiple-camera people counting problem as well as visual knowledge transfer among a heterogeneous collection of surveillance camcorders.
B. Compressive Sensing and Sparse Representation
Compressed Sensing/Sampling (CS) is a revolutionary technology for simultaneously sensing and compressing signals, and it establishes a new sampling theory that goes beyond the Nyquist rate. CS performs joint data acquisition and compression at slight cost to the encoder (important for resource-limited mobile devices and sensors), shifting the major computational overhead to the decoder. Under the assumption of signal sparsity, CS can, in theory, perfectly reconstruct the original signal from (far) fewer measurements via convex optimization or greedy algorithms. This completely new idea has made CS a hot topic in signal processing-related fields since its first appearance in 2006. Moreover, for problems that are inherently sparse or can be sparsified, CS has been adopted across broad areas. Undoubtedly, this emerging area opens opportunities for the study of both fundamental issues and application-oriented problems. We plan to study the following topics: (1) Fast Compressed Image Sensing (CIS); (2) Fast Orthogonal Matching Pursuit (FOMP); (3) multiple-input systems exploiting sparse representation (e.g., microphone array signal processing); and (4) single-pass codeword learning for sparse representation.
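To make the greedy reconstruction route concrete, the sketch below shows standard Orthogonal Matching Pursuit (not our FOMP variant) in Python/NumPy, recovering a synthetic k-sparse signal from random Gaussian measurements. The matrix dimensions, sparsity level, and random seed are illustrative assumptions, not parameters from our work.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily recover a k-sparse x from y = A x."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Re-fit all selected coefficients by least squares on the support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n, m, k = 128, 64, 4                          # signal length, measurements (m < n), sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)  # random Gaussian sensing matrix
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x                                     # compressed measurements
x_hat = omp(A, y, k)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

The encoder-side cost here is only the matrix-vector product `A @ x`; all the optimization effort sits in the decoder-side `omp` call, which is exactly the asymmetry that makes CS attractive for resource-limited sensors.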
C. Multi-perspective computer vision
Making computers capable of perceiving real-world visual information from various cues is challenging because of high-complexity concepts, changing environments, free motion, high articulation, and so on. Because many visual concepts are difficult to summarize in simple, plain rules, (statistical) machine learning has played an important role in the past decade (as witnessed at the main conferences such as CVPR, ICCV, and NIPS) and is still expected to be vital to the progress of computer vision. In addition, given the considerable growth in data volume in the Internet age, training on large-scale (and possibly noisy) datasets has become a significant issue. Furthermore, instead of observing the world only through color images from common viewing angles, 3D imaging (providing additional depth information) and flying cameras (providing uncommon bird's-eye viewing angles) could open opportunities for novel applications in the near future. High-level visual concepts, such as aesthetics, have also been shown to be tractable by machine learning. To address these issues, we will study several topics toward understanding visual information from multiple perspectives: (1) object detection, recognition, and segmentation from visual saliency; (2) tracking and interacting with flying cameras; (3) on-line aesthetic value assessment while shooting; and (4) deriving the 3D structure of conventional camera images. The research outcomes are expected to help computers understand human intention, assist people toward safer and higher-quality lives, and support robots in seeing and understanding the world better.