SCHEDULE
---------------------------------
Date
July 11 - 12, 2007

July 11
Time Speakers Title
9:00 - 9:05 Bin Yu Opening Remarks
9:05-9:15   A brief introduction of Pao-Lu Hsu and his work
9:15-10:25 Keynote,
Robert Schapire
Boosting: Theory and Applications
download PDF
10:25-10:40 Break
10:40-11:30 Tong Zhang Some improved boosting methods
download PDF
11:30-12:20 Bin Yu Group and Hierachical Sparsity through Composite Absolute Penalities
download PDF
12:20- 13:40 Lunch
13:40 -14:50 Keynote,
Jerome H. Friedman
Predictive Learning via Rule Ensembles
download PDF
14:50-15:10 Break
15:10-15:50 Alice Zheng Statistical Failure Diagnosis in Software and Systems
download PDF
15:50-16:30 Tie-Yan Liu Learning to Rank: From pariwise approach to Listwise approach
download PDF
16:30-17:20 Hang Li AdaRank: A Boosting Algorithm for Information Retrieval
download PDF
July 12
9:00 -10:10 Keynote,
Andrew McCullum
Bayesian Models of Social Networks and Text
download PDF
10:10-10:30 Break
10:30- 11:20 Partha Niyogi A Geometric Perspective on Machine Learning and Related Algorithms
11:20-12:00 Changshui Zhang Semi-Supervised Learning and Interactive Image Segmentation
download PDF
12:00-13:30 Lunch
13:30-14:20 Martin Wainwright Variational methods in Markov random fields: Stability and the benefits of choosing the "wrong" model
download PDF
14:20-15:00 Zhi-Hua Zhou Multi-Instance Learning Revisited
download PDF
15:00-15:20 Break
15:20-16:10 Alex Smola Hilbert Space representations and tests for distributions
download PDF
16:10-16:50 John Langford Tournament Algorithms for Reducing Multiclass Prediction to Binary Prediction
16:50-17:30 Tiefeng Jiang Approximation of Haar Distributed Matrices and Limiting Distributions of Eigenvalues of Jacobi Ensembles
download PDF

Venue
Peking University (see campus map)
Sciences Building No.4  (Life Science Building)
No.5 The Summer Palace Street,  Haidian District, Beijing.

Abstract

Statistical Failure Diagnosis in Software and Systems
Alice Zheng
Carnegie Mellon University

As software and systems become increasingly complex, the task of debugging also becomes increasingly difficult.  Manual diagnosis can require sifting through millions of lines of code and output logs.  In addition, large systems contain many components, each complex on its own, and often interacting in unexpected ways.

I present a case study illustrating how statistical machine learning algorithms, along with appropriate system instrumentation, can aid in failure diagnosis.  I propose a statistical software debugging framework that collects information from past successes and failures via fine-grained instrumentation of the program and then analyzes this information to locate suspicious program predicates.  I discuss the algorithmic challenges of the approach, and demonstrate a bi-clustering algorithm that is effective at simultaneously clustering failed runs and selecting useful predicates.  Using this approach, it took a programmer 20 minutes to find a long-standing bug in a real-world software program which he had never seen before.

This work is done in collaboration with Ben Liblit (U. Wisconsin, Madison), Michael Jordan (U.C. Berkeley), Alex Aiken and Mayur Naik (Stanford). 

 

Predictive Learning via Rule Ensembles
Jerome H. Friedman
Stanford University

General regression and classification models are constructed as linear combinations of simple rules derived from the data. Each rule consists of a conjunction of a small number of simple statements concerning the values of individual input variables. These rule ensembles are shown to produce predictive accuracy comparable to the best methods. However their principal advantage lies in interpretation. Because of its simple form, each rule is easy to understand, as is its influence on individual predictions, selected subsets of predictions, or globally over the entire space of joint input variable values. Similarly, the degree of relevance of the respective input variables can be assessed globally, locally in different regions of the input space, or at individual prediction points. Techniques are presented for automatically identifying those variables that are involved in interactions with other variables, the strength and degree of those interactions, as well as the identities of the other variables with which they interact. Graphical representations are used to visualize both main and interaction effects.

 

Tournament Algorithms for Reducing Multiclass Prediction to Binary Prediction
 
John Langford
Toyota Technological Institute at Chicago
 

I'll describe a new family of algorithms for reducing from multiclass classification to binary classification. The first member of this family is a single elimination tournament, as is often used in gaming competitions. Other members of this family correspond to double elimination and higher order elimination tournaments.

These reductions are provably consistent, implying that a zero regret binary classifier induces a zero regret multiclass classifier. They are also robust, implying that small errors in binary classification imply small errors in multiclass classification. The robustness is superior to all other known algorithms.

 

Variational methods in Markov random fields: Stability and the benefits of choosing the "wrong" model
Martin Wainwright
Department of Statistics, UC Berkeley

Markov random fields (MRFs) are widely used in different areas of statistics and computer science, including statistical machine learning, spatial statistics, statistical signal processing, and communication theory. Variational methods are a class of algorithms, either of an exact or approximate nature, for solving challenging computational problems that arise in applications of MRF models, including marginalization and computing likelihoods.

The first part of this talk will be a tutorial introduction to variational methods in Markov random fields, including mean field theory, belief propagation and extensions, and various higher-order relaxations. In the second part, we consider a problem of joint model estimation and prediction, in which a Markov random field is first estimated on the basis of one data set, and then used to perform prediction on new samples.

Here we demonstrate the remarkable fact that estimating the ``wrong'' model can be beneficial in the computation-limited setting.  Indeed, we prove that an inconsistent estimate yields better prediction performance than the true underlying model, even in the infinite data limit. We discuss various other contexts in which similar behavior is likely to occur when approximate algorithms are used.

 

A Geometric Perspective on Machine Learning and Related Algorithms
Partha Niyogi
University of Chicago

Increasingly, we face machine learning problems in very high dimensional spaces. We proceed with the intuition that although natural data lives in very high dimensions, they have relatively few degrees of freedom. One way to formalize this intuition is to model the data as lying on or near a low dimensional manifold embedded in the high dimensional space. This point of view leads to a new class of algorithms that are "manifold motivated" and a new set of theoretical questions that surround their analysis. A central construction in these algorithms is a graph or simplicial complex that is data-derived and we will relate the geometry of these to the geometry of the underlying manifold. Applications to embedding, clustering, classification, and semi-supervised learning will be considered.

 

Boosting: Theory and Applications
Robert Schapire
Princeton University

Boosting is a general method for producing a very accurate classification rule by combining rough and moderately inaccurate "rules of thumb."  While rooted in a theoretical framework of machine learning, boosting has been found to perform quite well empirically. In this talk, I will focus on the boosting algorithm AdaBoost, and explain the underlying theory of boosting, including our explanation of why boosting often does not suffer from overfitting, as well as recent developments in the ongoing debate surrounding this explanation. Some of the myriad other theoretical points of view that have been taken on AdaBoost will also be presented.  Finally, I will describe some recent applications of boosting.

 

Some improved boosting methods
Tong Zhang
Rutgers University & Yahoo

We consider the problem of fitting complex loss functions using an arbitrary regression weak learner. Our approach, extending gradient boosting, is based on greedy optimization of quadratic upper bounds (or approximation) of the loss functions. This idea allows us to present a rigorous convergence analysis of the resulting algorithms based on various results for greedy least squares methods. We compare a few methods, their convergence rates, as well as the underlying mechanism for complexity control.

 

Semi-Supervised Learning and Interactive Image Segmentation
Changshui Zhang
Tsinghua Univeristy

The recent years have witnessed a surge of interest in semi-supervised learning in machine learning community. And graph based semi-supervised learning has been becoming on of the hot research topics in semi-supervised learning (SSL) fields. Despite its empirical success, there are still some problems that are not properly resolved. In this talk, I will introduce our novel graph based SSL algorithm called Linear Neighborhood Propagation (LNP). Based on the assumption that each data point can be linearly reconstructed from its neighborhood, LNP can automatically get the similarities between pairwise points, and such similarities will be used to propagate the labels to the unlabeled points. We prove theoretically the convergence of our algorithm and also derive a simple way for using LNP to induction. To demonstrate the effectiveness of LNP, we apply it to interactive image segmentation, in which the user first give some hints on the foreground and background, and the LNP algorithm will be used to propagate these hints to the rest of the pixels. We will see that in this application LNP can do a perfect job and give us satisfactory results.

 

Multi-Instance Learning Revisited
Zhi-Hua Zhou
Nanjing University

 Multi-instance learning is a branch of machine learning which has been studied by many researchers during the past decade. Here the training set is composed of many bags each contains many instances; a bag is positive if it contains at least one positive instance and negative otherwise; although the labels of the training bags are known, the labels of the instances are unknown. In this talk, we will show that by assuming i.i.d. instances, which is an assumption explicitly or implicitly taken by most previous studies, multi-instance learning can be regarded as a special case of semi-supervised learning, which is another branch of machine learning, and we suggest that only i.i.d. bags should be assumed in future research of multi-instance learning.

 

Group and Hierachical Sparsity through Composite Absolute Penalities
Bin Yu
University of California at Berkeley

Extracting useful information from high-dimensional data is the focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the vitures of both regularization and sparsity, the L1-penalized L2 minimization method Lasso has been popular.

In this talk, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family which allows the grouping and hierarchical relationships between the predictors to be expressed. The CAP framework covers and goes beyond existing works including grouped lasso and elastic nets.CAP penalties are built by defining groups and combining the properties of norm penalties at the across group and within group levels. Grouped selection occurs for non-overlapping groups.

In that case, we give a Bayesian interpretation for CAP penalties.Hierarchical variable selection is reached by defining groups with particular overlapping patterns.In the computation aspect, we propose using the BLASSO and cross-validation to obtain CAP estimates. For a subfamily of CAP estimates, we introduce the iCAP ($\infty$-CAP) algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived allowing the regularization parameter to be selected with cross-validation. CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments including those when $p>>n$ and when the grouping is wrong.  When the complexity of a model is properly calculated, iCAP is seen to be parsimonious as well in the experiments. 

 

AdaRank: A Boosting Algorithm for Information Retrieval
Hang Li
Microsoft Research Asian

In this talk, I will introduce a machine learning algorithm for constructing ranking functions in information retrieval, which we have developed recently. I will first give an overview on information retrieval using machine learning. I will explain the ranking models, evaluation criteria, and data collection methods widely used in information retrieval. The models include BM25, Language Model for IR, the general model of Learning to Rank. The evaluation criteria include NDCG (Normalized Discounted Cumulative Gain) and Kendall's tau. The data collection methods are explicit data labeling and implicit data labeling. I will discuss that Ideally a learning algorithm would train a ranking model that could directly optimize the performance measures with respect to the training data, in contrast to the fact that many existing methods were only able to train ranking models by minimizing loss functions loosely related to the performance measures.  I will next introduce our proposal on AdaRank, a boosting algorithm, which can minimize a loss function directly defined on the performance measures. I will also explain the theoretical results and empirical results we have obtained with regard to AdaRank. This is joint work with Jun Xu at MSRA.

 

Hilbert Space representations and tests for distributions
Alex Smola
University of Australia

Entropy and mutual information are popular methods for dealing with distributions. In this talk I will discuss alternative measures for computing distances between distributions based on Reproducing Kernel Hilbert Spaces. This leads to a family of methods ranging from  computationally efficient two sample tests over tests for independence of random variables, algorithms for feature selection, clustering, covariate shift correction and variants of maximum  variance unfolding. The talk will discuss several of those applications and algorithms arising from this setting.

 

Bayesian Models of Social Networks and Text
Andrew McCallum
University of Massachusetts Amherst

In this talk I will describe recent research at the intersection of information extraction, data mining and social network analysis.  In particular I will focus on how such a combination can be made both robust and scalable---showing that the typical brittle cascading of errors from text extraction to data mining can be avoided with unified probabilistic inference in graphical models, and showing that these models can be made efficient with recent methods of approximate inference and learning. After briefly introducing conditional random fields, I will demonstrate their use in joint models of extraction, entity resolution, and sequence alignment. I will then describe several methods of integrating textual data into a particular type of data mining---social network analysis.  In one model, we discover role-similarity between entities by examining not only network connectivity, but also the words communicated on on those edges; I'll demonstrate this method on a large corpus of email data.

In another model, we discover groups of entities and the "topical" conditions under which different groupings arise; I'll demonstrate this on coalition discovery from many years worth of voting records in the United Nations. I'll conclude with further examples of graphical models successfully applied to relational data, as well as discussion of their applicability to trend analysis, expert-finding and bibliometrics.

 

Learning to Rank: From pariwise approach to Listwise approach
Tie-Yan Liu
Microsoft Research Asia

The talk is concerned with learning to rank, which is to construct a model or a function for ranking objects. Learning to rank is useful for document retrieval, collaborative filtering, and many other applications. Several methods for learning to rank have been proposed, which take object pairs as `instances¨ in learning. We refer to them as the pairwise approach. Although the pairwise approach offers advantages, it ignores the fact that ranking is a prediction task on list of objects. This talk postulates that learning to rank should adopt the listwise approach in which lists of objects are used as `instances¨ in learning. We will mainly introduce an algorithm named ListNet, in which we propose a listwise loss function based on probability model. Experimental results on information retrieval show that the proposed listwise approach performs better than the pairwise approach.

 

Approximation of Haar Distributed Matrices and Limiting Distributions of Eigenvalues of Jacobi Ensembles
Tiefeng Jiang
University of Minnesota

I will first present tools to approximate the entries of a large dimensional real and complex Jacobi ensembles with independent complex Gaussian random variables. Based on this, we obtain the Tracy-Widom law of the largest singular values of the Jacobi emsemble. Moreover, the circular law, the Marchenko-Pastur law, the central limit theorem, and the laws of large numbers for the spectral norms are also obtained.

   
Copy Right© 2007
School of Mathematical Sciences, Peking University