The ever-increasing size and complexity of modern datasets in science and technology pose many challenges for statistical inference, while classical statistical analysis has lagged behind. Among these challenges are fundamental questions related to the interplay and tradeoffs between statistical and computational requirements, which emerge due to the presence of complex underlying combinatorial structures, as is in contemporary applications, such as neuroscience, functional genomics, and social networks. In classical statistical inference, a primary goal is to characterize how much data is needed for various inference tasks with little restrictions on computational aspects. Accordingly, for a century, the focus has been on information-theoretic limits. However, the establishment of ``big-data" as the new norm is causing a conceptual shift in statistics: computational power is the new bottleneck, not the lack of observations.
There appears to be a tradeoff: the amount of data needed by computationally efficient algorithms may be significantly higher than what is needed by computationally infeasible algorithms. Despite the recent progress in understanding the statistical-computational tradeoffs exhibited in several contemporary inference problems with planted/hidden structures, this tradeoff is still poorly understood. Integrating both statistical and computational perspectives, the aim in this project is to advance the understanding of the statistical and algorithmic limits of several contemporary unsupervised inference problems, including detection/recovery of general structures planted in random graphs and matrices, and interactive learning of network properties.
Statistically Solvable & Computationally Hard
Typical Phase Diagram
Political blogosphere [AG'05]
Protein-Protein Interaction [PHJ'06]
Stochastic Block Model [HLL'83]
Planted Clique [Jerrum'92]
Misinformation Propagation & Privacy in Social Systems
Modern social media platforms play an important role in facilitating rapid dissemination of information through massive user networks, and have surpassed traditional news outlets to become the main source of news for most people in the United States. The ease of posting and sharing news, coupled with recent advances in AI technology, also facilitate the propagation of rumors, misinformation, and even maliciously fake information, sometimes with major political and societal ramifications. It is therefore of much interest to develop efficient techniques for combating such phenomena, with as little impact on user privacy as possible. In fact, recommendation algorithms that are typically used by Google, Facebook, etc., prioritize viral/"hot" content that received a considerable amount of attention (e.g., posts that were liked or shared), independently of its credibility; therefore they are accelerate the propagation of misinformation. Current efforts to combat online misinformation fall broadly into one of three categories: content control, transparency, and punishment. Accordingly, existing solutions attempt actively intercept either fake data (which is technically or morally difficult) or the users spreading it (which is privacy-invading). These solutions are hence focused on the problem symptoms rather than on the systemic issues that allow misinformation to quickly spread in the first place. We suggest flipping this conventional wisdom on its head: Using novel statistical modelling techniques, we strive to characterize elements of network structure and management that facilitate and precipitate the dissemination of false information. We shall then exploit the resulting insights in order to suggest new methodologies that could inherently curb such phenomena, in a privacy-respecting fashion.
Misinformation (yellow/brown) spreads within the healthy (blue) Twittersphere network, provided by Hoaxy [SCFM'16]
Clustering is one of the most important and basic tasks in unsupervised learning and data mining. It refers to the process of grouping a set of elements such that that elements in the same cluster are more "similar" to each other than to those in other clusters. Clustering imposes many fundamental challenges. First, is the problem of under-specificity; in the absence of a domain knowledge the solution of choice varies between different potential applications significantly. Second, is the lack of a benchmark to validate the clustering output. Finally, solving most of the common optimization formulations of clustering (such as, K-means) is NP-hard in general. To overcome these inherent difficulties, recently, semi-supervised models of clustering that allow active querying during data segmentation have become quite popular. This includes active learning, as well as data labeling by amateurs via crowdsourcing. Clever implementation of interactive querying framework can improve the accuracy of clustering and help in inferring labels of large amount of data by issuing only a small number of queries. Interactions can easily be implemented (e.g., via captcha), especially if queries involve few data points, like pairwise queries of whether two points belong to the same cluster or not. Our aim in this project is to study interactive clustering algorithms from several new directions. We would like to develop clustering models that capture the properties of real networks. for example by taking into account noisy and adversarial interactions, different similarity measures, and various random clusterability models. In particular, the questions that interest me revolve around the computational, statistical, and query complexity trade-offs, as well as developing algorithms for approximate clustering.