Fundamentals of extracting information from data, machine learning, pattern recognition, fundamentals of databases and high-performance computing, as well as applications of data science to real problems.
Machine learning algorithms for making inferences on networks and answering questions in biology and medicine
An important idea that has emerged recently is that a cell can be viewed as a set of complex networks of interacting bio-molecules, and that genetic disease is the result of abnormal interactions within these networks. From a computational point of view, a central objective for systems biology and medicine is therefore the development of methods for making inferences and discovering structure in biological networks. In this course, I will introduce important concepts and algorithms from network science and machine learning – particularly semi-supervised learning, unsupervised learning and recommender systems – for the analysis and inference of large-scale networks. I will then show how these ideas have been exploited by recent machine learning algorithms for solving a variety of problems in systems biology and medicine, including the prediction of protein function, the prediction of human protein complexes, the prediction of disease genes for hereditary disease, and the prediction of drug side effects. This last algorithm, which uses ideas from recommender systems that had been used earlier for recommending movies on Netflix, is the first that can predict the frequency of drug side effects in the population.
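As a rough illustration of semi-supervised inference on a network, the sketch below spreads a known annotation from one labelled node to its unlabelled neighbours by iterated neighbour averaging with restart (a simple form of label propagation). This is not the algorithm taught in the course: the toy graph, the node names `P1` to `P5`, the function name `propagate_labels` and the parameters `alpha` and `n_iter` are all assumptions made for the example.

```python
def propagate_labels(adjacency, seeds, n_iter=50, alpha=0.8):
    """Diffuse seed scores over an undirected graph (illustrative sketch).

    adjacency: dict mapping each node to a list of its neighbours
    seeds:     dict mapping labelled nodes to an initial score (e.g. 1.0)
    alpha:     weight given to the neighbours' average versus the seed value
    """
    scores = {node: seeds.get(node, 0.0) for node in adjacency}
    for _ in range(n_iter):
        updated = {}
        for node, neighbours in adjacency.items():
            if neighbours:
                neighbour_avg = sum(scores[m] for m in neighbours) / len(neighbours)
            else:
                neighbour_avg = 0.0
            # Blend information from the neighbours with the node's own seed label.
            updated[node] = alpha * neighbour_avg + (1 - alpha) * seeds.get(node, 0.0)
        scores = updated
    return scores


# Hypothetical toy network: a chain of five proteins, only P1 is annotated.
toy_network = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P4"],
    "P4": ["P3", "P5"],
    "P5": ["P4"],
}
toy_scores = propagate_labels(toy_network, seeds={"P1": 1.0})
for protein in sorted(toy_scores):
    print(protein, round(toy_scores[protein], 3))
```

In this toy chain, nodes closer to the annotated protein end up with higher scores, which is the intuition behind ranking unlabelled proteins by network proximity to labelled ones.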
Data has been the fastest-growing phenomenon on the Internet for the last decade. Big data demands both high-performance computing and elastic, utility-driven computing. Big data analytics holds the potential to reveal deep insights, such as social influence among customers, by analyzing business transactions, user-generated feedback ratings, and social and geographical data. In the past 40 years, data was primarily used to record and report business activities and scientific events; in the next 40 years, data will also be used to derive new insights, to influence business decisions and to accelerate scientific discovery. One of the key challenges is to provide the right platforms and tools to make reasoning about big data easy and simple. Another key challenge is to revolutionize the ways of collecting, processing and analyzing massive data that exceeds the processing capacity of existing computing systems. Big data education should cover big data systems, algorithms, technology, programming and applications from both research and development perspectives. This short course reviews concepts, techniques, algorithms and systems issues in big data education and research, with strong emphasis on systems and analytics. It explores big data opportunities from a variety of science and engineering applications, and examines various research problems and challenges that are critical for developing big data systems and big data applications. Main topics to be covered include, but are not limited to: fundamentals of data storage systems and optimizations, fundamentals of data mining and knowledge discovery, fundamentals of big-data-aware computing systems and software design, fundamentals of cluster computing and distributed file systems, and fundamentals of geographically distributed data-intensive systems.
We will also cover big data applications that pose new challenges to big data systems and analytics, such as healthcare, mobile commerce, social media, the Internet of Things, software-defined computing, cyber manufacturing and cyber-physical systems, to name a few. This short course is designed to provide fundamental training for big data scientists, from high-performance big data computing systems to big data applications and big data analysis and management algorithms, and to look beyond the present status of big data and conjecture what future technologies and applications may evolve.
We offer Data Science for Social Good (DSSG) projects that are designed to bring together undergraduate and graduate students from all academic backgrounds to work on interdisciplinary, data-rich research projects that have the potential to benefit society. In the first half of this course, we will give an overview of the DSSG program and present three projects in greater detail. The first one is a “smart city” project on scraping the internet to uncover the hidden universe of secondary rental units in the City of Surrey. The second one is on analyzing riders’ satisfaction with public transit. The final one is a public health project on automated classification of laboratory test results for disease control. In the second half of the course, we will focus on one set of skills that is common to all three projects: natural language processing. We will give an overview of methods for feature extraction, topic modeling and sentiment analysis.
While many sophisticated models have been developed for visual information processing, very few pay attention to their usability in the presence of data quality degradations. Most successful models are trained and evaluated on high-quality visual datasets. In practical scenarios, however, the data source often cannot be assured of high quality. For example, video surveillance systems have to rely on cameras of very limited resolution, due to the prohibitive cost of installing high-definition cameras everywhere, leading to the practical need to recognize objects reliably from very low-resolution images. Other quality factors, such as occlusion, motion blur, missing data and bad weather conditions, are also ubiquitous in the wild. The short course will present a comprehensive and in-depth review of recent advances in the robust sensing, processing and understanding of low-quality visual data using deep learning methods. I will mainly show how image/video restoration and visual recognition can be jointly optimized as one pipeline. Such end-to-end optimization consistently achieves superior performance over traditional multi-stage pipelines. I will also demonstrate how the deep learning approach substantially improves a number of real-world applications.
Cloud computing providers often run their workloads on warehouse-scale data centers. These infrastructures are not only huge, but also highly heterogeneous, since there is a constant flow of new machines being added and old ones being retired. Moreover, the workloads that they run are also heterogeneous, ranging from long-lived user-facing services to the short-lived batch jobs typically used in data science activities. Not surprisingly, scheduling tasks in these systems is far from trivial. In this short course, we will start by understanding the main challenges involved in the scheduling of time-varying heterogeneous workloads in warehouse-scale data centers. We will take a deeper look at how these large data centers are structured and what typical workloads they execute. Then, we will describe different schedulers that have been proposed in the literature, paying special attention to the schedulers that are currently used by large cloud providers. We conclude the course with a discussion on innovative ways to increase the efficiency of such schedulers.
The exponential growth of digital data sources enabled by the IoT, coupled with the ubiquity of non-trivial computational power at the edges, in the core and in between for processing this data, has the potential to fundamentally transform our ability to understand and manage our lives and our environment. However, despite tremendous advances in technology, this vision remains largely unrealized: while our capacity for generating data is expanding dramatically, our ability to manage, manipulate and analyze this data, to transform it into knowledge and understanding in a timely manner, and to integrate it with practice has not kept pace. In this short course, I will explore computing in the continuum, a paradigm that opportunistically leverages loosely connected resources and services to process data in-situ and in-transit, in order to extract timely, actionable insights. Using examples from our work as part of the CometCloud project, I will present research challenges and some initial solutions towards realizing this paradigm.
High-fidelity computational science and engineering (CSE) applications on high-performance computers are costly. They often involve the selection of many computational parameters and options. Setting up such parameters can be a cumbersome task, and there is no guarantee that they will lead to a successful simulation. Usually, this is a trial-and-error process, even for experienced users. The regular procedure is to track some quantities of interest from output files at runtime and, whenever possible, to halt the computation, either using checkpoint/restart procedures to resume with a new set of parameters or resubmitting the job to the queue. For large-scale problems, this computational environment involves saving a considerable amount of raw data in persistent storage. In this short course, I will cover the following issues: (i) preprocessing and mesh generation; (ii) time stepping, saving data on disk when required; and (iii) post-processing, typically visualizing the data generated by the simulation and extracting relevant information on the quantities of interest.
Error estimation is a broad and poorly understood topic that reaches all research areas using pattern classification. It includes model-based approaches and discussions of newer error estimators such as bolstered and Bayesian estimators. This subject is motivated by the application of pattern recognition to high-throughput data with limited replicates, which is a basic problem now appearing in many areas. This short course covers basic issues in classification error estimation, such as definitions, test-set error estimation, and training-set error estimation, as well as the performance and representation of training-set error estimators for various pattern classifiers. Other topics that will be covered are the latest results on the accuracy of error estimation and the performance analysis of resubstitution, cross-validation, and bootstrap error estimators using analytical and simulation approaches.
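To make the three training-set estimators named above concrete, here is a minimal sketch that computes resubstitution, leave-one-out cross-validation and a basic bootstrap (zero bootstrap, i.e. out-of-sample) error for a one-dimensional nearest-mean classifier. The classifier, the toy sample and every function name are assumptions chosen for brevity, not material from the course.

```python
import random


def train_means(xs, ys):
    """Fit a nearest-mean classifier: one mean per class label present in ys."""
    means = {}
    for label in set(ys):
        points = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(points) / len(points)
    return means


def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))


def resubstitution_error(xs, ys):
    """Error measured on the same sample used for training (optimistically biased)."""
    means = train_means(xs, ys)
    return sum(predict(means, x) != y for x, y in zip(xs, ys)) / len(xs)


def leave_one_out_error(xs, ys):
    """Cross-validation with one held-out point per fold."""
    wrong = 0
    for i in range(len(xs)):
        means = train_means(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += predict(means, xs[i]) != ys[i]
    return wrong / len(xs)


def bootstrap_zero_error(xs, ys, n_boot=200, seed=0):
    """Train on resamples drawn with replacement, test on the points left out."""
    rng = random.Random(seed)
    n = len(xs)
    wrong = total = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        held_out = [i for i in range(n) if i not in chosen]
        boot_y = [ys[i] for i in idx]
        if not held_out or len(set(boot_y)) < 2:
            continue  # skip degenerate resamples
        means = train_means([xs[i] for i in idx], boot_y)
        wrong += sum(predict(means, xs[i]) != ys[i] for i in held_out)
        total += len(held_out)
    if total == 0:
        raise ValueError("no usable bootstrap resamples")
    return wrong / total


# Hypothetical toy sample with two overlapping classes (labels 0 and 1).
sample_x = [0.0, 1.0, 1.4, 1.6, 2.0, 3.0]
sample_y = [0, 0, 1, 0, 1, 1]
print("resubstitution:", resubstitution_error(sample_x, sample_y))
print("leave-one-out: ", leave_one_out_error(sample_x, sample_y))
print("bootstrap zero:", bootstrap_zero_error(sample_x, sample_y))
```

With small samples such as this one the three estimates can differ noticeably, which is the situation that motivates comparing their bias and variance.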
Next generation sequencing (NGS) data can provide an effective approach for identifying causal variants in Mendelian disease genes. In this regard, whole exome sequencing (WES) focuses on the most informative regions of the genome by scanning thousands of genes simultaneously. Though such data can be crucial for investigating the molecular basis of genetic disorders in research and clinical diagnostic settings, the number of variants found can be overwhelming, and thus there must be a strategy – manual, automatic or both – to get the most out of the data and identify the underlying pathogenic or likely pathogenic variants. This short course presents the fundamentals of processing raw NGS files, as well as identifying and interpreting common sequence variations in the human genome known as single nucleotide polymorphisms (SNPs). The goal is to gain an understanding of genetic pathogenicity mechanisms through automatic and manual data analyses.
This short course will present an overview of recently developed methods for causal inference in Economics and Social Sciences. Emphasis will be given to methods that exploit large datasets and data-driven model selection techniques.
In this era of large sky surveys and massive datasets, it often happens that intermediate data products developed for a particular purpose can be used for a variety of other projects. In this short course, a collection of problems and techniques related to big data sets in astronomy will be presented and discussed.