Specific issues on learning from data, information retrieval, data science, open science and big data.
Despite the great interest of academia and industry in Data Science, it is widely recognized that no less than 80% of the time and effort in projects in this area is spent on tasks associated with preparing the data to be analyzed. In fact, tasks such as data collection, extraction, deduplication, and integration, while crucial to the process, have little to do with the activities typically associated with Data Scientists, such as analysis, pattern mining, and model generation. Now that new concerns such as ethics, legal compliance, scientific reproducibility, data quality, and algorithmic bias are emerging decisively, this effort tends to increase even more. In this talk, I want to discuss how typical Data Engineering methods, techniques, and tools can help reduce this effort, so that the magic promised by Data Science will not fail to happen because of the absurdity of raw data. As a concrete example, I will present some recent results of my research on methods to enable sentiment analysis in opinionated texts written by users.
Learning From Data is the descriptive term for Machine Learning, Data Science, Data-Driven Discovery, and the bulk of Artificial Intelligence. The field has undergone steady growth since the 1980s, culminating in a phenomenal revolution over the past 5 years. At this point, the field is the most important scientific front among all disciplines, and its impact reaches all facets of life. So, what happened? It is important to understand in specific terms what was achieved and how it was achieved. Part of this is well understood, and part of it remains a mystery even among specialists in the field. In this lecture, I will cover this topic in detail, including an assessment of where we are and what challenges and opportunities lie ahead.
The production of machine learning models shares a common life-cycle involving data pre-processing, training, validation, and publication. Recently, to help data scientists focus on the problem they are solving, a series of systems have been implemented, each covering some aspect of the complete life-cycle, such as SystemML, ModelDB, ModelHUB, DataHub, TFX, etc. To deploy an environment that supports the full ML life-cycle, one must understand the role of each module and the challenges of using them in an integrated fashion. In this talk, I will present the ML life-cycle and the requirements for these component systems. Next, I will briefly discuss the functionality of each system and the problems that arise when putting them together.
The term “Open Science” refers to the wide dissemination of all material associated with scientific discovery, thereby contributing to the advancement of science. Examples of such material include specifications of experiments and equipment, documents, methods, algorithms, and all kinds of data, in particular digital data. Data, here, should be considered in a broader context, including a wide variety of digital content, such as code, models, video, sound, spreadsheets, and executable workflows. The adequate sharing of such data has become a key issue in the Open Science movement, and is now considered part of good scientific practice. Data sharing demands appropriate planning of the data lifecycle, from data collection to storage and preservation. This, in turn, poses many scientific challenges, for computer scientists but also for the scientists whose data have to be managed. Here, data volume is just one issue to be tackled; data heterogeneity, quality and curation, preservation, and retrieval are also of foremost importance, thus opening new research fronts for data scientists. This lecture will discuss some of the challenges in Research Data Management, in particular those posed by the Open Science movement, which requires that data be shareable and accessible while at the same time preserving integrity and privacy.
Advances in computing technology and the development of remote sensing devices have led to a huge increase in data sets (from both observational analysis and high-resolution model simulations). This lecture will cover both aspects of the data revolution in Atmospheric Sciences and how big data is used in projects involving weather, climate, and societal applications.