10 October 2017: Scheduled talks for 'Modern data management' session

Organizers:

Title: Querying and Exploring Big Scientific Data

Abstract

Today's scientific processes heavily depend on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by sophisticated simulations. As database systems have proven inefficient, inadequate, or insufficient to meet the needs of scientific applications, the scientific community typically uses special-purpose legacy software. With the exponential growth of dataset size and complexity, application-specific systems, however, no longer scale to efficiently analyze the relevant parts of their data, thereby slowing down the cycle of analyzing, understanding, and preparing new experiments.

In this talk I will illustrate the problem with a challenging application featuring brain simulation data and will show how the problems from neuroscience translate into interesting data management challenges. Finally, I will also use the example of neuroscience to show how novel data management techniques, in particular spatial indexing and navigation, have enabled today's neuroscientists to simulate a meaningful percentage of the human brain.
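To give a flavor of what spatial indexing buys in this setting, the toy sketch below builds a uniform 3D grid index over points and answers an axis-aligned range query by visiting only nearby cells instead of scanning every point. This is a didactic illustration only, not the indexing scheme presented in the talk; all names (`GridIndex`, `range_query`) are made up for the example.

```python
from collections import defaultdict

class GridIndex:
    """Toy 3D uniform-grid spatial index (illustrative only)."""

    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)  # grid cell -> points in that cell

    def _key(self, p):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(int(c // self.cell) for c in p)

    def insert(self, p):
        self.buckets[self._key(p)].append(p)

    def range_query(self, lo, hi):
        """Return all points inside the axis-aligned box [lo, hi]."""
        klo, khi = self._key(lo), self._key(hi)
        out = []
        # Visit only the cells overlapping the query box.
        for i in range(klo[0], khi[0] + 1):
            for j in range(klo[1], khi[1] + 1):
                for k in range(klo[2], khi[2] + 1):
                    for p in self.buckets.get((i, j, k), []):
                        if all(l <= c <= h for c, l, h in zip(p, lo, hi)):
                            out.append(p)
        return out

idx = GridIndex(cell_size=10.0)
for p in [(1, 2, 3), (12, 2, 3), (55, 60, 70)]:
    idx.insert(p)
print(idx.range_query((0, 0, 0), (20, 10, 10)))  # -> [(1, 2, 3), (12, 2, 3)]
```

The point of the structure is that the far-away point at (55, 60, 70) is never even examined: the query touches only the grid cells that overlap the box, which is what makes range queries over dense simulation data tractable.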

Speaker

Thomas Heinis, PhD, has been a Lecturer in Computing/Data Management at Imperial College London since September 2014, where he leads the SCALE lab. He is currently also a Visiting Professor at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. Dr. Heinis is known for his research and development in large-scale data management systems, including MapReduce, NoSQL, distributed main-memory databases, and parallel databases in general. His research focuses on scaling out big data into the cloud for industrial and scientific (medical) applications. Dr. Heinis received his BSc, MSc, and PhD from the Swiss Federal Institute of Technology in Zurich. During his studies he received several fellowships, including a Fulbright fellowship (Purdue University).

Title: Benchmarking SQL-on-MapReduce Systems

Abstract

In this work, we analyzed the ability of MapReduce-based systems that support SQL to manage (store, load, query) large data sets. We considered different configurations to study several parameters: hardware, data, partitioning, indexing, and selectivity. We analyzed the impact of data partitioning, indexing, and compression on query performance, and motivated the need for new query optimization techniques for these emerging systems. Our experiments show that there is no single best technique to partition, store, and index data: the efficiency of a given technique depends mainly on the type of queries and the characteristics of the data under consideration. Based on this benchmarking work, we propose techniques to be integrated into large-scale data management systems, and we are working on a new approach that supports multiple partitioning mechanisms and several evaluation operators.
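The interplay of partitioning and selectivity mentioned above can be illustrated with a minimal sketch: a table hash-partitioned on one column lets a selective query on that column scan a single partition, while a predicate on any other column still forces a scan of every partition. This is a didactic toy, not the benchmarked systems' implementation; the table layout, column names, and helper functions are invented for the example.

```python
from collections import defaultdict

# A tiny "table" of 1000 rows, partitioned by the 'region' column.
rows = [{"id": i, "region": i % 4, "value": i * 10} for i in range(1000)]

partitions = defaultdict(list)
for r in rows:
    partitions[r["region"]].append(r)  # partition key: region

def query_on_partition_key(region):
    """Predicate on the partition key: pruning keeps one partition."""
    scanned = partitions[region]
    return len(scanned), [r for r in scanned if r["id"] % 100 == 0]

def query_on_other_column(threshold):
    """Predicate on a non-partition column: every partition is read."""
    scanned = [r for p in partitions.values() for r in p]
    return len(scanned), [r for r in scanned if r["value"] > threshold]

n_pruned, _ = query_on_partition_key(2)   # scans 250 rows (one partition)
n_full, _ = query_on_other_column(5000)   # scans all 1000 rows
print(n_pruned, n_full)  # -> 250 1000
```

The 4x difference in rows scanned is exactly the kind of effect the benchmark measures: the same partitioning scheme helps one query shape and does nothing for another, which is why no single layout wins across workloads.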

Speaker

Mohand-Saïd Hacid is a full Professor in Computer Science at the Université Claude Bernard Lyon 1, where he leads the LIRIS laboratory (http://liris.cnrs.fr). His research areas include query languages for information systems, the semantic web, and data security.

Title: Agile data management challenges in enterprise big data landscapes

Abstract

Enterprise data landscapes have evolved to include big data storage systems and massively scalable distributed data processing systems, in addition to the traditional enterprise systems managed by IT organizations, such as enterprise applications, transactional systems, data warehouses, and Business Intelligence platforms. These big data systems offer new opportunities to collect and analyze vast amounts of real-time data, representing events generated by business activity over the Internet or signal data generated by automated data sources (Internet of Things), and to derive real-time analytics using machine learning and data mining techniques. The resulting new “Enterprise Big Data landscape”, however, brings increased complexity: diverse technology stacks, data processing spread within and between stacks, huge data volumes and movement, functional overlap of data processing capabilities, and diverse technical requirements to meet the needs of data scientists, data analysts, and application developers, all of whom have very different skills. A siloed approach separating traditional enterprise data management from big data management is undesirable, because much of the business value is acknowledged to come from the integration of both types of data and the proper management of the entire data processing chain. Finally, IT organizations expect an enterprise level of service for their big data, including end-to-end security, landscape administration, and software lifecycle management. In this talk, I will give a brief overview of the recent SAP Data Hub platform, which unifies the management of both enterprise data and big data, and of its main features.
I will then focus on two key functionalities: the governance at scale of an enterprise big data landscape through a unified view, and the development of a powerful and productive environment for the creation of analytics by the masses, that is, by the largest possible number of business users, data analysts, and data scientists. I will review some of the research challenges that remain to be solved in this area and outline some directions currently pursued by SAP.

Speaker

Eric Simon is currently Chief Scientist in SAP’s Big Data organization. He was previously a chief architect in the Enterprise Information Management division and development lead for advanced metadata management and semantic search facilities within the SAP HANA platform. Prior to SAP, he led the development of Data Federator, a software component that enabled “multi-source universes” within Business Objects’s flagship product BOE Enterprise XI 4.0, a feature eagerly awaited and acclaimed by customers when the product was released in 2010. Eric joined Business Objects in 2005 through the acquisition of Medience, a start-up he co-founded in 2001 and ran as CEO, which specialized in data federation technology with innovative query processing and schema mapping capabilities tailored to the needs of BI applications. Before that, Eric was a tenured research scientist at INRIA, France, where he created the project on a mediation system, called Le Select, whose technology was transferred to Medience. Eric has published research results on various topics, including database integrity and concurrency control, object-oriented programming, deductive databases, query optimization, and data cleaning. He received two best paper awards, at the VLDB and ACM OOPSLA conferences, and is a co-author of several patents at Bell Labs, Medience, Business Objects, and SAP. Eric received his PhD and Habilitation in Computer Science from the University of Paris VI in 1986 and 1992, respectively.