11 October 2017: Scheduled talks for Polystore session

Title: An overview of polystores

Abstract

Building modern data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.), each optimized for one kind of data and tasks. However, the wide diversification of data store interfaces makes it difficult to access and integrate data from multiple data stores. This important problem has motivated the design of a new generation of systems, called polystores, which provide integrated or transparent access to a number of data stores through one or more query languages. In this presentation, we give an overview of polystores, by analyzing some representative multistore systems, based on their architecture, data model, query languages and query processing techniques. To ease comparison, we divide multistore systems based on the level of coupling with the underlying data stores, i.e., loosely-coupled, tightly-coupled and hybrid. Our analysis reveals some important trends, which we discuss. We also identify some major research issues.
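As a rough illustration of what "integrated access to a number of data stores" can look like in the loosely-coupled case, here is a minimal sketch. It is not taken from any system discussed in the talk: the wrapper and mediator functions, the `customers` table and the in-memory document store are all hypothetical, and an in-memory SQLite database and a list of dicts stand in for a real RDBMS and a NoSQL store.

```python
import sqlite3

# Hypothetical document store: a list of dicts stands in for a NoSQL collection.
DOCUMENT_STORE = [
    {"customer_id": 1, "review": "fast delivery"},
    {"customer_id": 2, "review": "damaged box"},
    {"customer_id": 1, "review": "would buy again"},
]


def make_relational_store():
    """Create a hypothetical relational store (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Linus")]
    )
    return conn


def relational_wrapper(conn):
    """Wrapper: query the store in its native language (SQL), return dicts."""
    cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
    return [{"customer_id": cid, "name": name} for cid, name in cursor]


def document_wrapper(store):
    """Wrapper: read the document store through its own (trivial) interface."""
    return [dict(doc) for doc in store]


def mediator_join(left, right, key):
    """Mediator: hash join over the common dict-based data model."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def reviews_with_names():
    """One integrated query spanning both stores."""
    conn = make_relational_store()
    try:
        customers = relational_wrapper(conn)
    finally:
        conn.close()
    return mediator_join(customers, document_wrapper(DOCUMENT_STORE), "customer_id")


if __name__ == "__main__":
    for row in reviews_with_names():
        print(row)
```

In this sketch each store keeps its own interface and the mediator only sees the wrappers' common output format, which is the sense in which the coupling is loose. A tightly-coupled design would instead let the query engine reach into the stores' native interfaces directly.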

Speaker

Patrick Valduriez from Inria and LIRMM, Montpellier, France

Patrick Valduriez is a senior researcher at Inria and LIRMM, University of Montpellier, France. He has also been a professor of Computer Science at University Paris 6 and a researcher at Microelectronics and Computer Technology Corp. in Austin, Texas. He received his Ph.D. degree and Doctorat d'Etat in CS from University Paris 6 in 1981 and 1985, respectively. He is the head of the Zenith team (between Inria and University of Montpellier, LIRMM) that focuses on data management in large-scale distributed and parallel systems (P2P, cluster, grid, cloud), in particular, scientific data management. He has authored and co-authored over 400 technical papers and several textbooks, among which “Principles of Distributed Database Systems”. He currently serves as associate editor of several journals, including the VLDB Journal, Distributed and Parallel Databases, and Internet and Databases. He has served as PC chair of major conferences such as SIGMOD and VLDB. He was the general chair of SIGMOD04, EDBT08 and VLDB09. He obtained the best paper award at VLDB00. He was the recipient of the 1993 IBM scientific prize in Computer Science in France and the 2014 Innovation Award from Inria – French Academy of Science – Dassault Systèmes. He is an ACM Fellow.

Title: Structured Streaming in Apache Spark: Easy, Fault Tolerant and Scalable Stream Processing

Abstract

Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data, and ensuring end-to-end exactly-once fault-tolerance guarantees.

Since Spark 2.0, we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Spark 2.2 removes the "Experimental" tag from all Structured Streaming APIs.
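The idea of turning a batch-like query into an incremental execution plan can be illustrated with a toy sketch in plain Python. This is not the Spark API and says nothing about how Spark implements it: `batch_word_count`, `IncrementalWordCount` and `run_stream` are hypothetical names, and the sketch ignores late data and fault tolerance. It only shows that a query written once against static data can be applied to micro-batches with carried-over state and give the same answer.

```python
from collections import Counter


def batch_word_count(lines):
    """The 'batch query': count words over a static collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IncrementalWordCount:
    """Hypothetical incremental version of the same query.

    State (the running counts) is kept between micro-batches, so each
    new batch is processed once and merged into the previous result.
    """

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        # Reuse the batch logic on the new data only, then merge into state.
        self.state.update(batch_word_count(micro_batch))
        return Counter(self.state)


def run_stream(micro_batches):
    """Feed micro-batches through the incremental query, return the last result."""
    query = IncrementalWordCount()
    result = Counter()
    for micro_batch in micro_batches:
        result = query.process(micro_batch)
    return result


if __name__ == "__main__":
    batches = [["a b", "b c"], ["c c a"]]
    print(run_stream(batches))
```

Running the batch function over all lines at once and running the incremental version over the same lines split into micro-batches should produce identical counts, which is the property the abstract describes at a much larger scale.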

Speaker

Juliusz Sompolski from Databricks

Juliusz Sompolski is a software engineer working at the Databricks engineering office in Amsterdam. He graduated with an MSc from the University of Warsaw and Vrije Universiteit Amsterdam in 2011 and has been working with database engines and data processing frameworks since then.

Abstract

Apache Flink is an open-source system for expressive, declarative, fast, and efficient data analysis on both batch and streaming data. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. At its core, Flink builds on a distributed dataflow runtime that unifies batch and incremental computations over a true-streaming, pipelined execution engine. Its programming model allows for stateful, fault tolerant computations, flexible user-defined windowing semantics for streaming and unique support for iterations. Flink is converging into a use-case complete system for parallel data processing with a wide range of top level libraries including machine learning and graph processing. Apache Flink originates from the Stratosphere project led by TU Berlin.

Speaker

Tilmann Rabl from Apache Flink

Tilmann Rabl is a Visiting Professor at the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin. At DIMA he is research director and technical coordinator of the Berlin Big Data Center (BBDC). Tilmann received his PhD at the University of Passau in 2011. He spent 4 years at the University of Toronto as a postdoc in the Middleware Systems Research Group (MSRG). In his PhD thesis, Tilmann invented the Parallel Data Generation Framework (PDGF), for which he received the Transaction Processing Performance Council’s (TPC) Technical Contribution Award. In Toronto, he received a MITACS Award in 2013 and 2014 and an IBM CAS postdoctoral fellowship in 2013 and 2014. He is a professional affiliate of the TPC and co-founder and chair of the SPEC Research working group on big data. Tilmann is a member of the steering committee of the Workshop on Big Data Benchmarking (WBDB) series and a member of the board of directors of the BigData Top100 List. Tilmann is also CEO and co-founder of the startup bankmark, which has received an EXIST award, the IKT Innovativ Award 2014, and the Weconomy Award 2015, among others.

Title: Big Data Analytics over Operational Data with LeanXcale

Abstract

This talk covers how to perform big data analytics over operational data without copying and/or streaming the data. It presents the LeanXcale approach to multi-datastores, in which the LeanXcale OLAP query engine has been integrated with multiple data stores such as MongoDB, Neo4J, HBase and Hadoop data lakes, making it possible to run queries across its own relational data and the data stored in all these different data stores. LeanXcale's differentiation comes from two innovations: the ability to query the operational data, and the ability to combine the native query languages/APIs of the underlying NoSQL data stores with the simplicity of SQL.

Speaker

Ricardo Jimenez-Peris from LeanXcale

Dr. Ricardo Jimenez-Peris is CEO & Co-Founder of LeanXcale. Previously, he was the Director of the Distributed Systems Lab (LSD) at TU Madrid (UPM), having researched scalable transactional and data management systems for over 25 years. He has been an invited speaker at top tech companies in Silicon Valley such as Twitter, Salesforce, EMC-Heroku, Cloudera, Microsoft, HortonWorks, HP, Greenplum (now Pivotal), etc. He is co-inventor on two patents and co-author of a book and 100+ papers in international journals and conferences. He is currently technical coordinator of the CloudDBAppliance H2020 project and had the same role in the CoherentPaaS, LeanBigData, CumuloNimbo and Stream EU projects.