10 October 2017: Scheduled talks for 'Earth, Space and Life Science' session
Peter Baumann, Jacobs University: Professor and head of the Large-Scale Scientific Information Systems research group.
Romulo Goncalves, Netherlands eScience Center: Expert in databases, data structures, and distributed computing.
- Introduction, Peter Baumann
Title: Big Earth Data today: Challenges, Approaches, and Standards
With the unprecedented increase in orbital sensor, in-situ measurement, and simulation data, there is a rich, yet largely untapped, potential for gaining insights by dissecting datasets and rejoining them with other datasets. The goal of providing "analysis ready data" is to allow users to "ask any question, any time", thereby enabling them to "build their own product on the go", to cite some key phrases of the Earth sciences community.
The "Big Data" in the Earth sciences stem from observations (such as satellite imagery) and simulations (such as weather forecasts). Once sampled and digitized, they typically form a (regular or irregular) grid in n dimensions: 1D sensor timeseries, 2D imagery, 3D x/y/t image timeseries and x/y/z geophysical voxel data, 4D x/y/z/t climate and ocean data, etc. Recently, the concept of "datacubes" has been emerging in the community, providing single logical objects instead of zillions of files which are hard to find and understand.
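The dimensionality examples above can be illustrated with a small sketch, using NumPy arrays as a stand-in for a datacube (all shapes and values here are illustrative assumptions, not tied to any particular system):

```python
import numpy as np

# Illustrative 4D datacube: x, y, z, t (e.g., ocean temperature over a year)
cube = np.random.default_rng(0).normal(15.0, 2.0, size=(20, 30, 10, 12))

# Slicing fixes a coordinate and drops that axis: a 3D x/y/t field at the surface
surface_series = cube[:, :, 0, :]          # shape (20, 30, 12)

# Trimming keeps an axis but restricts its extent: first quarter of the year
first_quarter = cube[:, :, :, 0:3]         # shape (20, 30, 10, 3)

# Aggregation collapses an axis entirely: time-averaged value per voxel
annual_mean = cube.mean(axis=3)            # shape (20, 30, 10)

print(surface_series.shape, first_quarter.shape, annual_mean.shape)
```

The point of the datacube model is that such slice, trim, and aggregate operations become queries against one logical object rather than bookkeeping over thousands of files.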
In our talk we look at these "Big Earth Data" from a database perspective, addressing conceptual as well as architectural aspects and also standardization in the field. We will look at real-life projects and operational databases, using as the running example the rasdaman Array DBMS, which has been chosen as the official Reference Implementation for "Earth Datacubes" by the OGC and INSPIRE standardization bodies.
Peter Baumann, Jacobs University, Bremen, Germany
Dr. Peter Baumann is Professor of Computer Science, inventor, and entrepreneur. At Jacobs University, Bremen, Germany, he researches scalable multi-dimensional array databases ("datacubes") and their application in science and engineering. With his work on algebra, query languages, and efficient architectures, culminating in the rasdaman array DBMS, he has coined the research field of array databases. He has published 130+ book chapters and journal and conference articles, holds international patents on array database technology, and has received numerous international innovation awards for his work. The rasdaman technology is in operational use on Petascale spatio-temporal databases. In 2014, rasdaman won the Big Data Challenge posed by T-Systems as part of the Copernicus Masters competition; in 2016, it was ranked a top-100 Big Data technology by the US magazine CIO Review.
Peter Baumann is an active, often leading, contributor to standardization. He has initiated and is editor of the forthcoming ISO SQL/MDA ("Multi-Dimensional Arrays") standard. In the Open Geospatial Consortium (OGC) he chairs the Big Earth Data working groups. In OGC and ISO TC211 he is editor of the "Big Earth Datacube" standards CIS, WCS, and WCPS, which have also been adopted by INSPIRE, the European common Spatial Data Infrastructure. OGC has honored his contribution to Big Data standardization with the prestigious Kenneth Gardels Award for "significant and enduring advances in technical standards".
Title: High-cadence Astronomy: Challenges and Applications for Big Databases
Optical and radio telescopes planned for the near future, e.g., LSST, SKA, and BlackGEM, will generate vast data streams to meet their scientific goals: high-speed all-sky surveys, searches for rapid transient and variable sources, and cataloguing of many millions of sources with thousands of measurements each. These high-cadence instruments challenge many aspects of contemporary data management systems. How do we keep pace and store these petabytes of scientific data, and how do we query them scientifically with acceptable response times?

In this talk I will highlight how database technologies can contribute to the new field of high-cadence astronomy, in particular for the new telescope MeerLICHT. The MeerLICHT optical telescope (the predecessor of the BlackGEM telescopes, designed to follow up on gravitational-wave detections) is such a high-cadence instrument. It adopts modern database technologies as key components in its automated full-source pipeline. The instrument's cadence time is 1 minute; at this pace all detected sources need to be added to and cross-matched with existing sources inside the database. This archive is then opened for scientific analysis of the millions of catalogued sources and their light curves, i.e., time series.

Column-oriented main-memory database technologies have many advantages in different Big Data science domains. Extending database functionality with UDFs written in SQL, Python, or R allows statistical queries to run deep inside the database without pumping data around, and support for SQL management of external data makes loading of binary data, e.g., FITS files, extremely fast. I will demonstrate how these techniques are applied in the MeerLICHT pipeline environment, show examples and promising performance results of such a full-source database pipeline, and compare stand-alone, partitioned, and distributed database configurations, also in the context of LSST data.
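As a rough illustration of the cross-matching step described above, here is a minimal flat-sky sketch in plain Python (the function, coordinates, and matching radius are hypothetical; a real pipeline would do this inside the database, with proper spherical geometry and spatial indexing):

```python
import math

# Known catalog sources: (id, ra_deg, dec_deg). Values are illustrative.
catalog = [(1, 150.001, 2.200), (2, 150.050, 2.210), (3, 149.900, 2.190)]

def crossmatch(detections, catalog, radius_arcsec=1.0):
    """Associate each new detection (ra, dec) with the nearest catalog
    source within `radius_arcsec`, or flag it as new (None).
    Small-angle, flat-sky approximation for illustration only."""
    matches = []
    for ra, dec in detections:
        best, best_sep = None, radius_arcsec
        for src_id, cra, cdec in catalog:
            # Angular separation in arcsec, with the RA axis compressed
            # by cos(dec) as on the celestial sphere
            dra = (ra - cra) * math.cos(math.radians(dec)) * 3600.0
            ddec = (dec - cdec) * 3600.0
            sep = math.hypot(dra, ddec)
            if sep <= best_sep:
                best, best_sep = src_id, sep
        matches.append((ra, dec, best))
    return matches

# One detection close to source 1, one with no counterpart (a new source)
print(crossmatch([(150.0011, 2.2001), (150.3000, 2.3000)], catalog))
```

In the full-source pipeline this association runs every cadence interval, so doing it as a query next to the data, rather than exporting catalogs, is what keeps the one-minute budget feasible.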
Dr. Bart Scheers is a researcher in the Database Architectures Group at Centrum Wiskunde & Informatica (CWI, the Netherlands national research institute for mathematics and computer science). He has a background in astrophysics and worked on the design and implementation of the LOFAR transients detection framework. His research interests and expertise include transient and variable sources of astrophysical origin, statistical query processing of large data volumes, distributed, sharded and layered astronomical databases.
Title: Simulating the Universe with Eagle and SWIFT
The aim of the Eagle project is to recreate the Universe in simulation, providing a laboratory for the formation of galaxies, the visible building blocks of the Universe. Our approach involves a clear calibration strategy, with hundreds of smaller-scale simulations that necessitated an automated analysis pipeline. The simulation programme required over 40 M-CPU-hr to develop the code and calibrate the sub-grid parameters. Including these calibration runs, the total data set is 0.5 PB (after compression), with 400 simulation outputs sampling the particle distribution every 100 million years. A major challenge is the next-generation simulation scheduled to run on the CSCS Piz Daint system: improvements in code speed from the SWIFT simulation code will allow us to simulate a volume 15 times larger. However, storing such simulation output, even temporarily, becomes impossible, and I will discuss how we are developing particle streaming and on-the-fly analysis solutions to deal with this data avalanche.
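One way on-the-fly analysis of this kind avoids storing snapshots is to reduce particle properties incrementally as they stream past. A minimal sketch, using Welford's running-variance algorithm as the example reduction (all names and values here are illustrative assumptions, not the SWIFT implementation):

```python
def running_stats(stream):
    """Numerically stable running mean and sample variance (Welford's
    algorithm), using O(1) memory regardless of stream length."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

def particle_densities(n_particles):
    """Generator standing in for particles streamed out of the
    simulation; no intermediate output file is ever written."""
    for i in range(n_particles):
        yield 1.0 + (i % 10) * 0.1   # illustrative values

n, mean, var = running_stats(particle_densities(1000))
print(n, mean, var)
```

The same pattern generalizes to histograms, power spectra, or halo statistics: each analysis consumes the particle stream once, so the full snapshot never needs to exist on disk.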
Richard Bower, Professor of Cosmology at the Institute for Computational Cosmology (ICC), Durham University
Title: Neuroscience Data Integration: Big Data Challenges in Brain Atlasing
The Human Brain Project (HBP), an EU Flagship Initiative, is currently building an infrastructure to integrate neuroscience data, to develop a unified multi-level understanding of the brain and its diseases, and ultimately to emulate its computational capabilities. Brain atlases are one of the key components in this infrastructure. With a new generation of three-dimensional digital brain atlases, new solutions for analyzing and integrating brain data are being developed. In this talk, I will discuss the use of digital brain atlases as a basis for building services similar to current online geographical atlases, such as Google Maps and Google Earth, providing interactive access to huge amounts of high-resolution image data, together with additional information and detailed visualizations. With a set of tools to interact with the atlases currently being developed, neuroscience research groups can connect their data to atlas space, share the data through online data systems, and search and find other relevant data through the same systems. This new approach partly replaces earlier attempts to organize research data based only on a set of semantic terminologies describing the brain and its subdivisions.
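The Google Maps analogy can be made concrete with tile pyramids, the standard technique behind interactive viewers of very large images: the image is precomputed at halving resolutions, and each level is cut into fixed-size tiles fetched on demand. A minimal sketch, assuming 256-pixel tiles and power-of-two levels (all names and numbers are illustrative, not any atlas service's actual API):

```python
TILE = 256  # pixels per tile edge (illustrative assumption)

def tiles_needed(width, height, level, max_level):
    """Tile columns/rows at a pyramid level, where max_level is full
    resolution and each lower level halves the image in each dimension."""
    scale = 2 ** (max_level - level)
    w, h = -(-width // scale), -(-height // scale)   # ceiling division
    return -(-w // TILE), -(-h // TILE)

def tile_for_pixel(x, y, level, max_level):
    """Tile (column, row) containing full-resolution pixel (x, y)."""
    scale = 2 ** (max_level - level)
    return (x // scale) // TILE, (y // scale) // TILE

# A hypothetical 100k x 80k pixel microscopy section, 10 pyramid levels:
print(tiles_needed(100_000, 80_000, level=9, max_level=9))  # full resolution
print(tiles_needed(100_000, 80_000, level=0, max_level=9))  # thumbnail
```

Because a viewer only ever requests the handful of tiles covering the current viewport at the current zoom level, interactive browsing stays fast no matter how large the underlying section images are.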
Jan Bjaalie is Professor of Neuroscience at the University of Oslo and leader of the Neuroinformatics Platform, a Sub-Project of the Human Brain Project. He is coordinator of the Norwegian Node of Neuroinformatics, a member of the Council for Training, Science, and Infrastructure of the International Neuroinformatics Coordinating Facility (INCF), and has served as founding Executive Director of the INCF. His group develops data systems for gathering, organizing, analyzing, and disseminating image data from microscopes and scanners, with tools for assigning metadata and for registering images to standardized atlas space for the mouse and rat brain.