Reduction of Large-Scale Scientific Data with Topological Data Analysis and Neural Networks


  • Julien Tierny, Research Director, UMR 7606, LIP6
  • Joshua Levine, Computer Science, UArizona


Our computational science capabilities have created a recent explosion of data that is rapidly changing how the scientific community investigates complex phenomena. In particular, we now leverage high-performance computing resources to run simulations across a wide variety of applications. The challenges of analyzing the resulting scientific data are particularly severe because the data are large and unwieldy. Over the last two decades, pioneering work in topological data analysis (TDA) has achieved extremely promising results for analyzing individual datasets. TDA’s main strengths are that (1) it is supported by a theoretical background (from computational topology) showing that it is both mathematically robust and efficiently computable, (2) it produces not only a summary set of features from data, but also the ability to rank and filter these features, and (3) topological features are often straightforward to visualize. Moreover, recent implementations (including work by the PIs [58] as well as others) have translated these properties into practical analysis frameworks, helping to deliver on these promises. That said, there are still significant gaps that require foundational research in TDA. The next generation of computational tools promises to enable scientists to imagine and test new hypotheses and theories, provided the techniques to analyze the data they produce can keep pace.

This project will tackle the two primary issues that limit the scalability of TDA. The first is that TDA for multi-fields is still in its infancy. The second, more critical issue is that TDA implementations often focus on spatial rather than spatio-temporal data. This project takes a unique approach that relies on the best parts of TDA while coupling it with modern machine learning (ML) tools, such as neural networks, to bridge the gap. The primary goal is thus to develop the next generation of coupled ML+TDA tools that are amenable to time-varying data analysis. If successful, this would bring the power of current TDA approaches to bear on significantly more complex data. In turn, by being able to analyze and visualize this complex data, we can push the envelope for computational science, making good on the promise of not only modeling complex physical phenomena but also making sense of them. This would lead to new fundamental discoveries in Computer Science (the core area of both PIs) while also benefiting a wide variety of application domains where TDA is already in use within our home institutions and beyond.

This project builds on a 12-year collaboration and will deepen it, while exposing the project’s Ph.D. students to multiple environments and perspectives and training them in an interdisciplinary way, helping them create an international collaborative network at the start of their careers. Effective data analysis requires a scientist to wear multiple hats: they must understand the fundamental mathematics, know how to compute it efficiently, and be able to communicate and engage with scientists who are experts in the physical phenomena they model. Both PIs use their expertise in visualization to provide a cross-cutting platform for learning how to be a data scientist. A logical next step is to expand their collaboration by further integrating the training of personnel across their campuses.