Algorithms for Visualization in Digital Humanities

Thom Castermans

first promotor: prof.dr. B. Speckmann (TU/e)
second promotor: dr. M.A. Westenberg (TU/e)
co-promotor: dr. K.A.B. Verbeek (TU/e)
Eindhoven University of Technology
Date: 29 August 2019
Thesis: PDF

Summary

This dissertation encompasses the most important results of the interdisciplinary research project CatVis, which is a collaborative effort between computer scientists, specifically researchers working on applied geometric algorithms and visualization; data scientists working at OCLC, a company that hosts WorldCat, the world’s largest database with information on library collections; and philosophers with a strong interest in the methodological foundations of computational approaches and visualization research. The goal of CatVis is to provide visual tools and methods that allow librarians to manage and understand hundreds of millions of bibliographic records, and to develop visual tools and methods that aid philosophers in their research.

We describe a number of methodological and philosophical challenges that arose within CatVis. The challenges we describe concern aspects of one single epistemic need: that of methodologically securing (an increase in) trust in visualizations. We discuss the lack of ground truths in the (Digital) Humanities and argue that trust in visualizations requires that we evaluate visualizations on the basis of ground truths that Humanities scholars themselves create. We further argue that trust in visualizations requires that a visualization provides provable guarantees on the faithfulness of the visual representation and that we must clearly communicate to the users which part of the visualization can be trusted and how much. We also discuss transparency and accessibility in visualization research and provide measures for securing them.

After that we present GlamMap, a visualization tool for large, multi-variate georeferenced Humanities data sets. It visualizes the data as glyphs on a zoomable geographic map, and performs clustering and data aggregation at each zoom level to avoid clutter and to prevent overlap of symbols. GlamMap was developed for the Galleries, Libraries, Archives, and Museums (GLAM) domain in cooperation with researchers in philosophy and linguistics.

GlamMap poses an interesting theoretical problem that can be formalized as follows. Consider a set of disjoint square glyphs on an interactive map. When the user zooms out, the glyphs grow relative to the map, possibly with different speeds. When two glyphs intersect, we wish to replace them by a new glyph that captures the information of the intersecting glyphs. We present a fully dynamic kinetic data structure that maintains a set of n disjoint growing squares. Our data structure uses O(n log n loglog n) space, supports queries in worst case O(log² n) time, and updates in O(log⁵ n) amortized time. This leads to an O(n α(n) log⁵ n) time algorithm to solve the agglomerative clustering problem, a significant improvement over the original O(n²) time GlamMap algorithm.

Although our algorithm is asymptotically much faster than previous approaches, from a practical point of view, it does not perform better for n ≤ 10⁶. We present a new agglomerative clustering algorithm which works efficiently in practice. Our algorithm relies on the use of quadtrees to speed up spatial computations. Interestingly, even in non-pathological data sets we can encounter large glyphs that intersect many quadtree cells and that are involved in many clustering events. We therefore devise a special strategy to handle such large glyphs. We test our algorithm on several synthetic and real-world data sets and show that it performs well in practice.

We contribute a novel type of low distortion radial embedding which focuses on one specific entity and its closest neighbors. Our embedding preserves near-exact distances to the focus entity and aims to minimize distortion between the other entities. We present an interactive exploration tool SolarView, which places the focus entity at the center of a solar system and embeds its neighbors guided by concentric circles. SolarView provides an implementation of our novel embedding and several state-of-the-art dimensionality reduction and embedding techniques, which we adapted to our setting in various ways. We experimentally evaluate our embedding and compare it to these state-of-the-art techniques. The results show that our embedding competes with these techniques and achieves low distortion in practice. Our method performs particularly well when the visualization, and hence the embedding, adheres to the solar system design principle of our application. Nonetheless –as with all dimensionality reduction techniques– the distortion may be high. We leverage interaction techniques to give clear visual cues that allow users to accurately judge distortion.

Motivated by set visualization, we study plane supports of hypergraphs, where the vertices have fixed locations in the plane, and the supports have straight-line edges. Following principles well-established in the visualization community, we investigate finding plane supports with minimum total edge length. After showing that this is NP-hard, we provide a sufficient condition for the existence of plane supports. Our main contribution however lies in providing a number of heuristic algorithms for finding plane supports with low total edge length: short plane supports. We implement these algorithms, and experimentally compare different settings to evaluate the effects of, e.g., dropping the planarity requirement, or requiring that the support is a tree.