Representing databases of materials and molecules in two dimensions

The research highlighted in this post is part of the following paper: “Cluster-based multidimensional scaling embedding tool for data visualization”, by P. Hernández-León and M.A. Caro, Phys. Scr. 99, 066004 (2024). Available Open Access. The figures in this article are reproduced from that publication under the CC-BY 4.0 license.

One of the consequences of the emergence and popularization of the atomistic machine learning (ML) field is the proliferation of databases of materials and molecules, which contain detailed information about the atomistic structure of these compounds in the form of Cartesian atomic coordinates (plus the lattice vectors for periodic simulation cells). These databases often contain thousands of individual structures, and developing an understanding of what a database contains, and of how similar (or different) its structures are to one another, becomes difficult. Different ML techniques can be useful for comparing and visualizing high-dimensional data. In addition, when the data points being compared have different dimensions, i.e., atomic environments with different numbers of atoms, atomistic ML techniques also become useful.

Examples of ML tools for data visualization are dimensionality-reduction and low-dimensional embedding techniques. For instance, the Scikit-learn library offers an easy-to-use Python interface to many of them, and a nice documentation page, categorized under “manifold learning”. A popular example is multidimensional scaling (MDS), which is also rather intuitive to use and understand: given a metric space in high dimensions, obtain a non-linear representation of the data in lower dimensions that preserves the metric structure of the original high-dimensional space as closely as possible. What does this mean? If we can define a notion of “distance” in the original space, \{d_{ij}\}, in terms of how far away two data points i and j are from each other, and these distances obey the triangle inequality*, then we can generate a non-linear embedding of the data in lower dimensions (e.g., \{(\tilde{x},\tilde{y})\} in two dimensions, with \tilde{d}_{ij} = \sqrt{(\tilde{x}_i-\tilde{x}_j)^2 + (\tilde{y}_i-\tilde{y}_j)^2}) by minimizing the residual \sum_{ij} (d_{ij} - \tilde{d}_{ij})^2 with respect to the embedded coordinates.

Therefore, provided that we have a measure of distance/dissimilarity between our atomic structures in high dimensions, we can in principle use MDS (or another embedding technique) to visualize the data. But how do we obtain such distances? That is, how do we compare the set of 3N atomic coordinates \{x_1, y_1, z_1, ..., x_N, y_N, z_N\} in structure A to the set of 3M atomic coordinates \{x'_1, y'_1, z'_1, …, x'_M, y'_M, z'_M\} in structure B? If we just naively compute the Euclidean distance, d_{AB} = \sqrt{(x_1-x'_1)^2 + (y_1-y'_1)^2 + (z_1-z'_1)^2 + ...}, several problems become apparent. First, what if N \ne M? This is the most general scenario after all! Second, what if we shift all the coordinates of one of the structures (a translation operation)? The distance changes, but is this indicative of whether the structures are more or less different now? Third, what if we rotate one of the structures? Again, the same problem as with a translation. In fact, these translation and rotation arguments between A and B become even stronger when we consider a clearly pathological case: comparing A to a rotated or translated copy of itself would give the result that A and this copy are not the same anymore! To tackle these different issues, the atomistic ML community has resorted to using invariant atomic descriptors, often based on the SOAP formalism [1], to quantify the dissimilarity between structures and carry out the representation in two dimensions [2,3], i.e., to be able to visualize these databases in a plot within the “human grasp”. Here, the distance in the original space is given by the SOAP kernel*, d_{AB} = \sqrt{1 - (\textbf{q}_A \cdot \textbf{q}_B)^\zeta}, which provides a natural measure of similarity between atomic structures (dissimilarity when used to define the distance) accounting for translational and rotational invariance.
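The kernel distance itself is simple to evaluate once the descriptor vectors are available. Here is a sketch of the formula above, using random placeholder vectors in place of real SOAP descriptors (the translational and rotational invariance lives in the descriptor construction, not in this formula):

```python
import numpy as np

def soap_distance(q_A, q_B, zeta=2):
    """Kernel-induced distance d_AB = sqrt(1 - (q_A . q_B)^zeta)
    for unit-norm descriptor vectors."""
    k = np.dot(q_A, q_B) ** zeta
    return np.sqrt(max(1.0 - k, 0.0))  # clip tiny negatives from rounding

# Placeholder "descriptors": in practice these would come from a SOAP code
rng = np.random.default_rng(0)
q_A = rng.random(10); q_A /= np.linalg.norm(q_A)
q_B = rng.random(10); q_B /= np.linalg.norm(q_B)

print(soap_distance(q_A, q_A))  # ~0: a structure is identical to itself
print(soap_distance(q_A, q_B))  # > 0: dissimilar structures
```

Note that identical structures give distance zero regardless of how they are translated or rotated, because the descriptor vectors are unchanged by those operations.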

Given the high complexity of some of these databases, even these sophisticated SOAP distances in combination with regular embedding tools will not do the trick. For instance, while developing our carbon-hydrogen-oxygen (CHO) database for a previous study on X-ray spectroscopy [4], we noticed that regular MDS got “confused” by the large amount of data and had a tendency to represent very different environments close together, even environments centered on different atomic species.

To overcome this issue with regular MDS embedding, our group, with Patricia Hernández-León doing all the heavy lifting, has proposed to “ease up” the MDS task by splitting the overall problem into a set of smaller problems. First, we group similar data together according to their distances in the original space – this is important because the distances in the original space (before embedding) preserve all the information, much of which is lost after embedding. We use an ML clustering technique, k-medoids, to build n clusters, each containing similar data points, i.e., atomic structures. Then, we proceed to embed these data hierarchically: we define several levels of similarity, so that the embedding problem is simpler within each level because 1) there is less data and 2) the data are less dissimilar. The hierarchical embedding is illustrated schematically in the figure below.
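To illustrate the clustering step, here is a minimal k-medoids sketch operating on a precomputed distance matrix. This toy implementation only conveys the idea; the actual cl-MDS code has its own k-medoids machinery:

```python
import numpy as np

def k_medoids(D, n_clusters, n_iter=100, seed=0):
    """Minimal k-medoids on a precomputed distance matrix D (n x n).
    Alternates assignment to the nearest medoid with a medoid update."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=n_clusters, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)      # assignment step
        new_medoids = medoids.copy()
        for k in range(n_clusters):                    # update step
            members = np.where(labels == k)[0]
            if members.size:
                # new medoid = member minimizing total distance to its cluster
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break                                      # converged
        medoids = new_medoids
    return medoids, labels

# Two well-separated groups of points on a line
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = k_medoids(D, n_clusters=2)
print(labels)  # first three points share one label, last three the other
```

Because k-medoids (unlike k-means) only ever needs the pairwise distances, it pairs naturally with kernel-induced distances such as the SOAP one, where no meaningful coordinate "average" exists.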

The low-level (more local) embeddings are transferred to the higher-level (more global) maps through the use of what we call “anchor points”: data points (up to 4 per cluster) that serve as reference points across different embedding levels. We call this method cluster-based MDS (cl-MDS), and both the paper [5] and the code [6] are freely available online. With this method, the two-dimensional representation of the CHO database of materials is now much cleaner than before (see the featured image at the top of this page), and the method can be used in combination with information highlights, e.g., to denote the formation energy or other properties of the visualized compounds. Here is an example dealing with small PtAu nanoparticles, where the (x,y) coordinates are obtained from the cl-MDS embedding and the color highlighting shows different properties:

While our motivation for developing cl-MDS and our example applications come from materials modeling, the method is general and can be applied to other problems within the domain of visual representation of high-dimensional data.
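As a closing illustration of the anchor-point idea described above, a local embedding can be carried into a global map by fitting a transform on points shared between the two levels. The sketch below uses a simple least-squares affine fit on hypothetical 2D coordinates; cl-MDS defines its own transformations, so this is only a conceptual analogue:

```python
import numpy as np

def align_with_anchors(local_pts, local_anchors, global_anchors):
    """Map points from a local 2D embedding into global coordinates using
    an affine transform fitted on shared anchor points (least squares)."""
    ones = np.ones((local_anchors.shape[0], 1))
    A = np.hstack([local_anchors, ones])        # homogeneous coordinates
    T, *_ = np.linalg.lstsq(A, global_anchors, rcond=None)  # (3, 2) transform
    return np.hstack([local_pts, np.ones((local_pts.shape[0], 1))]) @ T

# Toy example: the local map is a shifted and scaled copy of the global one
global_anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
local_anchors = 2.0 * global_anchors + np.array([5.0, -3.0])
local_pts = 2.0 * np.array([[0.5, 0.5]]) + np.array([5.0, -3.0])
out = align_with_anchors(local_pts, local_anchors, global_anchors)
print(out)  # approximately [[0.5, 0.5]]
```

Fitting the transform only on the anchor points, and then applying it to the whole cluster, is what lets the local detail survive inside the global picture.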

References

  1. A.P. Bartók, R. Kondor, and G. Csányi. “On representing chemical environments”. Phys. Rev. B 87, 184115 (2013). Link.
  2. S. De, A.P. Bartók, G. Csányi, and M. Ceriotti. “Comparing molecules and solids across structural and alchemical space”. Phys. Chem. Chem. Phys. 18, 13754 (2016). Link.
  3. B. Cheng, R.-R. Griffiths, S. Wengert, C. Kunkel, T. Stenczel, B. Zhu, V.L. Deringer, N. Bernstein, J.T. Margraf, K. Reuter, and G. Csányi. “Mapping materials and molecules”. Acc. Chem. Res. 53, 1981 (2020). Link.
  4. D. Golze, M. Hirvensalo, P. Hernández-León, A. Aarva, J. Etula, T. Susi, P. Rinke, T. Laurila, and M.A. Caro. “Accurate Computational Prediction of Core-Electron Binding Energies in Carbon-Based Materials: A Machine-Learning Model Combining Density-Functional Theory and GW”. Chem. Mater. 34, 6240 (2022). Link.
  5. P. Hernández-León and M.A. Caro. “Cluster-based multidimensional scaling embedding tool for data visualization”. Phys. Scr. 99, 066004 (2024).
  6. https://github.com/mcaroba/cl-MDS
