The **atomistic modeling field is rapidly evolving** in the midst of a machine learning (ML) and artificial intelligence (AI) driven **revolution**. Simulations of molecules and materials, involving thousands and even millions of atoms, or molecular dynamics (MD) simulations spanning long simulation times, are now routinely done with (close to) *ab initio* accuracy. These were *unthinkable* just a decade ago. Yet, **the ultimate test for any theory and simulation is experiment**, and achieving experimental agreement and incorporating experimental data directly into these atomistic simulations is necessarily the **next frontier in atomistic modeling**.

A central object of atomistic materials modeling is obtaining the atomic-scale structure of materials. This is particularly important (and interesting) when the material lacks crystalline ordering and experimental techniques like X-ray diffraction (XRD) cannot be used to determine its structure. Amorphous and generally disordered materials are one such class of materials, but liquids and interfaces are also relevant here. Through atomistic modeling we can attempt to derive structural models of these materials, e.g., a set of representative structures given in terms of the atomic positions (the Cartesian coordinates of the atoms). After deriving these models, we want to connect the simulation with the experiment, both to verify the soundness of the computational approach and to gain atomistic insight into the structure of the material. One way to do this, indirectly, is by using spectroscopy techniques such as X-ray photoelectron spectroscopy (XPS). On the one hand, we measure the XPS spectrum of the material experimentally; on the other, we make a computational XPS prediction for our candidate structural model. If they agree, we gain confidence that the model resembles reality; if they don’t, we keep looking for better structural models. (Of course, the whole story is more nuanced, but this is the gist of it.)

During our previous work on XPS prediction in collaboration with Dorothea Golze, we showed how XPS prediction for amorphous carbon-based materials can be made quantitatively accurate via a combination of electronic structure methods and atomistic machine learning technology. One of the disappointing aspects of that work is that we were unable to generate structural models of oxygenated amorphous carbon (a-CO_{x}) whose XPS spectra matched the experimental one. Since we had very good confidence in the accuracy of our computational XPS prediction, the necessary consequence was that the computational models of the structure of a-CO_{x} we were working with did not resemble the experimental samples. At the time, these models had been provided by our collaborator Volker Deringer from expensive DFT calculations, and we only had access to three a-CO_{x} models with a couple hundred atoms and different oxygen content. I was left intrigued by this issue and decided to train a machine learning potential (MLP) for CO. With this MLP, I could efficiently generate lots (thousands if not millions) of different samples at different conditions and thought: “surely, one of them will give the match with experiment”. But this was not the case! The reason is that the experimental growth is a non-equilibrium process involving energetic deposition of atoms: C atoms are deposited onto a growing film in an oxygen atmosphere, and the oxygen atoms are co-deposited with the carbon atoms into the film (see the nice work by Santini *et al*.). Direct simulation of the deposition process is very challenging. Our computational generation protocol was based on indirect structure generation routes and favored the formation of thermodynamically stable products: solids with low oxygen content and lots of CO and CO_{2} molecules.

At this point, we had to think a bit outside the box. The idea that came to mind is known in the literature as reverse Monte Carlo (RMC): the atomic positions are updated and the observable directly comparable to experiment is monitored on the fly, such that moves that increase the agreement between simulation and experiment for said observable are favored. After many steps, computational and experimental observables will agree. There are (at least) two problems with this approach, and both are related to the need for cheap calculations, given the sheer number of individual evaluations of the system’s energy required in Monte Carlo optimization. First, RMC for materials has traditionally been done using XRD only, because XRD is computationally cheap to compute given a set of atomic positions. Second, the RMC protocol will not ensure that the generated structures are energetically sound (low in energy), as the only constraint is that the agreement with the experimental observable should improve. Previous work has dealt with this problem in the context of “hybrid” RMC (HRMC), where the optimization is done simultaneously on the observable agreement and the system’s energy via an interatomic potential. Again, the HRMC approach has traditionally been used with “cheap” interatomic potentials, because the individual evaluations of the objective function (the function that depends on XRD and total energy and is being minimized) need to be efficient given how many of them are required. But these classical interatomic potentials are not accurate enough to describe the complex chemical bonding taking place in a-CO_{x}!

But now we have both an accurate description of the energy, afforded by our CO MLP, and a quantitatively accurate description of XPS signatures, thus addressing the issues inherent to HRMC that previously prevented us from applying this approach to a-CO_{x}. The only hurdle left at this point is how to handle the variable number of oxygen atoms in the samples: after all, we do not know how much oxygen is in there; this is actually one of the motivations for doing the simulations in the first place! To tackle this, we can resort to the grand-canonical version of Monte Carlo (GCMC), where a chemical potential is defined for the oxygen atoms so that they can be added to or removed from the simulation.
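The acceptance logic behind this combined approach can be sketched in a few lines. This is a hedged illustration of the hybrid Monte Carlo idea, not the actual TurboGAP implementation; the function names and the exact form of the objective are assumptions made for illustration.

```python
import math
import random

def objective(energy, chi2, weight):
    """Combined HRMC cost: total energy (from the MLP) plus a weighted
    mismatch (chi2) between predicted and experimental spectra."""
    return energy + weight * chi2

def accept_move(cost_old, cost_new, temperature, rng=random.random):
    """Metropolis criterion applied to the combined objective: downhill
    moves are always accepted, uphill moves with Boltzmann probability."""
    if cost_new <= cost_old:
        return True
    return rng() < math.exp(-(cost_new - cost_old) / temperature)
```

In a grand-canonical extension, trial insertions and deletions of oxygen atoms would additionally shift the cost by the oxygen chemical potential, so that the oxygen content equilibrates during the run.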

With all of these ideas in place, the next step was to team up and get it done. Together with **Tigany Zarrouk** and **Rina Ibragimova** from our group, and **Albert Bartók** from Warwick, we carried out an efficient implementation of these methods (special hats-off to Tigany for implementing this in the **TurboGAP** code!) and did lots of simulations. All the results and methodology are summarized in our paper, titled “Experiment-driven atomistic materials modeling: A case study combining X-ray photoelectron spectroscopy and machine learning potentials to infer the structure of oxygen-rich amorphous carbon” (and referenced at the beginning of this post). There, we leverage the predictive power of atomistic ML techniques but also their flexibility. We combine the accurate description of the potential energy surface of materials afforded by state-of-the-art MLPs with on-the-fly prediction of XPS. This allows us to make a direct link between the atomistic structure optimization procedure and the experimental structural fingerprint, such that it is the agreement between experiment and simulation that drives the structure optimization. The analysis reveals the elemental composition and atomic motif distribution in a-CO_{x}, and points toward a maximum oxygen content in carbon-based materials of about 30%.

While the method is illustrated in our manuscript for XPS as the experimental analytical technique and a-CO_{x} as an interesting (and challenging) test-case material, the methodology (which we call the “modified Hamiltonian” approach) is general, and we are already extending it to incorporate other techniques, like XRD, Raman spectroscopy, etc. More generally, an ensemble of experimental techniques can be combined, e.g., to overcome one of the known limitations of XPS (and of other individual techniques on their own), namely that more than one atomic motif can contribute in the same spectral region.

In addition to all the things already said, the paper also touches on a somewhat sensitive topic for the experimental materials community: the fact that, while XPS is a widely used technique, XPS analyses are often plagued by (incorrect) assumptions and suffer to a certain degree from arbitrariness in the fits. This is particularly true for carbon-based materials. But in pointing out the problem we also point out the solution: to incorporate ML-driven atomistic simulation into the normal workflows of XPS fitting procedures and, in the future, also of other analytical techniques. This prospect points to a very interesting time ahead in the field of structural and chemical analysis of materials and molecules.

*This work was supported financially by the Research Council of Finland and benefitted from computational resources provided by CSC (the Finnish IT Center for Science) and the Aalto University Science-IT project.*

One of the consequences of the emergence and popularization of the **atomistic machine learning** (ML) field is the proliferation of **databases of materials and molecules**, which contain detailed information about the atomistic structure of these compounds in the form of Cartesian coordinates with the atomic positions (and the lattice vectors for periodic simulation cells). These databases often contain thousands of individual structures. Developing an understanding of what is in a database and how the different structures differ from (or resemble) one another becomes difficult. When comparing and visualizing high-dimensional data, different ML techniques can be useful. In addition, when comparing structural data points with different dimensions, i.e., atomic environments with different numbers of atoms, atomistic ML techniques also become useful.

Examples of ML tools for data visualization are dimensionality-reduction and low-dimensional embedding techniques. For instance, the Scikit-learn library offers an easy-to-use Python interface to many of those, and a nice documentation page, categorized under “manifold learning”. A popular example of these is multidimensional scaling (MDS), which is also rather intuitive to use and understand: given a metric space in high dimensions, obtain a non-linear representation of the data in lower dimensions that preserves the metric structure of the original high-dimensional space as closely as possible. What does this mean? If we can define a notion of “distance” d_ij in the original space, in terms of how far away two data points x_i and x_j are from each other, and these distances obey the triangle inequality*, then we can generate a non-linear embedding of the data in lower dimensions (e.g., in two dimensions, with embedded coordinates y_i) by trying to minimize the residual Σ_{i<j} (d_ij − |y_i − y_j|)², optimizing the embedded coordinates y_i.

*The triangle inequality must be fulfilled for a space to be considered metric: d(A, C) ≤ d(A, B) + d(B, C); in other words, a “straight” path from A to C is always shorter than or equal in distance to going from A to C via an intermediate point B.
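To make the minimized residual concrete, here is a minimal sketch of the MDS stress function, assuming a precomputed matrix of original-space distances; a real optimizer (e.g., scikit-learn's MDS with precomputed dissimilarities) would adjust the embedded coordinates to drive this value down.

```python
import math

def euclidean(a, b):
    """Distance between two embedded (low-dimensional) points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stress(d, coords):
    """MDS residual: sum of squared differences between original-space
    distances d[i][j] and distances between embedded coordinates."""
    n = len(coords)
    return sum((d[i][j] - euclidean(coords[i], coords[j])) ** 2
               for i in range(n) for j in range(i + 1, n))
```

A perfect embedding reproduces all pairwise distances and gives zero stress; in practice some distortion is unavoidable when squeezing high-dimensional data into two dimensions.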

Therefore, provided that we have a measure of distance/dissimilarity between our atomic structures in high dimensions, we can in principle use MDS (or another embedding technique) to visualize the data. But how do we obtain such distances? That is, how do we compare the set of atomic coordinates in structure A to the set of atomic coordinates in structure B? If we just naively compute the Euclidean distance between the two sets of coordinates, several problems become apparent. First, what if the two structures contain different numbers of atoms, N_A ≠ N_B? This is the most general scenario after all! Second, what if we shift all the coordinates of one of the structures (a translation operation)? The distance changes, but is this indicative of whether the structures are more or less different now? Third, what if we rotate one of the structures? Again, the same problem as with a translation. In fact, these translation and rotation arguments become even stronger when we consider a clearly pathological case: comparing A to a rotated or translated copy of itself would give the result that A and this copy are not the same anymore! To tackle these different issues, the atomistic ML community has resorted to using invariant atomic descriptors, often based on the SOAP formalism [1], to quantify the dissimilarity between structures and carry out the representation in two dimensions [2,3], i.e., to be able to visualize these databases in a plot within the “human grasp”. Here, the distance in the original space is derived from the SOAP kernel*, d_ij = √(2 − 2k_ij), which provides a natural measure of similarity between atomic structures (dissimilarity when used to define the distance) accounting for translational and rotational invariance.

*A SOAP descriptor q_i provides a rotationally and translationally invariant representation of atomic environment i, usually centered on a given atom (so it would be the atomic environment of that atom). It is a high-dimensional vector, whose dimension depends on the number of species and how accurately one wants to carry out the representation, with anywhere between a few tens and a few thousands of dimensions being common. The SOAP kernel, k_ij = (q_i · q_j)^ζ with ζ ≥ 1 and normalized descriptors, gives a bounded measure of similarity between 0 and 1 for environments i and j. The distance defined therefrom, d_ij = √(2 − 2k_ij), thus gives a measure of dissimilarity between i and j.
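As a sketch, and assuming the descriptor vectors are already available from some SOAP implementation (e.g., dscribe or TurboGAP), the kernel and the distance built from it look like this:

```python
import math

def normalize(q):
    """SOAP descriptors are normalized so that k(i, i) = 1."""
    norm = math.sqrt(sum(x * x for x in q))
    return [x / norm for x in q]

def soap_kernel(qa, qb, zeta=2):
    """k = (q_a . q_b)^zeta: bounded similarity for normalized descriptors."""
    dot = sum(a * b for a, b in zip(qa, qb))
    return dot ** zeta

def kernel_distance(qa, qb, zeta=2):
    """d = sqrt(2 - 2k): zero for identical environments."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * soap_kernel(qa, qb, zeta)))
```

Because the descriptors themselves are invariant, rotating or translating a structure leaves this distance unchanged, which resolves the pathological cases discussed above.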

Given the high complexity of some of these databases, even these sophisticated SOAP distances in combination with regular embedding tools will not do the trick. For instance, while developing our carbon-hydrogen-oxygen (CHO) database for a previous study on X-ray spectroscopy [4], we noticed that regular MDS got “confused” by the large amount of data and had a tendency to represent together environments that were very different, even environments centered on different atomic species.

To overcome this issue with regular MDS embedding, our group, with **Patricia Hernández-León** doing all the heavy lifting, has proposed to “ease up” the MDS task by splitting the overall problem into a set of smaller problems. First, we group similar data together according to their distances *in the original space* – this is important because the distances in the original space (before embedding) preserve all the information, much of which is lost after embedding. We use an ML clustering technique, k-medoids, to build clusters, each containing similar data points, i.e., atomic structures. Then, we proceed to embed these data hierarchically: define several levels of similarity, so that the embedding problem is simpler within each level because 1) there is less data and 2) the data are less dissimilar. Schematically, the hierarchical embedding looks like this:
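The clustering step can be illustrated with a toy, PAM-style k-medoids loop operating directly on a precomputed distance matrix (so it works with SOAP kernel distances just as well as with Euclidean ones). This is a simplified sketch for illustration, not the actual cl-MDS implementation:

```python
import random

def kmedoids(dist, k, n_iter=100, seed=0):
    """Toy k-medoids on a precomputed distance matrix dist[i][j]."""
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update each medoid to the member minimizing intra-cluster distance.
        new_medoids = []
        for m, members in clusters.items():
            best = min(members, key=lambda c: sum(dist[c][j] for j in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, clusters
```

Because only the distance matrix is needed, the same routine applies to atomic environments of different sizes, which is exactly the setting described above; each resulting cluster can then be embedded separately.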

The low-level (more local) embeddings are transferred to the higher-level (more global) maps through the use of what we call “anchor points”; these are data points (up to 4 per cluster) that serve as reference points across different embedding levels. We call this method cluster-based MDS (cl-MDS), and both the paper [5] and code [6] are freely available online. With this method, the two-dimensional representation of the CHO database of materials is now much cleaner than before (see featured image at the top of this page), and the method can be used in combination with information highlights, e.g., to denote the formation energy or other properties of the visualized compounds. Here is an example dealing with small PtAu nanoparticles, where the coordinates are obtained from the cl-MDS embedding and the color highlighting shows different properties:

While our motivation to develop cl-MDS and our case applications are in materials modeling, the method is general and can be applied to other problems within the domain of visual representation of high-dimensional data.

1. A.P. Bartók, R. Kondor, and G. Csányi. “*On representing chemical environments*”. Phys. Rev. B **87**, 184115 (2013).
2. S. De, A.P. Bartók, G. Csányi, and M. Ceriotti. “*Comparing molecules and solids across structural and alchemical space*”. Phys. Chem. Chem. Phys. **18**, 13754 (2016).
3. B. Cheng, R.-R. Griffiths, S. Wengert, C. Kunkel, T. Stenczel, B. Zhu, V.L. Deringer, N. Bernstein, J.T. Margraf, K. Reuter, and G. Csányi. “*Mapping materials and molecules*”. Acc. Chem. Res. **53**, 1981 (2020).
4. D. Golze, M. Hirvensalo, P. Hernández-León, A. Aarva, J. Etula, T. Susi, P. Rinke, T. Laurila, and M.A. Caro. “*Accurate Computational Prediction of Core-Electron Binding Energies in Carbon-Based Materials: A Machine-Learning Model Combining Density-Functional Theory and GW*”. Chem. Mater. **34**, 6240 (2022).
5. P. Hernández-León and M.A. Caro. “*Cluster-based multidimensional scaling embedding tool for data visualization*”. Phys. Scr. **99**, 066004 (2024).
6. https://github.com/mcaroba/cl-MDS

In the field of catalysis it is common to use rare metals because of their superior catalytic properties. For example, platinum and Pt-like metals show the best performance for water splitting, but are too scarce and expensive to be used for many industrial-scale purposes. Instead, research is intensifying on finding alternative solutions based on widely available and cheap materials, especially metallic compounds. (For an overview of different materials specifically for water splitting, see, e.g., the review by Wang et al.)

Of all metallic elements on the Earth’s crust, only aluminium is more abundant than iron. Both metals can be used for structural purposes. However, iron is easier to mine and can be used to make a huge variety of steel alloys with widely varying specifications. For these reasons, iron ore constitutes almost 95% of all industrially mined metal globally. Being an abundant and readily available commodity, the prospect of potentially replacing critical metals with iron is very attractive. This includes developing new Fe-based materials for catalysis.

Some of the main aspects (besides cost and availability) to consider when assessing the prospects of a catalyst material are 1) **activity** (how much product can we make with a given amount of electrical power), 2) **selectivity** (whether we make a single product or a mixture of products) and 3) **stability** (how long the material and its properties last under operating conditions). For instance, a very active and selective material for the oxygen evolution reaction will not be useful in practice if it has a high tendency to corrode. In that regard, native iron surfaces are not particularly good catalysts. However, there are different ways in which this can be tackled. One way to tune the properties of a material is via compositional engineering; i.e., by “alloying” two or more compounds we can produce a resulting compound with quantitatively or even qualitatively different properties compared to the precursors. Another way to tune these properties is by taking advantage of the structural diversity of a compound, because the catalytic activity of a material can be traced back to atomic-scale “active sites”, where the electrochemical reactions take place.

At ambient conditions, bulk (solid) iron has a body-centered cubic (bcc) structure, where every atom has 8 neighbors each at the same distance, and all atomic sites look the same. Iron *surfaces* have more diversity of atomic sites, depending on the cleavage plane and reconstruction effects. With very thin (nanoscale) surfaces, even the crystal structure can be modified from bcc to face-centered cubic (fcc). In surfaces, the exposed sites differ from those in the bulk, but are still relatively similar to one another (with a handful of characteristic available atomic motifs). However, when we move to *nanoparticles* (known as “nanoclusters”, when they are very small), ranging from a few to a few hundreds (or perhaps thousands) of atoms, the situation is significantly more complex. For small nanoparticles, the morphology of the available exposed atomic sites will depend very strongly on the size of the nanoparticle. And a single nanoparticle will itself display a relatively large variety of surface sites. And because the atomic environments of these sites are so different, so can also be their catalytic activity. Thus, active sites that are not available in the bulk can be present for the same material in its nanoscale form(s).

To understand and explore the diversity of atomic environments in nanoscale iron, we (mostly Richard Jana with some help from me) developed a new “general-purpose” machine learning potential (MLP) for iron and used it to generate “stable” (i.e., low-energy) iron nanoparticles. Iron is a particularly hard system for MLPs, because of the existence of magnetic degrees of freedom (related to the effective net spins around the iron atoms), in addition to the nuclear degrees of freedom (the “positions” of the atoms). Usually, MLPs (as well as traditional atomistic force fields) are designed to only account explicitly for the latter. For this reason, existing interatomic potentials have been developed to accurately describe the potential energy landscape of “normal” ferromagnetic iron (bcc iron, the stable form at ambient conditions), but fail for other forms, which are relevant at extreme thermodynamic conditions (high pressure and temperature) or at the nanoscale (nanoparticles). While our methodology is still incapable of explicitly accounting for magnetic degrees of freedom, by carefully crafting a general training database we managed to get our iron MLP to implicitly learn the energetics of structurally diverse forms of iron, and in particular managed to achieve very accurate results for small nanoparticles, where the flexibility of a general-purpose MLP is most needed.

We built a catalogue of iron nanoparticles from 3 to 200 atoms, and found structures that were lower in energy than many of those previously available in the literature. Using data clustering techniques, we could identify the most characteristic sites on the nanoparticle surfaces based on their morphological similarities. In the video below, you can see all the lowest-energy nanoparticles we found at each size (in the 3-200 atom range) with their surface atoms color-coded according to the 10 most characteristic motifs identified by our algorithm.

The reactivity of each site, e.g., how strongly it can bind an adsorbate, such as a hydrogen atom or a CO molecule, depends very strongly on the surroundings, especially the number of neighbor atoms and how they are arranged. For instance, surface atoms that are almost “buried” inside the nanoparticle are more stable (less reactive) than those which are sticking out and have only a few neighbors. Sites that bind adsorbates either too strongly or not at all tend to have poor catalytic activity, whereas sites in between are the most promising ones, because they can transiently bind a reaction intermediate and subsequently release it, allowing the reaction to take place. We have made initial progress in drawing the connection between motif classification and activity based on the MLP predictions, as seen in the featured figure at the top of this page. Navigating this wealth of surface sites and thoroughly screening their potential to catalyze specific chemical reactions (with more application-specific ML models or with DFT) is the next logical step.

We hope that this work will stimulate further research into the catalytic properties of iron-based nanocatalysts and bring us one step closer to the cheap and sustainable development of electrocatalysts for industrial production of fuels and chemicals.

On the way to understand the properties of an amorphous material, our first pit stop is understanding its atomic-scale structure. For this purpose, computational atomistic modeling tools are particularly useful, since many of the techniques commonly used to study the atomistic structure of crystals are not applicable to amorphous materials, precisely because they rely on the periodicity of the crystal lattice. Unfortunately, one of the most used computational approaches for materials modeling, density functional theory (DFT), is computationally too expensive to study the sheer structural complexity in real amorphous materials, in particular long-range structure for which many thousands or even millions of atoms need to be considered.

The introduction and popularization in recent years of machine learning (ML) based atomistic modeling and, in particular, ML potentials (MLPs), has enabled for the first time realistic studies of amorphous semiconductors with accuracy close to that of DFT. As two of the most important such materials, amorphous silicon (a-Si) and amorphous carbon (a-C) have been the target of much of these early efforts.

In a recent Topical Review in Semiconductor Science and Technology (provided Open Access via this link), I have tried to summarize our attempts to understand the structure of a-C and a-Si and highlight how MLPs allow us to peek at the structure of these materials and draw the connection between this structure and the emerging properties. The discussion is accompanied by a general description of atomistic modeling of a-C and a-Si and a brief introduction to MLPs, and it could be interesting to the materials scientist curious about the modeling of amorphous materials or the DFT practitioner curious about what MLPs can do that DFT can’t.

At this stage, the field is still evolving (fast!) and I expect(/hope) this review will become obsolete soon, as more accurate and CPU-efficient atomistic ML techniques become commonplace. Especially, I expect the description of a-Si and a-C to rapidly evolve from the study of pure samples to the more realistic (and chemically complex) materials, containing unintentional defects as well as chemical functionalization. All in all, I am excited to witness in which direction the field will steer in the next 5-10 years. I promise to do my part and wait at least that much before writing the next review on the topic!

Many popular experimental methods for determining the structure of materials rely on the periodic repetition of atomic arrangements present in crystals. A common example is X-ray diffraction. For amorphous materials, the lack of periodicity renders these methods impractical. Core-level spectroscopies, on the other hand, can give information about the distribution of atomic motifs in a material without the requirement of periodicity. For carbon-based materials, X-ray photoelectron spectroscopy (XPS) is arguably the most popular of these techniques.

In XPS a core electron is excited via absorption of incident light and ejected out of the sample. Core electrons occupy deep levels close to the atomic nuclei, for instance 1s states in oxygen and carbon atoms. They do not participate in chemical bonding since they are strongly localized around the nuclei and lie far deeper in energy than valence electrons. Because core electrons lie so deep in the potential energy well, energetic X-ray light is required to eject them out of the sample. In XPS, the light source is monochromatic, which means that all the X-ray photons have around the same energy, hν. When a core electron in the material absorbs one of these photons with enough energy to leave the sample, we can measure its kinetic energy E_kin and work out its binding energy (BE) as BE = hν − E_kin.
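As a worked example (with illustrative numbers, not measured data): an Al Kα lab source delivers photons of hν = 1486.6 eV, so a photoelectron detected with a kinetic energy of about 1202 eV corresponds to a binding energy near 284.6 eV, in the range typical of C 1s. The bookkeeping is just a subtraction:

```python
def binding_energy(h_nu, e_kin):
    """BE = h*nu - E_kin (spectrometer work-function corrections
    omitted for simplicity). Energies in eV."""
    return h_nu - e_kin

# Illustrative example: Al K-alpha source, hypothetical kinetic energy.
be = binding_energy(1486.6, 1202.0)  # ~284.6 eV
```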

After collecting many of these individual measurements, a *spectrum* of BEs will appear, because each core electron has a BE that depends on its particular atomic environment. For instance, a core electron from a C atom bonded only to other C atoms has a lower BE than a C atom that is bonded also to one or more O atoms. And even the details of the bonding matter: a core electron from a “diamond-like” C atom (a C atom surrounded by four C neighbors) has a higher BE than that coming from a “graphite-like” C atom (which has only three neighbors). Therefore, the different features, or “peaks” in the spectrum can be traced back to the atomic environments from which core electrons are being excited, giving information about the atomic structure of the material. This is illustrated in the featured image of this post (the one at the top of this page).

What makes XPS so attractive for computational materials physicists and chemists like us is that it provides a direct link between simulation and experiment that can be exploited to 1) validate computer-generated model structures and 2) try to work out the detailed atomic structure of an experimental sample. 1) is more or less obvious; 2) is motivated by the fact that experimental analysis of XPS spectra is usually not straightforward because features that come from different atomic environments can overlap on the spectrum (i.e., they coincidentally occur at the same energy). In both cases, computational XPS prediction requires two things. First, a computer-generated atomic structure. Second, an electronic-structure method to compute core-electron binding energies.

While candidate structural models can be made with a variety of tools (a favorite of ours is molecular dynamics in combination with machine-learning force fields), a standing issue with computational XPS prediction is the accuracy of the core-electron BE calculation. Even BE calculations based on density-functional theory, the workhorse of modern ab initio materials modeling, lack satisfactory accuracy. A few years ago my colleague Dorothea Golze, with whom I used to share an office during our postdoc years at Aalto University, started to develop highly accurate techniques for core-electron BE determination based on a Green’s function approach, commonly referred to as the GW method. These GW calculations can yield core-electron BEs at unprecedented accuracy, albeit at great computational cost. In particular, applying this method to atomic systems with more than a hundred atoms is impractical due to CPU costs. This is where machine learning (ML) can come in handy.

Four years ago, around the time when Dorothea’s code was becoming “production ready”, I was just getting started in the world of ML potentials based on kernel regression, using the Gaussian approximation potential (GAP) method developed by my colleagues Gábor Csányi and Albert Bartók ten years before. I also had prior experience doing XPS calculations based on DFT for amorphous carbon (a-C) systems. Then the connection was clear. GAPs work by constructing the total (potential) energy of the system as the optimal collection of individual atomic energies. This approximation is not based on physics (local atomic energies are not a physical observable) but is necessary to keep GAPs computationally tractable. However, the core-electron BE *is* a local physical observable, and thus ideally suited to be learned using the same mathematical tools that make GAP work. In essence, the BE is expressed as a combination of mathematical functions which feed on the details of the atomic environment of the atom in question (i.e., how the nearby atoms are arranged).

Coincidentally, it was also around that time that our national supercomputing center, CSC, was deploying the first of their current generation of supercomputers, Puhti. They opened a call for Grand Challenge proposals for computational problems that require formidable CPU resources. It immediately occurred to me that Dorothea and I should apply for this to generate high-quality GW data to train an ML model of XPS for carbon-based materials. Fortunately, we got the grant, worth around 12.5M CPUh, and set out to make automated XPS a reality.

But even formidable CPU resources are not enough to satisfy the ever-hungry GW method, so we had to be clever about how to construct an ML model which required as few GW data points as possible. We did this in two complementary ways. On the one hand, we used data-clustering techniques that we had previously employed to classify atomic motifs in a-C to select the most important (or characteristic) atomic environments out of a large database of computer-generated structures containing C, H and O (“CHO” materials). On the other hand, we came up with an ML model architecture which combined DFT and GW data. This is handy because DFT data is *comparatively* cheap to generate (it’s not cheap in absolute terms!) and we can learn the *difference* between a GW and a DFT calculation with *a lot* less data than we need to learn the GW predictions directly. So we can construct a baseline ML model using abundant DFT data and refine this model with scarce and precious GW data. And it works!
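The baseline + correction idea can be sketched with deliberately simple stand-ins for the actual kernel models. The numbers below and the constant-offset “correction” are purely illustrative assumptions; the real scheme learns the GW − DFT difference with a GAP-style kernel regressor on atomic environments.

```python
def fit_mean_offset(xs, ys_ref, predict_base):
    """Learn the average residual between expensive reference data and the
    cheap baseline model; this stands in for the delta-model fit."""
    residuals = [y - predict_base(x) for x, y in zip(xs, ys_ref)]
    return sum(residuals) / len(residuals)

# Baseline "model": a pretend DFT-level BE predictor (hypothetical numbers).
def dft_model(x):
    return 284.0 + 0.5 * x

# A handful of expensive GW reference points, used only to learn the delta.
gw_x = [0.0, 1.0, 2.0]
gw_y = [284.8, 285.3, 285.8]  # here exactly dft_model(x) + 0.8

delta = fit_mean_offset(gw_x, gw_y, dft_model)

def predict(x):
    """Final prediction = DFT-level baseline + learned GW-DFT correction."""
    return dft_model(x) + delta
```

The point is that the correction is much smoother than the target itself, so far fewer expensive reference points are needed to learn it.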

Four years and many millions of CPUh later, our freely available XPS Prediction Server can produce an XPS spectrum within seconds for any CHO material, provided the user supplies a computer-generated atomic structure in any of the common formats. Even better, these predicted spectra are remarkably close to those obtained experimentally. This opens the door to more systematic and reliable validation of computer-generated models of materials and a better integration of experimental and computational materials characterization.

We hope that these tools will prove useful to others, and we plan to extend them to other material classes and spectroscopies in the near future.

Li is relatively scarce in the Earth’s crust and the mid-to-long-term supply of Li needed to cover the rapidly increasing demand for Li-ion batteries is in jeopardy. The Li-intercalation process could in principle also be applied to more Earth-abundant ions, like K and Na, thus providing a **reduction in the cost of ion batteries and ensuring future supply of raw materials**. Both are required to scale up the use of ion batteries and to make them affordable for domestic and industrial applications. Unfortunately, Na and K do not intercalate in graphite as favorably as Li does, with Na-intercalated graphite deemed thermodynamically unstable, and in all cases incurring strong dimensional changes between charge and discharge. These dimensional changes pose risks to the mechanical stability of the material and the device containing it, with the associated safety concerns.

Nanoporous carbons are an obvious alternative to graphite for ion intercalation because the pores, interstitial voids between disordered graphitic planes, can be made within a range of sizes, all larger than the usual interplanar spacing in graphite. Thus, **nanoporous carbons can in principle accommodate larger ions**, including Na and K, which motivates their study and structural characterization. However, as is often the case with amorphous and disordered materials, experimental characterization can be challenging, since the standard techniques used to characterize crystals cannot be applied. In the case of nanoporous carbons, experimental characterization of pore sizes and shapes is very complicated.

With this background in mind, we decided to study the microscopic structure and mechanical properties of nanoporous carbon using state-of-the-art atomistic modeling techniques based on machine learning interatomic potentials. These techniques provide, for the first time, the combination of accuracy and computational efficiency required to study nanoporous carbons without the size of the simulation box constraining the size of the pores that can be modeled. Our results are now published in Chemistry of Materials:

**Y. Wang**, Z. Fan, P. Qian, T. Ala-Nissila, **M.A. Caro**; *Structure and Pore Size Distribution in Nanoporous Carbon*. Chem. Mater. (2022). Link to journal’s website. Open Access PDF from the publisher.

We started out by training a Gaussian approximation potential (GAP) for carbon based on the database developed by Deringer and Csányi [Phys. Rev. B 95, 094203 (2017)]. This new potential [10.5281/zenodo.5243184] achieves better accuracy and speed than the earlier version and can accurately predict the defect formation energies in graphitic carbon.

Getting the relative formation energies right is critical for obtaining the correct topology of the complicated network of carbon rings within curved graphitic sheets. In particular, the relative abundance of 5-rings and 7-rings will determine the curvature and thus pore morphology in the material.

With this new potential, we carried out large-scale simulations of graphitization with the TurboGAP code developed in our group, using a melt-graphitize-anneal protocol, akin to that by de Tomas et al. [Carbon 109, 681 (2016)], but now with larger systems (more than 130,000 atoms) and the accuracy provided by the new GAP.
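Schematically, the melt-graphitize-anneal protocol amounts to driving the MD thermostat through three temperature stages. A minimal sketch (the temperatures and durations below are illustrative placeholders, not the actual values used in our simulations):

```python
# A minimal sketch of a melt-graphitize-anneal temperature schedule.
# Stage temperatures and durations are illustrative, not the paper's values.
def temperature_schedule(t_ps):
    """Target thermostat temperature (K) as a function of simulated time (ps)."""
    stages = [
        (50.0, 5000.0),   # melt: randomize the structure in the liquid
        (100.0, 3500.0),  # graphitize: hold where graphitic order forms
        (150.0, 300.0),   # anneal: cool down to room temperature
    ]
    for t_end, temp in stages:
        if t_ps < t_end:
            return temp
    return stages[-1][1]  # stay at the final temperature afterwards
```

A real run would feed this target temperature to the thermostat at every MD step.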

With these simulations we managed to generate realistic nanoporous carbon structures within a wide range of mass densities (0.5 to 1.7 g/cm^{3}), and characterized in detail their short-, medium- and long-range order. For instance, these simulations reveal hexagonal motifs to be the dominant structural block in these materials (as expected) followed by 5-rings, then 7-rings and, in much smaller quantities, larger and smaller ring structures, with almost no density dependence for the most common motifs.
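Ring statistics like these can be estimated by finding, for each bond, the smallest ring that contains it: remove the bond and search for the shortest alternative path between its two atoms. The stdlib sketch below is a simplified version of shortest-path ring counting, not the exact algorithm used in the paper:

```python
from collections import deque, Counter

def shortest_rings(adj):
    """For each bond (i, j), size of the smallest ring containing it:
    remove the bond and BFS for the shortest alternative i -> j path.
    adj: dict mapping atom index to a list of bonded neighbors."""
    sizes = Counter()
    for i in adj:
        for j in adj[i]:
            if j <= i:
                continue  # count each undirected bond once
            dist = {i: 0}
            queue = deque([i])
            while queue:
                u = queue.popleft()
                if u == j:
                    break
                for v in adj[u]:
                    if (u, v) in ((i, j), (j, i)) or v in dist:
                        continue  # skip the removed bond and visited atoms
                    dist[v] = dist[u] + 1
                    queue.append(v)
            if j in dist:
                sizes[dist[j] + 1] += 1  # ring size = path length + closing bond
    return sizes

# Example: a single benzene-like hexagon; each of its six bonds sits in a 6-ring.
hexagon = {k: [(k - 1) % 6, (k + 1) % 6] for k in range(6)}
```

For a real sample, the adjacency list would come from a neighbor search with a bond-length cutoff.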

The pore sizes, the main target of this study, show clearly defined unimodal distributions determined by the overall mass density of the material. This means that the pore sizes and morphologies are relatively homogeneous for a given sample.

Finally, a useful result of our study is a library of nanoporous carbon structures freely available to the community and amenable to future studies on the properties of this interesting and important class of carbon materials.

This study would not have been possible without the hard work and dedication of our PhD student Yanzhou Wang and the help of the other coauthors, as well as the support provided by the Academy of Finland and the CPU time and other computational resources provided by CSC and Aalto University’s Science IT project.

Adding dispersion (or van der Waals [vdW]) corrections to density functional theory (DFT) has been a very active area of research in the past 10-15 years. DFT is a mean-field theory and dynamical effects, such as the effects of fluctuating charge distributions on energy and forces, are notoriously missing. The leading term in these vdW corrections is the London dispersion force, which decays as the sixth power of the interatomic distance (*V*_{ij} ∝ −*C*_{6}/*r*_{ij}^{6}).

The first vdW correction scheme to achieve widespread adoption was the D2 method, due to Stefan Grimme (Grimme, J. Comp. Chem. 27, 1787 [2006]). In this method, the effective *C*_{6} coefficient that parametrizes the London dispersion formula is given as a function of the atomic species and a *damping function* was introduced to switch off the dispersion correction at short interatomic distances. Soon, more sophisticated approaches arose that improved the accuracy of vdW corrections, by incorporating both 1) more information about the dependence of the effective *C*_{6} coefficient on the local atomic environment and 2) improved damping functions that bridge the transition between the long-range (London-type) correlation energy and the short-range DFT correlation energy. The Tkatchenko-Scheffler correction scheme (Tkatchenko and Scheffler, Phys. Rev. Lett. 102, 073005 [2009]) is one such approach, which relies on an “atom in a molecule” approximation and relates the effective *C*_{6} coefficients and damping length scales to the Hirshfeld partitioning of the charge density.
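The working formula behind these pairwise schemes is simple to sketch: a damped sum of London terms. The toy single-species implementation below uses a Fermi-type damping function; parameter values are illustrative, not Grimme’s published D2 parameters:

```python
import math

# Toy Grimme-D2-style pairwise dispersion correction:
#   E_disp = -s6 * sum_{i<j} f_damp(r_ij) * C6 / r_ij^6
# with a Fermi-type damping function that switches the correction
# off at short range, where DFT already describes correlation.

def fermi_damping(r, r0, d=20.0):
    """~0 for r << r0 (DFT handles short range), ~1 for r >> r0."""
    return 1.0 / (1.0 + math.exp(-d * (r / r0 - 1.0)))

def e_dispersion(positions, c6, r0, s6=1.0):
    """Pairwise London dispersion energy for one atomic species.
    positions: list of (x, y, z); c6, r0: single-species constants."""
    e = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            e -= s6 * fermi_damping(r, r0) * c6 / r**6
    return e
```

At short distances the damping switches the correction off, while at large distances the full −*C*_{6}/*r*^{6} tail is recovered.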

In parallel to these efforts on dispersion-correction schemes, the community has made huge advances in the past 10 years on machine learning interatomic interactions, usually from DFT data. These machine learning (ML) force fields are normally based on a local (atom-wise) decomposition of the total energy, to keep the simulation problem tractable. This is fine for strong covalent and repulsive interactions, which are typically short ranged. For instance, in carbon materials the covalent (“bonded”) part of the total energy of an atomistic system can be accurately learned with local atomic descriptors that are “blind” beyond 4-5 Angstrom. But vdW interactions are long ranged and, depending on the level of detail one aims at capturing, the “local” atomic environment relevant to vdW interactions is of the order of 15-20 Angstrom. This is bad news, because the computational cost of ML force fields scales as the cube of this distance, known as the “cutoff”. This means that an ML force field with a 20 Angstrom cutoff is approximately 64 times as expensive, computationally, as another ML force field with a 5 Angstrom cutoff. Less obvious, but equally severe, limitations of “brute forcing” the learning of long-range interactions include the explosion of the size of configuration space with the cutoff, which requires exponentially more data for training these models.

Making ML potentials two orders of magnitude slower is an unacceptable tradeoff for including vdW corrections. The approach we, and also others before us, took in our recent paper is to machine-learn the Hirshfeld volumes, which are a function of the local atomic environment with locality similar to the covalent part of the total energy. Then, these Hirshfeld volumes are used to parametrize the Tkatchenko-Scheffler implementation of the London dispersion equation, which is computationally cheap to evaluate. In addition, we took care of efficiently coupling the covalent and vdW parts of the calculation, such that the overhead due to the Hirshfeld volume computation is very small, and the overall vdW-enabled force field is, for typical calculations, only between 20% and 50% more expensive than the force field without vdW corrections. A happy consequence of our ML implementation is that forces can be computed *more accurately than with the reference method*, because the gradients of the Hirshfeld volumes are readily available within the ML framework, whereas the DFT implementation cannot easily compute these terms (which are typically missing from DFT calculations). We implemented these corrections in the GAP and TurboGAP codes.
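In the Tkatchenko-Scheffler scheme, the Hirshfeld volume enters through simple scaling relations applied to free-atom reference values, which is what makes a machine-learned volume ratio so convenient to plug in. A minimal sketch (the free-atom reference values would come from tabulated data; here they are just arguments):

```python
# Sketch of how a (possibly ML-predicted) Hirshfeld volume ratio
# parametrizes the Tkatchenko-Scheffler scheme: free-atom C6,
# polarizability and vdW radius are rescaled by the "atom in a
# material" volume ratio v = V_Hirshfeld / V_free_atom.

def ts_parameters(v_ratio, c6_free, alpha_free, r0_free):
    """Effective per-atom TS dispersion parameters from a Hirshfeld
    volume ratio v_ratio = V_Hirshfeld / V_free_atom."""
    c6_eff = v_ratio**2 * c6_free               # C6 scales with the square
    alpha_eff = v_ratio * alpha_free            # polarizability scales linearly
    r0_eff = v_ratio ** (1.0 / 3.0) * r0_free   # radius scales with the cube root
    return c6_eff, alpha_eff, r0_eff
```

Since the volume ratio depends only on the local atomic environment, it can be learned with the same short-cutoff descriptors as the covalent energy, sidestepping the long-range learning problem entirely.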

As a proof of concept, we trained a new GAP for carbon and fine-tuned it specifically to simulate solid C_{60} (fullerite), a carbon material entirely made of C_{60} molecules. We could simulate large systems with thousands of atoms for relatively long MD times, and could chart the metastable high-temperature/high-pressure phase diagram of C_{60}, observing the transformations to graphitic carbon and amorphous carbon taking place in this material under specific thermodynamic conditions (see feature image at the top of this post).

This work is the culmination of a long effort by our group’s PhD student **Heikki Muhli**, which started more than 2 years ago during his MSc thesis project and continued as part of our COMPEX project, funded by the **Academy of Finland**, with CPU time provided by CSC and Aalto University’s Science IT project. It is also part of our continuous effort to develop and improve the **TurboGAP** code, which we hope to be able to launch within the next few months (the code’s development version is available online for the brave). For this work we collaborated with our group’s other PhD student **Patricia Hernández-León**, Aalto University’s **Xi Chen** and **Tapio Ala-Nissila**, and GAP authors **Albert Bartók** (Warwick) and **Gábor Csányi** (Cambridge).

And this is just the beginning. Newer vdW correction schemes worry about the role of “many-body” physics in vdW corrections, where the leading effect is how many-body interactions affect the effective *C*_{6} coefficients (Otero-de-la-Roza, LeBlanc and Johnson, Phys. Chem. Chem. Phys. 22, 8266 [2020]). One of these many-body dispersion methods is the so-called “many-body dispersion” method (MBD; *early bird gets the worm*), which also feeds on Hirshfeld charge density partitioning (Tkatchenko, DiStasio Jr., Car and Scheffler, Phys. Rev. Lett. 108, 236402 [2012]). With our new infrastructure to couple GAPs to simpler, locally parametrized, force fields, we will be able in the near future to incorporate many-body dispersion effects, as well as other long-range interactions, such as electrostatics.

This server will complement the amazing CSC resources that we have been using so far from the Puhti supercluster, which has a “hugemem” partition with 12 nodes equipped with 768 GB of RAM and another 12 nodes equipped with 1.5 TB of RAM. This means that our new SUMO server will allow us to use twice as much training data, plus reduce queue times since it’s a machine fully dedicated to our group. For optimal performance, we will combine Puhti resources with SUMO, where final fits incorporating as much data as we can fit within 3 TB of RAM will be run on SUMO, and both systems will provide us with plenty of CPU power and flexibility during the database and potential development stages.

Today, **Ivan Degtyarenko**, who is IT Specialist at Aalto University, gave **Jan Kloppenburg** and myself a tour of the facilities where SUMO-I will reside physically. SUMO-I will be housed at the CSC headquarters in Keilaniemi (next door to the Aalto Otaniemi campus) and integrated into the **Triton cluster**. It will be managed by the Science-IT project at Aalto University, which means we will have access to their existing HPC infrastructure (fast network, scratch filesystem, expert maintenance and support, etc.).

The acquisition of SUMO has been made possible thanks to the financial support from the Academy of Finland, and will give our group an edge for the development and deployment of accurate and fast machine learning interatomic potentials. We’re looking forward to starting to burn CPU time!

This paper is the product of **a lot** of work, spanning three and a half years, on identifying the growth mechanisms (yes, mechanism**s**) and characterizing the structure of amorphous carbon (a-C) across different densities. I thought it would be fitting to give a summary of how our simulations have contributed to understanding a-C, and how machine learning (ML) potentials have played a pivotal role in reaching our current level of understanding, beyond what was possible before ML simulation made an appearance in the arena of molecular and materials modeling.

Amorphous carbon is a disordered metastable form of elemental carbon (although it can also be doped, intentionally or unintentionally, with other elements, most notably hydrogen). As a testament to the incredible flexibility of C to form chemical bonds (which is at the root of the sheer complexity of organic molecules and life itself), a-C is made up of a mixture of C atoms with different environments: *sp* (as in acetylene), *sp*^{2} (as in graphite) and *sp*^{3} (as in diamond), depending on how many neighbors each C atom has. This material is of high interest in research and industry because its mechanical and electronic properties can be tuned between those of graphene/graphite and diamond, by adjusting the *sp*^{2}/*sp*^{3} ratio.

The structure of a-C (both the atomic and electronic structures, actually) has been under debate since the 1970s-1980s. In particular, scientists have been intrigued by how the high-density form, also referred to as tetrahedral a-C (ta-C), attains a diamondlike structure. This time frame corresponds to the early days of molecular modeling, and thus a-C has been a target for all sorts of computational studies since the 1970s and 1980s. As an anecdote, one of the first (if not the first), and also highly cited, papers on the atomic and electronic structure of a-C, based on tight-binding calculations, was coauthored by **John Robertson** (the a-C guru, whose review paper is a reference manual in the field, albeit a bit outdated now) and my PhD supervisor **Eoin O’Reilly**, while they were working in Cambridge on a-C back in the 1980s (I asked Eoin and he confirmed he was working on a-C already before I was born). What is funny about this fact is that I only started working on a-C once I left Eoin’s group and came to Aalto University in 2013.

At Aalto, the work at **Tomi Laurila**‘s group focused (and still does) on making electrodes coated with a-C for detection of biomolecules. For this application, understanding the surface structure and chemistry of a-C is very important. Back in 2013, the simulation work was done in collaboration with **Olga Lopez-Acevedo**, and **Rémi Zoubkoff** was the postdoc doing the heavy lifting on a-C modeling when I arrived. Rémi was trying to use tight-binding (TB) molecular dynamics (MD) for melt-quench simulation of a-C, a simulation method where a-C is generated by MD, by rapidly cooling down (quenching) a liquid C sample. He had lots of trouble with his approach because of 5-fold coordinated complexes (5-c) predicted by TB (funny in retrospect, since we spent so much time dealing with characterization of 5-c environments in the new paper). Since DFT-based melt-quench simulations also had trouble with the holy grail of ta-C modeling, predicting very high (> 80%) *sp*^{3} fractions for high-density samples, I started my postdoc by doing some DFT-based generation of ta-C structures using a different method, based on geometry relaxation followed by pressure correction. Those simulations gave pretty good results for the structure of ta-C, in comparison to experiment (see this paper and this paper), but still did not resolve the issue of how ta-C grows to be similar to diamond. Experimentally, ta-C is not grown by melt-quench, but by deposition (atoms get thrown at a substrate, using a cathodic arc or some other experimental apparatus). But those simulations, which were carried out first by **Nigel Marks** in 2005 with his carbon version of the EDIP potential, were completely out of reach for DFT, because of computational costs. And unfortunately Nigel’s simulations failed to reproduce the high *sp*^{3} fractions observed experimentally in ta-C films.

So we’re now at an impasse: DFT is too expensive to do deposition, but simulation of the deposition process would be the only way to elucidate the growth mechanism. So I gave up and moved away from the surface side, started looking at the electrolyte side of the electrochemistry problem (that’s when I got interested in the 2PT method and free energy calculations, see this paper and this paper), and forgot about the structure of a-C for a while. But then, when I was working on the computational and theoretical part of our carbon materials review, in 2016-2017, I came across a new (to me) method based on machine learning to model the interatomic interactions in carbon. There was an arXiv preprint (now also in Physical Review B) by **Volker Deringer** and **Gábor Csányi** on a so-called *Gaussian approximation potential* (**GAP**) for a-C. I was preparing a comparison between different simulation methods for the review (see below), and Volker’s paper was missing some detail I was interested in (I think it was bulk moduli).

So I sent an email to Volker and he replied with the information I was after, but he also told me that he would be coming to Aalto for a conference in early 2017, and why not discuss in person about this a-C simulation business. Volker’s talk at the conference and a chat with him at the canteen afterwards were the first time I heard about ML potentials, and their ability to accurately deal with interatomic interactions at a fraction of the computational cost of DFT. I got super excited about it, and proposed during our chat to do deposition simulations of ta-C with the new GAP. He would not say, but I am pretty sure from his expression that Volker thought this was a completely crazy idea. Mind you, while a lot cheaper than DFT, GAP simulations are still significantly expensive. Fortunately, in Finland we have excellent high performance computing (HPC) resources for research, provided by CSC.

So when Volker returned to Cambridge he sent me the files and showed me how to use GAP in combination with LAMMPS. I then started doing the deposition simulations at three different energies: 20, 60 and 100 eV. They progressed incredibly slowly on CSC’s former supercluster Taito (now replaced by Puhti). I then moved them to Sisu, CSC’s former supercomputer (now replaced by Mahti), and got better scaling. But still, these simulations progressed incredibly slowly, because deposition is intrinsically sequential (one deposition event followed by another). One needs to run the impact event with small time steps, since the incident atom is initially traveling so fast, and then the excess kinetic energy needs to be removed from the substrate by equilibration. And repeat. *Many* times. One example of this process is shown in the video below.
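The structure of this deposition loop looks roughly as follows. This is a stubbed schematic, not real TurboGAP/LAMMPS calls; the function names, timesteps and step counts are illustrative:

```python
# Schematic of the sequential deposition protocol. Each cycle is an
# impact simulated with a small timestep, followed by an equilibration
# that drains the excess kinetic energy from the substrate.

def deposit_one(substrate, energy_ev):
    """Fire one C atom at the substrate with a given kinetic energy (stub)."""
    substrate.append({"energy_ev": energy_ev})

def run_md(substrate, steps, dt_fs, thermostat_k=None):
    """Stub MD run; a real driver would integrate the equations of motion."""
    return steps * dt_fs  # elapsed simulated time in fs

substrate = []
for _ in range(1000):                          # one deposition event at a time
    deposit_one(substrate, energy_ev=60.0)     # e.g. the 60 eV series
    run_md(substrate, steps=2000, dt_fs=0.1)   # impact: small timestep
    run_md(substrate, steps=1000, dt_fs=1.0,
           thermostat_k=300.0)                 # equilibrate: drain excess heat
```

Because each event depends on the outcome of the previous one, the wall-clock time is dominated by this serial chain no matter how many cores are available.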

I went to visit Volker and Gábor in Cambridge during the summer of 2017, while these calculations were still running (painfully slowly), to discuss ta-C modeling. I remember getting excited every day around that time, as the bulklike portion of the film kept forming and it looked like we were going to hit the previously unattainable 90% mark… It took 3 months of continuous runs on Sisu (and about 3M CPUh) to get these films to grow. You can watch them grow in the movie below. More videos of deposition at different energies and the resulting structures are available from Zenodo.

We submitted this paper to *Physical Review Letters* in late 2017 and it got accepted in 2018 with glowing reviews (two “publish as is” and one “minor corrections”). You can access the paper here (or on arXiv, if you don’t have an APS subscription). You can also check the synopsis written by APS on our paper. The most significant aspect of this paper is that it settled the question of how diamondlike a-C grows, i.e., following the **“peening” mechanism**, instead of the widely accepted “subplantation” mechanism.

After this first deposition paper, which focused on explaining the growth of high-density ta-C, we started working on the surface chemistry of a-C, which led to four *Chemistry of Materials* papers (structure1, structure2, x-ray1, x-ray2; I have previously written a blog post on the x-ray spectroscopy papers on this website). But we also kept alive the flame of understanding the growth and structure of a-C throughout the full range of mass densities. Low-density nanocarbons are very interesting at the moment in the context of energy storage, since they are porous, and other compounds can be stored in those pores. Unfortunately, all the stuff going on with the surface chemistry of a-C and other developments in ML potentials, teaching, event organization, not to mention trying to secure research funds and advance my career, meant that finalizing the work on deposition of a-C progressed slower than expected.

However, all’s well that ends well, and we have finally managed to publish our comprehensive simulations, which characterize the structure and growth mechanism of a-C from low (graphitelike) to high (diamondlike) densities, as shown above and below. At low energies, a-C grows by “direct attachment”, whereas at high energies it grows by peening.

And besides the implications for carbon science in general, and a-C knowledge in particular, one of the most significant aspects of our work is that it showed that the new ML potentials can be used to solve outstanding problems in molecular and materials modeling, previously out of reach due to computational limitations.

The availability of new piezoelectric materials compatible with silicon chip integration for micro-electromechanical systems (MEMS) applications is a highly attractive prospect. These new materials will help to bridge the gap between mechanical and electronic devices, making MEMS increasingly small and efficient. AlN is today’s industry standard and research is intensifying worldwide on AlN derivatives such as ScAlN. By alloying AlN with Sc, the crystal lattice is locally distorted due to the phase competition between the rock-salt ScN and wurtzite AlN structures, resulting in a progressive transition of AlN from wurtzite into a hexagonal-layered structure as the amount of Sc dopant atoms increases. This, in turn, induces an enhancement of the piezoelectric coefficients of ScAlN up to 50% Sc content (see figure).

One of our research interests is to apply different computational techniques to discover new piezoelectric materials with enhanced piezoelectric coefficients. We have started with the exhaustive characterization of ScAlN, already published. After a break on this front, we are looking forward to starting a new international collaboration in 2020, including industry partners, on the discovery of new piezoelectric materials (details to follow).
