The American Chemical Society (ACS) defines analytical chemistry as:
The science of obtaining, processing, and communicating information about the composition and structure of matter. In other words, it is the art and science of determining what matter is and how much of it exists. (Source: https://www.acs.org/content/acs/en/careers/chemical-sciences/areas/analytical-chemistry.html)
In the late 18th century, analytical chemistry was limited to gravimetric (i.e., weight-based) analysis and titrations using color-changing solutions. With the development of computer systems, electronics, optics, and vacuum systems, analytical measurements are now carried out very differently. Modern instruments collect more data in less time than ever before, so the way that data is analyzed also needs to evolve. I will illustrate this point with an example.
In graduate school, I was asked to develop a mass spectrometry-based method to classify bacteria based on their mass spectrometric fingerprints. So, I developed a method to take bacteria from the petri dish where they were incubated, lyse the cells, and then spray the bacteria directly into the mass spectrometer. After analyzing all 78 samples, I could see that each genus of bacteria had a distinct chemical fingerprint. That was a good indicator that I could rely on a pattern recognition algorithm to classify these bacteria. But then I ran into a major issue. Each bacterial fingerprint was represented by 1,792 variables. As a result, there were over 139,000 data points (78 samples × 1,792 variables) when I compiled the data. That was far too many for the commercial software I was using. To keep the project on track, I partnered with a statistician to write a script to carry out the analysis I wanted to perform. But this still resulted in delays. Now, a few years into my data science journey, I see how essential it is to be able to programmatically explore and analyze one’s own data, especially for large multivariate datasets.
Although I would like to take credit for being the first to recognize this need, many publications going back several years discuss this skill set within analytical chemistry and science more broadly. A review article by Dr. Ewa Szymańska titled “Modern Data Science for Analytical Chemical Data – A Comprehensive Review” nicely describes the benefits of numerous data science methods for modern analytical chemistry. I really like that she did not jump right to model building but discussed how data science can be used to filter outliers, clean data, enhance visualizations, and drive decision making. Over the past three years of my learning journey, I have applied many of these approaches in my data analysis pipeline.
I understand there is a tension. Learning how to program is challenging and maybe even annoying, especially in the early days when there is little to show for your efforts. Moreover, despite the shade many programmers throw at Excel, it remains one of the most popular software tools for data analysis, visualization, and yes, data science, according to Data-Flair. Between Excel’s wide availability and its point-and-click user interface, I see the appeal. But it remains inadequate for processing large swaths of data. Many of the major instrument manufacturers have also noticed the need for multivariate data analysis tools, and some of them now offer chemometric (i.e., data science for analytical chemistry data) add-ons as part of their software suites. I applaud this effort because it makes certain algorithms accessible to scientists who would never want to write a line of code. But I still believe that more effective data analysis can be accomplished programmatically. Building on what Dr. Szymańska wrote about, I will address a few more benefits of programmatic data analysis.
Reproducibility is one of the core features of good science. Traditionally, we think about this from sample preparation to result generation, but reproducible data analysis is just as vital. When writing a program, you are telling a computer exactly what to do with your data. Because the procedure is written down as code, reviewers can follow and replicate your processing step by step. A spreadsheet-based approach, on the other hand, is exceedingly difficult to follow, especially when multiple data transformations occur.
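To make this concrete, here is a minimal sketch of a processing script in Python. The two steps (baseline subtraction and normalization to total intensity) and the function names are illustrative assumptions, not the method used in my own work; the point is only that every transformation is listed explicitly and in order.

```python
from typing import Callable, List, Sequence

Spectrum = List[float]


def subtract_baseline(spectrum: Sequence[float]) -> Spectrum:
    """Shift the spectrum so its lowest intensity becomes zero."""
    floor = min(spectrum)
    return [value - floor for value in spectrum]


def normalize_to_total(spectrum: Sequence[float]) -> Spectrum:
    """Divide every intensity by the summed intensity of the spectrum."""
    total = sum(spectrum)
    if total == 0:
        return list(spectrum)
    return [value / total for value in spectrum]


# The whole procedure, in order. Anyone reading the script sees every step.
PIPELINE: List[Callable[[Sequence[float]], Spectrum]] = [
    subtract_baseline,
    normalize_to_total,
]


def process(spectrum: Sequence[float]) -> Spectrum:
    """Apply each pipeline step to the spectrum, one after another."""
    result = list(spectrum)
    for step in PIPELINE:
        result = step(result)
    return result


# Hypothetical raw intensities, for illustration only.
raw = [12.0, 10.0, 30.0, 18.0]
processed = process(raw)
```

Running the script again on the same raw values gives the same result, and changing the procedure means changing one visible line in `PIPELINE` rather than a formula hidden in a cell.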
Data fusion is the act of combining data from multiple sources to produce a more accurate outcome. This could mean combining mass spectrometry data with Raman or IR data to gain insight into a material characteristic. This may sound complicated, but it is remarkably simple with Python and R. However, because of the file sizes involved, this type of transformation is challenging with spreadsheet editors, and it is offered by few, if any, instrument vendor chemometric packages.
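As a rough illustration, one simple form of data fusion (often called low-level fusion) scales each block of measurements and then joins the blocks sample by sample. The sketch below assumes each technique's data is stored as a dictionary mapping a sample ID to a list of values; the sample IDs, numbers, and function names are made up for the example.

```python
from statistics import mean, pstdev
from typing import Dict, List

Block = Dict[str, List[float]]


def autoscale(block: Block) -> Block:
    """Scale every variable (column) in a block to mean 0 and std 1."""
    ids = sorted(block)
    columns = list(zip(*(block[sample] for sample in ids)))
    scaled_columns = []
    for column in columns:
        mu, sigma = mean(column), pstdev(column)
        # A constant column carries no information, so map it to zeros.
        scaled_columns.append(
            [(v - mu) / sigma if sigma else 0.0 for v in column]
        )
    return {
        sample: [column[row] for column in scaled_columns]
        for row, sample in enumerate(ids)
    }


def fuse(*blocks: Block) -> Block:
    """Concatenate autoscaled blocks for the samples present in all of them."""
    shared = set(blocks[0]).intersection(*blocks[1:])
    scaled = [autoscale({s: block[s] for s in shared}) for block in blocks]
    return {s: [v for block in scaled for v in block[s]] for s in sorted(shared)}


# Hypothetical data: two mass spectrometry variables and one Raman variable.
mass_spec = {"A": [100.0, 5.0], "B": [300.0, 7.0], "C": [200.0, 9.0]}
raman = {"A": [0.1], "B": [0.3], "C": [0.2], "D": [0.9]}

fused = fuse(mass_spec, raman)  # sample "D" is dropped: no mass spec data
```

Scaling each block first keeps a technique with large raw numbers (here, the mass spectrometry intensities) from dominating the fused table. Real projects may prefer other scaling or weighting schemes.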
For smaller datasets, spreadsheets tend to work very well for simple processing. However, current analytical instruments, especially high-resolution instruments, can produce more than 10,000 data points per sample. Because of their relatively inefficient memory usage, most spreadsheet editors cannot hold that much data without crashing. Here is where instrument software can be useful, especially for screening data quality without writing scripts. For more detailed data processing, well, you know what I’m going to say.
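One way a script sidesteps the memory problem is by reading one sample at a time instead of loading everything at once. The sketch below assumes a hypothetical CSV layout with one sample per row, the sample ID in the first column, and intensities in the remaining columns; an in-memory string stands in for a real file.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def summarize_spectra(handle: TextIO) -> Iterator[Tuple[str, int, float, float]]:
    """Yield (sample_id, n_points, total_intensity, base_peak) row by row.

    Only one row is held in memory at a time, so the file can be far
    larger than what a spreadsheet editor could open.
    """
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        sample_id = row[0]
        values = [float(v) for v in row[1:]]
        yield sample_id, len(values), sum(values), max(values)


# Stand-in for open("spectra.csv"); the file name and values are made up.
demo = io.StringIO("s1,1.0,4.0,2.5\ns2,0.5,0.5,9.0\n")
summaries = list(summarize_spectra(demo))
```

With a real file, the same function can be used as `with open(path, newline="") as handle: ...`, and the per-sample summaries give a quick screen of data quality before any heavier analysis.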
Another major benefit of data science is improved data visualization. Creating a good figure, one that is simple to understand but highlights key information, is part science and part art. Spreadsheet editors often have many data visualization options, but they can be hard to use when trying to layer multiple encodings. Additionally, the visualization is often static, so insight is limited to what can be gleaned from a fixed graphic. Programmatically creating figures allows for more dynamic rendering that better captures the essence of the data. Languages like R and Python also permit the creation of interactive visualizations, which can greatly enhance data storytelling and allow for deeper insights.
Since incorporating data science into my workflow, my ability to answer tough questions with data has significantly improved, and I know yours will too. My learning journey has been a combination of over 30 online courses on platforms like Coursera and edX. I have also watched hours of YouTube tutorials. One of my favorite content creators is Kevin Markham, the founder of Data School. Lastly, I have read several books on the subject. In a future post, I will talk more about my journey from having almost no programming experience to being highly proficient in less than three years.
Modern analytical instruments require modern data analysis tools, and as technology continues to improve, this becomes even more important. Every step of the data analysis pipeline can be enhanced by the tools mentioned in this post, and they should be strongly considered. Even if you are not ready to make the change, awareness is still important. As an analytical chemist by training, I have written this from the point of view of an analytical chemist, but these factors hold true for everyone collecting and analyzing data. I hope you have enjoyed the blog. See you in the next post!