I have been working as a data scientist in the Department of Anaesthesiology at the UMCG since 2009. This fits in very well with my background in artificial intelligence. The work of a data scientist is unknown territory for many people. It may not be a profession that scores well at parties, but in my opinion it is an important link in the research cycle, especially now that the correctness of data is increasingly being put under the magnifying glass.
As one of the largest departments in the hospital, Anaesthesiology is involved in the care of a large proportion of all UMCG patients. There are five research groups, each of which looks at the data needed for research from a different angle. For our care, evaluation and research, we draw on data from various sources, recorded at different frequencies and with different degrees of "cleanliness".
Our team operates according to a clear vision: consult the expert and do not reinvent the wheel. This vision also applies to data. We believe that researchers should focus on their core task: delivering care and research. Researchers are often not trained in dealing with complex data flows and underestimate the mistakes that you, as a human being, are bound to make in the analyses. And yes, doctors are people too. I always compare it to giving an anaesthetic. You can teach me to insert a needle into a patient, but that does not make me an anaesthesiologist.
The diagram below nicely shows the areas in which a data scientist operates. I would like to briefly describe some of these areas in the order in which they often appear in my work.
(Source: Data Science Partners)
A data request begins with the preparation of academic research. After approval of the research, in consultation with the research coordinator, researchers come to me to draw up a data plan. In this plan, we check whether the requested data is exactly the data needed to answer the research question. We map out which databases the data has to come from and in what form it has to be delivered. Data for anaesthesia research often comes from various types of measuring equipment, such as pumps, ventilators and vital parameter monitors. Sometimes this means that specialist software must be written to read out new equipment. This is also an important part of my work, carried out in collaboration with Medical Technology.
Business knowledge / domain knowledge
During a meeting with the researcher, it is important that I also have some basic knowledge of the field of anaesthesiology. This helps me to think about how the data will be collected, whether certain impurities in the data must be taken into account and ultimately in what form the data must be returned to the researcher. Knowledge of the domain is indispensable for linking the right data to the questions that the researcher wants to answer.
Data retrieval and processing
Once it has been determined exactly which data should be made accessible, it is important to 'lock' the source data. This means that it is decided that the locked data is the data on which the analyses will be done. No more data will be added. Validated algorithms extract the necessary data and each step in this process is recorded. This makes it possible to show, in the event of an audit, exactly how the extraction of the source data took place. This validation can take place in all steps from source data to final publication. In this way, it is impossible to cheat with the data and errors are reduced to a minimum.
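To make the idea of locking and recording a little more concrete, here is a minimal sketch in Python. It is not the software we use in the department. The names `fingerprint` and `AuditLog`, the choice of a SHA-256 hash and the example rows are all assumptions made purely for illustration. The sketch fingerprints the locked source data and then stores, for every extraction step, a fingerprint of what went in and what came out.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AuditLog:
    """Hypothetical helper: remembers the locked data and every step taken."""

    def __init__(self, source_rows):
        # 'Locking' here simply means remembering the fingerprint of the source.
        self.locked_digest = fingerprint(source_rows)
        self.steps = []

    def record(self, name, input_rows, output_rows):
        # Store what went into the step and what came out of it.
        self.steps.append({
            "step": name,
            "time": datetime.now(timezone.utc).isoformat(),
            "input": fingerprint(input_rows),
            "output": fingerprint(output_rows),
        })

    def source_unchanged(self, source_rows):
        # True only if the source still matches the locked fingerprint.
        return fingerprint(source_rows) == self.locked_digest


# Made-up example data, not real patient data.
source = [
    {"patient": "A", "heart_rate": 72},
    {"patient": "B", "heart_rate": None},
    {"patient": "C", "heart_rate": 65},
]

log = AuditLog(source)

# One extraction step: keep only rows with a recorded heart rate.
complete = [row for row in source if row["heart_rate"] is not None]
log.record("drop rows without heart rate", source, complete)
```

In an audit, `log.steps` shows which steps were taken and in what order, and `log.source_unchanged(source)` returns `False` as soon as a row is added to or changed in the source after locking.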
Mathematics, Statistics and…Ethics
It is often agreed in advance whether the researcher will carry out the analysis or whether I will take care of it. Especially for more complex data questions, it is often more efficient to let the data scientist do the analysis. This reduces the chance of errors, increases the repeatability and ensures that we can work in parallel.
During the analysis phase, there is a lot of consultation with the researchers. After extraction, I always start by visualising all the data. If it looks like what we expect, we proceed to carry out statistical tests. During these steps, I can make less use of validated methods, as each research question requires a different approach. For each research question, I create a specific algorithm with visualisations/tests. These methods then become part of the dataset. This way, it is always possible to check afterwards which steps were taken to reach the final conclusion.
As mentioned earlier, domain knowledge is very important, but the challenge is not to have too much domain knowledge. The expression "Ignorance is bliss", familiar from the film The Matrix, also applies to my work as a data scientist. I try not to make any value judgments about the actual outcome of the research. I am merely the messenger of the (sometimes) bad news. According to good scientific principle, the academic cycle must then begin again, however difficult that may sometimes be. It is my job to stick as closely as possible to the data plan. All this is to prevent so-called fishing expeditions.
The motto of our team is therefore: the data = the data.