The Most Common Forms of Research Data Fraud and How to Detect Them
The foundation of the open science movement rests on the belief that embracing open sharing practices will lead to wider accessibility of research findings and accelerate the exchange of knowledge. However, openness of research results itself is not enough, as it does not guarantee their reproducibility and reusability. Open data must be annotated with detailed metadata and equipped with all supporting documents and computer code, which allow research to be repeated. Fundamentally, the data must be accurate and credible, i.e., obtained with the greatest degree of research integrity. Unfortunately, the reality is often different. Scientific publications and research data should not be blindly trusted, regardless of whether they were published in open access or behind a paywall. In this article, we will examine the forms of data fraud that warrant heightened awareness and explore the methods for their detection.
Frequency of Research Integrity Violations and Their Causes
Since the beginning of the 21st century, there has been a significant increase in retractions of scientific publications, reaching around 1400 per year according to Ivan Oransky, co-founder of RetractionWatch. Specialists in detecting fraud with images in biomedical articles, such as Elisabeth Bik and Enrico Bucci, estimate that approximately 4% to 6% of these articles may contain cloned or manipulated image material. These estimates were made before the widespread adoption of generative artificial intelligence tools like ChatGPT and Midjourney. The potential impact of these tools on research integrity violations is yet to be fully understood.
Modern science is grappling with a crisis of research result reproducibility, which goes beyond intentional fraud and stems from various factors. A well-known meta-analysis focused on psychology research revealed that only approximately 40% of the analyzed results could be successfully reproduced. More concerning figures emerge from industry studies, where researchers at Bayer were able to replicate only about a quarter of preclinical studies, and Amgen researchers even fewer, only about 11%. Additionally, a significant portion of clinical studies, at least 44%, contains errors in the data, and in 26% of cases, the errors are so prevalent or severe that the findings are impossible to trust.
In 2016, Nature magazine conducted a survey of 1576 researchers to identify the primary causes of the reproducibility crisis. The top three responses were selective reporting, pressure to publish, and low statistical power, including incorrect analysis of results. Marcus Munafò from the University of Bristol suggests that the modern academic culture, which prioritizes novelty and breakthroughs, has discouraged replication studies, publication of negative or null results, and transparent practices, thus undermining the significance of confirmations, repetitions, and transparency. To address this issue, potential solutions span a wide range of measures, such as altering reward and incentive systems, implementing stricter quality control before publication, and reevaluating existing research practices.
Types of Research Data Fraud
Research data fraud can be roughly divided into the following five groups:
Fabrication refers to the act of generating false data and presenting it as genuine research findings. It represents the most severe form of scientific fraud and has led to notable retractions of scientific publications. According to a meta-analysis conducted by Daniele Fanelli from the University of Edinburgh in 2009, the frequency of fabrication and falsification of data among researchers was estimated to range from 0.3% to 4.9%, with a weighted average of 1.97% (he considered both phenomena together). However, since the data was based on self-reporting by researchers and their perceptions of others' behavior, the actual numbers could potentially be higher.
In recent years, there has been an increase in fabricated scientific publications originating from what are known as scientific paper mills. These companies, based in China, Russia, and Iran, engage in the sale of scientific articles and co-authorships to researchers. The widespread availability of generative artificial intelligence tools has further accelerated and facilitated their operations, a trend that was already on the rise before 2022. One potential solution to curb the proliferation of fabricated publications could involve the mandatory submission of raw data alongside the draft of the publication during the peer review process. However, implementing such measures takes time, because scientific publishers must first adapt their digital infrastructure to handle research data effectively.
Falsification encompasses any intentional alterations made to genuine data, resulting in them portraying a different situation than the actual reality. Examples of such actions include:
- removing part of the data from images by retouching and cropping;
- changing numerical values (e.g., excessive rounding);
- combining data from unrelated experiments and displaying them as a single data set;
- adding or removing data points on graphs;
- the use of parametric statistical tests where non-parametric ones would be needed (e.g., for small samples);
- removing outliers from statistical analyses;
- p-hacking, etc.
Some refer to this type of fraud as "beautification", particularly when the data are selectively chosen to enhance clarity and distinctness, leading to a simplified and clearer interpretation. Another term used in this context is "cherry-picking", where only data that align with specific ideas or preconceived conclusions are chosen, whereas any conflicting data are disregarded.
Misrepresentation refers to the act of interpreting research data in a manner that deviates from their true meaning, regardless of whether the data are complete and accurate or not. Examples of this type of fraud include:
- exaggerating the importance or impact of research findings, including the rationalization of negative results (JARKing);
- stating or altering hypotheses after the results are already known (HARKing);
- statistical extrapolation without taking into account that the reliability of the forecast decreases with distance from the interval of known values;
- interpreting correlation as causation;
- disregarding confounding factors/variables;
- interpreting the p-value as a measure of effect, when in fact it only expresses the probability of obtaining a result at least as extreme as the observed one if the null hypothesis is true;
- manipulation of scales on graph axes;
- recycling the same images in different articles to substantiate different conclusions at different times, etc.
Misrepresentation can also occur when scientific findings are presented to the general public in layman's terms, which is sometimes referred to as "spinning".
Plagiarism refers to the act of using others' ideas or works without proper citation or authorization from the original authors. While commonly associated with texts, plagiarism can also extend to research data, particularly when combined with plagiarized pictorial material. This unethical practice is frequently observed in scientific publications originating from scientific paper mills.
In an academic context, sabotage refers to actions aimed at impeding colleagues' research work and its quality. Such actions may include intentionally omitting information from research protocols to render them less useful for others, including procedures described in scientific articles. Sabotage may also involve the destruction of one's own or others' research data, etc.
Statistical fraud is among the most prevalent forms of scientific fraud since it can be executed on genuine, correctly acquired research data, making it harder to detect than outright fabrication. It mostly belongs under the umbrella of data falsification or misrepresentation. Detecting such fraud demands an extensive understanding of statistics, as it may escape the notice of the untrained eye. It is essential to note that not all statistical errors stem from deliberate fraud; many can be attributed to ignorance or poor practices perpetuated through generations due to systemic issues in the scientific community.
Before delving into the most common types of statistical errors, let's first establish the criteria for determining when statistical analyses are accurate. Karen Grace-Martin, a statistician and founder of The Analysis Factor, identifies two conditions in an article titled What makes a statistical analysis wrong? These conditions are as follows:
- the statistical test is appropriate for the given assumptions (takes into account the measurement scale of the variables, the design of the study or experiment, and the properties of the data) and
- the statistical test is able to answer a given research question.
Navigating through either of these two conditions can lead to complications, making it crucial to meticulously plan statistical analyses, both in theory and practice.
In 2019, Tamar R. Makin from University College London and Jean-Jacques Orban de Xivry from KU Leuven compiled a list of the ten most frequent statistical errors that arise from inefficient experimental design, inappropriate analyses, or faulty reasoning in their article Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. While these errors may not always constitute deliberate fraud, they can still significantly compromise the quality of research outcomes if left unaddressed.
Stuart McErlain-Naylor from Loughborough University has creatively visualized the key content of the mentioned article in a video, which can be an invaluable resource for comprehending the descriptions. Furthermore, Raghuveer Parthasarathy from the University of Oregon has offered additional explanations and illustrative examples for the errors outlined in the article on his blog The Eighteenth Elephant, which we will utilize in our interpretation.
1. Absence of suitable control conditions or control groups
If we aim to investigate the influence of a particular factor on a chosen variable, it is crucial to incorporate control conditions (absence of the factor) within the experiment or study. This is necessary because the variable's value can be affected not only by the factor of interest but also by other external circumstances. For instance, when examining the impact of sports activity on body mass, we must include individuals who are not physically active as a control group in the study. This ensures that any changes in body mass can be attributed to the sports activity itself, as variations in diet or other factors could also influence the variable.
In an ideal scenario, the control conditions (or control group) should mirror the experimental conditions in terms of design and statistical power, with the sole exception of the factor being studied. Negative and positive controls should also be differentiated, with the former being a control group subjected to no intervention, and the latter being exposed to a substitute factor, such as a placebo. To minimize bias, the test and control units should be sampled simultaneously and assigned randomly to both groups. This helps ensure that any observed effects are attributed accurately to the factor under investigation and not influenced by confounding variables.
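Random assignment to the test and control groups can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_groups`, the participant labels, and the seed are hypothetical and do not come from the cited articles.

```python
import random


def assign_groups(unit_ids, seed=None):
    """Randomly split units into a test group and a control group.

    The two groups differ in size by at most one unit.
    """
    rng = random.Random(seed)      # a fixed seed makes the assignment reproducible
    shuffled = list(unit_ids)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant labels P01 ... P20.
participants = [f"P{i:02d}" for i in range(1, 21)]
test_group, control_group = assign_groups(participants, seed=42)

print("test:   ", sorted(test_group))
print("control:", sorted(control_group))
```

Recording the seed together with the list of units lets others reproduce and audit the assignment later.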
Some examples of unsuitable control conditions are:
- the control group is not subjected to the substitute factor (absence of positive control),
- the control group is too small, so its statistical power is not sufficient to detect a change in the test group,
- the control group is subjected to different conditions than the test group, which can lead to biased comparisons,
- the experiment is not "single-blind" (the researchers know or can predict the outcome).
2. Interpreting comparisons between two effects without directly comparing them
Statistical testing commonly involves searching for "statistical significance" to assess differences between the test group before and after an intervention or correlations between two variables. For instance, if an intervention or observed factor results in a statistically significant difference in the test group compared to the baseline, while the change in the control group over the same period is not statistically significant, researchers may infer that the intervention or factor had a greater impact on the test group than on the control group. The same principle applies when comparing the outcomes of two different test groups.
Such a conclusion is flawed because it rests on two separate tests instead of one direct comparison between the selected groups, which may well show that there is no difference between them (Raghuveer Parthasarathy provides a good graphic). The risk of this error is also the reason that data from experiments with one control and multiple test groups should be analyzed with a method such as ANOVA, rather than with consecutive tests for two independent samples (t-test or comparable non-parametric test).
This error occurs because the cutoffs of statistical significance (typically p < 0.05, i.e., less than 5% probability of obtaining such a result if the null hypothesis is true) are set arbitrarily and are not measures of effect. In addition, every measurement contains some amount of noise, whether due to measurement error, system variability, or some other source of randomness. Therefore, it is necessary to compare the groups in a way that takes into account the variability of the sample and not just the difference between the group mean and some default value. You can read more about this in the 2006 article The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant by Andrew Gelman and Hal Stern.
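The point can be made concrete with a small calculation. The sketch below uses hypothetical effect estimates and standard errors (they are not data from any cited study) and assumes the estimates are independent and approximately normally distributed, so that a simple z-test applies.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical estimates: effect size and standard error in two groups.
effect_a, se_a = 25.0, 10.0
effect_b, se_b = 10.0, 10.0

p_a = two_sided_p(effect_a / se_a)   # group A tested against zero
p_b = two_sided_p(effect_b / se_b)   # group B tested against zero

# Direct comparison: test the difference between the two effects.
diff = effect_a - effect_b
se_diff = math.sqrt(se_a**2 + se_b**2)   # valid for independent estimates
p_diff = two_sided_p(diff / se_diff)

print(f"A vs. zero: p = {p_a:.3f}")
print(f"B vs. zero: p = {p_b:.3f}")
print(f"A vs. B:    p = {p_diff:.3f}")
```

Under these assumptions, group A crosses the 0.05 threshold and group B does not, yet the direct comparison of A with B is far from significant.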
3. Inflating the units of analysis
An experimental unit is the smallest unit of observation that can be chosen randomly and independently. In classical statistics, the number of experimental units determines the degrees of freedom. Thus, in the analysis of group results, the number of experimental units is the number of tested or observed entities rather than the number of measurements carried out on those entities. As the number of degrees of freedom increases, the threshold against which we evaluate "statistical significance" decreases (the statistical power of the test increases), which can lead to falsely significant results.
Raghuveer Parthasarathy gives the following example. In two groups of 20 people, three measurements of body mass per person are taken to assess whether the difference in average body mass between the groups is statistically significant (the actual data are provided in the article). The number of experimental units per group in this case is 20, i.e., the number of people in the group, and not 60, i.e., the total number of measurements. Successive measurements do not contribute to the number of independent data points that are the basis for group comparison. If we test the data with a t-test at N = 20 (correct), we arrive at p = 0.20, which is not statistically significant; using N = 60 (wrong) gives p = 0.03, which is statistically significant.
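The mechanism can be illustrated with a minimal sketch. The body masses below are hypothetical and much fewer than in Parthasarathy's example (five people per group, three measurements each); the helper names are assumptions made for illustration. The code computes only the t statistic and the degrees of freedom; turning them into a p-value requires the t distribution, which a statistics package such as SciPy provides.

```python
import math
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Student's t statistic and degrees of freedom for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / se, df


def repeated(masses):
    """Three hypothetical repeated measurements (kg) per person."""
    return [[m - 0.5, m, m + 0.5] for m in masses]


group_a = repeated([60, 70, 80, 90, 100])
group_b = repeated([70, 80, 90, 100, 110])

# Correct: one value per experimental unit (the mean of each person's measurements).
t_correct, df_correct = pooled_t([mean(p) for p in group_a],
                                 [mean(p) for p in group_b])

# Wrong: every measurement is treated as if it were an independent unit.
t_inflated, df_inflated = pooled_t([x for p in group_a for x in p],
                                   [x for p in group_b for x in p])

print(f"per person:      t = {t_correct:.2f}, df = {df_correct}")
print(f"per measurement: t = {t_inflated:.2f}, df = {df_inflated}")
```

The group means are identical in both analyses; only the standard error shrinks when measurements are counted as units, which inflates the t statistic and the degrees of freedom.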
4. Spurious correlations
Correlations are used in statistics to assess how strongly two variables are related. Spurious correlations most often arise from four causes:
- the data contain an outlier that is plausible but not representative of the general sample (Raghuveer Parthasarathy cites the example of spontaneous mutations in the Luria–Delbrück experiment);
- certain data points were removed from the data;
- the analysis does not take into account all variables, especially confounding factors;
- the data were created by combining two groups of results that differ in their characteristics.
Spurious correlations are best avoided by using statistical tests that are more robust than Pearson's correlation coefficient (Makin and Orban de Xivry give some examples) and are not as sensitive to outliers. Visualizing the results graphically is also highly recommended to ensure the clear visibility of the data point distribution.
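As a sketch of why rank-based measures are less sensitive to outliers, the following compares Pearson's coefficient with Spearman's rank correlation on hypothetical data: eight unrelated points plus one extreme point. Spearman's coefficient is used here as one common robust alternative; it is not necessarily among the examples that Makin and Orban de Xivry give.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: the last point is a single extreme outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 30]
y = [5, 3, 6, 2, 7, 1, 8, 4, 40]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
r_without_outlier = pearson(x[:-1], y[:-1])

print(f"Pearson with outlier:    {r_pearson:.2f}")
print(f"Spearman with outlier:   {r_spearman:.2f}")
print(f"Pearson without outlier: {r_without_outlier:.2f}")
```

A single point is enough to push Pearson's coefficient close to 1, while the rank-based coefficient stays modest. A scatter plot of the same data would reveal the outlier immediately.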
5. Using (too) small samples
Utilizing small samples is a common practice in scientific research, often due to various constraints such as limited test subjects, restricted resources, or the complexity and cost of research procedures. However, working with small samples leads to at least five issues:
- stochasticity; as noted by Raghuveer Parthasarathy, in 10 flips of a coin, the probability of getting exactly 6 heads is about 20%, while in 100 flips, the probability of getting exactly 60 heads is only about 1%;
- increased probability of type I error (false-positive result); in small samples, only large observed effects reach statistical significance, so researchers often mistake an overestimated effect for a genuine one (Raghuveer Parthasarathy provides a good visualization);
- increased probability of type II error (false-negative result); as the number of sample units increases, the power of the sample, i.e., the probability of detecting an effect that is actually present, increases, while the probability of missing the effect decreases;
- deviations from a normal (Gaussian) distribution that complicate statistical testing (parametric tests that assume a normal distribution, such as the t-test, cannot be used, and non-parametric tests also have their limitations);
- the occurrence of extreme outliers, which can lead to the abovementioned spurious correlations.
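The coin-flip figures in the first item of the list follow from the binomial distribution: they are the probabilities of obtaining a 60% share of heads, i.e., exactly 6 heads in 10 flips or exactly 60 heads in 100 flips of a fair coin. A minimal check (the function name is an assumption made for illustration):

```python
import math


def prob_exact_heads(n_flips, n_heads):
    """Probability of exactly n_heads heads in n_flips flips of a fair coin."""
    return math.comb(n_flips, n_heads) / 2 ** n_flips


p_small = prob_exact_heads(10, 6)
p_large = prob_exact_heads(100, 60)

print(f"6 heads in 10 flips:   {p_small:.3f}")
print(f"60 heads in 100 flips: {p_large:.3f}")
```

The same relative deviation from the expected 50% is common in a small sample and rare in a large one.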
Makin and Orban de Xivry suggest calculating the statistical power of the sample based on its size and repeating the experiments several times. Parthasarathy adds that waiting for more favorable research conditions, such as adequate funding, time, and suitable equipment, might be a better option than conducting studies with small sample sizes and drawing unfounded conclusions. Another approach is to proceed with the research but exercise caution in reporting and interpreting the results. This situation prompts a need to question the fundamental principles of scientific research and systemic influences, such as funding and the pressure to publish.
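A rough power calculation can be sketched with the Python standard library alone. The function below is a simplified normal approximation for comparing the means of two equally sized groups, assuming a standardized effect size (Cohen's d) and a two-sided test; it is not the procedure prescribed by the cited authors, and an exact calculation would use the noncentral t distribution.

```python
import math
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Normal approximation; effect_size is the standardized difference (Cohen's d).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability that the test statistic lands in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


for n in (10, 20, 64, 100):
    print(f"n = {n:3d} per group: power = {approx_power(0.5, n):.2f}")
```

For a medium effect (d = 0.5), ten units per group detect the effect in only about one experiment in five under this approximation, whereas roughly 64 units per group are needed to reach the conventional 80% power.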
6. Circular analysis
Makin and Orban de Xivry define circular analysis as the retrospective identification of certain characteristics of the data as dependent variables, thereby distorting the results of a statistical test. One of the more common forms of circular analysis is dividing the data into groups after the end of the experiment, when the differences in the data are already visible, and removing part of the data (e.g., outliers). An example would be the observation of a group of sample units that, before and after exposure to the test factor, as a whole shows no change in the measured variable. However, after the experiment, it is observed that the value of the variable decreased in the case of some units, and increased in the case of others, so the test group is deliberately divided into two subgroups according to the change in the value of the measured variable. The results are then shown either as a consequence of the original design of the experiment or as a correlation with some third variable that separates the groups from each other and shows a strong statistically significant effect. In this way, the statistical significance is actually derived from the noise.
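The scenario described above can be simulated in a few lines. In this sketch all values are pure noise (hypothetical measurements drawn from the same distribution before and after the "intervention"), so any apparent effect is an artifact of splitting the units by the very outcome that is then tested.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(1)
n_units = 40

# No real effect: before and after come from the same distribution.
before = [rng.gauss(50, 5) for _ in range(n_units)]
after = [rng.gauss(50, 5) for _ in range(n_units)]
change = [a - b for a, b in zip(after, before)]

# Circular step: subgroups are defined by the measured outcome itself.
increased = [c for c in change if c > 0]
decreased = [c for c in change if c <= 0]


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


overall_change = mean(change)
t_subgroups = welch_t(increased, decreased)

print(f"mean change, all units:   {overall_change:.2f}")
print(f"mean change, 'increased': {mean(increased):.2f}")
print(f"mean change, 'decreased': {mean(decreased):.2f}")
print(f"t between subgroups:      {t_subgroups:.1f}")
```

The subgroups differ strongly by construction, so a large test statistic here says nothing about the intervention.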
7. Flexibility of analysis: p-hacking
The desired p-value (typically p < 0.05, i.e., less than 5% probability of obtaining a result at least this extreme if the null hypothesis is true) is very easy to achieve, for example by adjusting statistical analyses, eliminating outliers, dividing test units into subgroups, or adding new test units from repeated experiments. The p-value is itself a random variable and fluctuates with the fluctuations in the data. Raghuveer Parthasarathy recommends some in-depth resources on this topic:
- Kerr, 1998: HARKing: Hypothesizing After the Results are Known
- Simmons et al., 2011: False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant
- Gelman and Loken, 2013: The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time
- Head et al., 2015: The Extent and Consequences of P-Hacking in Science
As a possible solution, Makin and Orban de Xivry suggest using standardized analytical procedures and pre-registration of the experimental design and analyses. Parthasarathy adds to this the calculation of the probability that the analysis shows an effect even though the effect size is zero, or simply the use of other statistical methods (or scientific methods in general) that do not require calculating p-values.
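The probability that an analysis shows an effect although the true effect is zero can be estimated by simulation. The sketch below models one specific form of flexibility, namely adding test units and re-testing until p < 0.05 (often called optional stopping). All parameters are hypothetical, and a z-test with a known standard deviation of 1 is assumed for simplicity.

```python
import math
import random
from statistics import NormalDist

NORM = NormalDist()


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * (1 - NORM.cdf(abs(z)))


def simulate(n_sims=2000, start=10, step=5, max_n=50, alpha=0.05, seed=0):
    """Estimate false-positive rates when the true effect is exactly zero."""
    rng = random.Random(seed)
    honest_hits = 0
    hacked_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(max_n)]
        # Honest analysis: a single test at the planned sample size.
        if z_test_p(data) < alpha:
            honest_hits += 1
        # Flexible analysis: test after every batch and stop at the first "success".
        if any(z_test_p(data[:n]) < alpha for n in range(start, max_n + 1, step)):
            hacked_hits += 1
    return honest_hits / n_sims, hacked_hits / n_sims


honest_rate, hacked_rate = simulate()
print(f"single planned test:  {honest_rate:.3f}")
print(f"test after each step: {hacked_rate:.3f}")
```

The single planned test keeps the false-positive rate near the nominal 5%, while repeated testing on growing data raises it well above that level. Pre-registering the sample size removes this degree of freedom.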
8. Not applying corrections for multiple comparisons
Exploratory research is often aimed at checking the effects of different test conditions or factors on different variables (e.g., the influence of different combinations of light, water, and nutrients on the growth and flowering of plants). If the connections between test conditions and variables are tested using frequentist statistical methods through pairwise comparisons (as if each time there was a separate experiment and the conditions are independent of each other), we increase the probability of a type I error (false-positive result). For example, in a study with a 2 × 3 × 3 experimental design, the probability of obtaining at least one statistically significant main or interaction effect is as high as 30%, even if the effect size is zero. Perhaps the most illustrative example of this error was demonstrated by the famous MRI detection of brain activity in a dead salmon, which earned the 2012 Ig Nobel Prize.
Raghuveer Parthasarathy illustrates this principle with a simple example: the probability of getting a 6 when rolling a die is 1/6 or 17%; the probability of getting a 6 at least once when rolling 6 dice simultaneously is significantly higher, i.e., 67%. An even greater problem arises if any of the tested variables are correlated. Parthasarathy also points out that the problem is related to p-hacking.
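Both figures follow from the same formula: if each of m independent tests has a false-positive probability of α, the probability of at least one false positive is 1 − (1 − α)^m. The sketch below assumes independent tests. It also assumes that the 30% figure for a 2 × 3 × 3 design refers to the seven tests of a three-way analysis (three main effects, three two-way interactions, and one three-way interaction), which is an interpretation rather than something stated in the text.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Seven tests at alpha = 0.05 (assumed reading of the 2 x 3 x 3 design).
fwer_design = familywise_error_rate(0.05, 7)

# Dice analogy: at least one 6 when rolling six dice.
p_any_six = familywise_error_rate(1 / 6, 6)

print(f"at least one false positive in 7 tests: {fwer_design:.2f}")
print(f"at least one six with 6 dice:           {p_any_six:.2f}")
```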
Makin, Orban de Xivry, and Parthasarathy agree that the only solution to this problem is to use corrections for multiple comparisons.
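Two widely used corrections, Bonferroni and Holm, are sketched below as an illustration. The text above does not name specific methods, so the choice of these two and the p-values in the example are assumptions. Both functions return, for each p-value, whether the corresponding null hypothesis is rejected.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: compare every p-value with alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure; never less powerful than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject


p_values = [0.005, 0.015, 0.03, 0.04]   # hypothetical results of four tests
uncorrected = [p < 0.05 for p in p_values]

print("uncorrected:", uncorrected)
print("Bonferroni: ", bonferroni(p_values))
print("Holm:       ", holm(p_values))
```

Without correction all four hypothetical results would be reported as significant; after correction only one (Bonferroni) or two (Holm) remain.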
9. Over-interpreting non-significant results
In frequentist statistics, the arbitrarily determined value p < 0.05 is usually used as the threshold of "statistical significance". A common conclusion that researchers draw from a p > 0.05 result is that there is no effect, although the obtained result may simply be a consequence of the sample size or inappropriate experimental design. A value of p > 0.05 may be the consequence of the actual absence of an effect (true negative result), low statistical power of the sample (see point 5) or an effect that is ambiguous or insufficient due to its low intensity. Raghuveer Parthasarathy provides a good visualization of this error. He also points out that the opposite situation, when researchers wrongly conclude that an effect is present based on a result of p < 0.05, is also very common. Makin and Orban de Xivry suggest that the effect size should always be reported along with p-values, and they also mention alternative approaches, such as Bayesian statistics.
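One common effect-size measure for the difference between two group means is Cohen's d, the difference in means divided by the pooled standard deviation. The sketch below uses it as an example of what could be reported next to a p-value; the choice of Cohen's d and the sample values are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d: difference of means in units of the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled_var)


# Hypothetical measurements.
control = [1, 2, 3, 4, 5]
treated = [2, 3, 4, 5, 6]

d = cohens_d(treated, control)
print(f"Cohen's d = {d:.2f}")
```

With only five units per group, an effect of this size would usually not reach p < 0.05, which is exactly the situation in which "no significant difference" must not be read as "no effect".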
10. Mistaking correlation for causation
This is one of the most common errors in the interpretation of statistical tests. If there is a correlation between two variables, it is easiest to conclude that the variables are also causally related. In reality, a correlation can occur due to an actual cause-and-effect relationship (direct or reverse), some common third factor affecting both variables (i.e., a confounding factor/variable) or coincidence. This phenomenon, which is actually more of an error in logical reasoning than a statistical error, is vividly illustrated by Tyler Vigen's Spurious Correlations blog. Raghuveer Parthasarathy warns that it is much more difficult to prove a true cause-and-effect relationship than correlation, and recommends Causal Inference: What If by Miguel A. Hernán and James M. Robins for more information on this topic. A preprint of the book is available on the authors' website.
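The effect of a confounding variable can be demonstrated with a small simulation. In the sketch below, a hypothetical variable z drives both x and y, which have no direct influence on each other. The partial correlation, which removes the linear contribution of z, is used as a simple diagnostic; it is one possible check and not a full causal analysis.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


rng = random.Random(7)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]       # confounder
x = [2 * zi + rng.gauss(0, 1) for zi in z]    # depends on z only
y = [2 * zi + rng.gauss(0, 1) for zi in z]    # depends on z only

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)

print(f"correlation of x and y:         {r_xy:.2f}")
print(f"partial correlation (z fixed):  {r_xy_given_z:.2f}")
```

The raw correlation between x and y is strong although neither variable influences the other; once z is accounted for, the association largely disappears.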
What about the raw numerical data? By promoting the deposit of data, including unprocessed ones, in dedicated repositories, we can expect to occasionally encounter cases of falsification and beautification described above when using data from other research groups. Several statistical approaches can be used to verify the credibility of numerical data, for example:
- testing the distribution of leading digits according to the Newcomb-Benford law,
- testing multivariate associations between experimental variables,
- the GRIM (Granularity-Related Inconsistency of Means) test, which its co-author James Heathers has also described in layman's terms on his blog,
- the SPRITE (Sample Parameter Reconstruction via Iterative TEchniques) test,
- complex mathematical algorithms, etc.
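The idea behind the GRIM test from the list above can be sketched briefly. When the underlying data are integers (for example, answers on a Likert scale), the sum of n values is an integer, so only certain means are arithmetically possible for a given sample size. The function below is a simplified illustration, not the authors' reference implementation; its name and the example values are assumptions, and it ignores differences between rounding conventions.

```python
import math


def grim_consistent(reported_mean, n, decimals=2):
    """Check whether a reported mean of n integer values is arithmetically possible."""
    target = round(reported_mean, decimals)
    approx_sum = reported_mean * n
    # The true sum must be one of the integers next to reported_mean * n.
    for total in (math.floor(approx_sum), math.ceil(approx_sum)):
        if round(total / n, decimals) == target:
            return True
    return False


# Hypothetical reported results: mean and sample size.
for reported_mean, n in [(5.18, 28), (5.19, 28), (3.50, 10)]:
    verdict = "consistent" if grim_consistent(reported_mean, n) else "INCONSISTENT"
    print(f"mean = {reported_mean:.2f}, n = {n}: {verdict}")
```

An inconsistent mean does not prove fraud; it may result from a typo, an unreported exclusion, or non-integer data, but it is a reason to ask for the raw data.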
Of the methods described, the oldest and simplest one is based on the Newcomb-Benford law, also popularly called the law of first digits. The Newcomb-Benford law describes the relative frequency of leading digits in many naturally occurring collections of numbers, especially those spanning several orders of magnitude, with lower-value digits occurring more often than high-value digits. About 30% of such numbers start with a 1 and less than 5% with a 9, which has been confirmed on many different types of data, from stock values, sports statistics, and population parameters (e.g., mortality) to various types of financial data. When fabricating or falsifying numbers, people usually do not pay attention to this pattern, especially if they try to fit them to some predetermined values. The Newcomb-Benford law is routinely used in financial forensics.
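According to the law, the expected share of numbers with leading digit d is log10(1 + 1/d). A minimal sketch of a check follows; it compares observed leading digits with these expected shares using Pearson's chi-square statistic. The function names and the two demonstration data sets are assumptions made for illustration, and the critical value 15.507 is the standard 5% threshold of the chi-square distribution with 8 degrees of freedom.

```python
import math
from collections import Counter


def leading_digit(value):
    """First significant digit of a non-zero number (via scientific notation)."""
    return int(f"{abs(value):e}"[0])


def benford_expected():
    """Expected relative frequency of each leading digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_chi_square(values):
    """Pearson's chi-square statistic of observed vs. expected leading digits."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in benford_expected().items())


CRITICAL_CHI2_DF8 = 15.507  # 5% significance level, 8 degrees of freedom

# Powers of two are known to follow the law closely.
powers_of_two = [2 ** k for k in range(100)]
# The integers 100..999 have uniformly distributed leading digits.
uniform_digits = list(range(100, 1000))

chi2_powers = benford_chi_square(powers_of_two)
chi2_uniform = benford_chi_square(uniform_digits)

print(f"expected share of leading 1: {benford_expected()[1]:.3f}")
print(f"expected share of leading 9: {benford_expected()[9]:.3f}")
print(f"powers of two:  chi2 = {chi2_powers:.1f}")
print(f"uniform digits: chi2 = {chi2_uniform:.1f}")
```

A statistic above the critical value only signals a deviation from the expected pattern. Many legitimate data sets, such as measurements confined to a narrow range, do not follow the law, so the result should prompt a closer look rather than serve as proof of manipulation.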
One practical way to reduce numerical data fraud is the introduction of electronic lab notebooks to facilitate the sharing and verification of research results.
Several specialized investigators are actively engaged in detecting image fraud in scientific publications. Elisabeth Bik, who primarily focuses on biomedical publications, is perhaps the most renowned among them. She can detect many fraudulent manipulations with the naked eye, but at times, she also employs dedicated computer programs to assist in the process. Some of the most common manipulations she identifies include duplications, shifts, rotations, cloning, and mirroring of entire images or their parts. Bik frequently shares her discoveries on her website and on the platform for commenting on scientific publications, PubPeer. Additionally, she often delivers public lectures, in which she explains the key aspects of image fraud to be vigilant about.
However, with the widespread adoption of generative artificial intelligence tools, Elisabeth Bik has also observed the emergence of unique images that are computer-generated and cannot be easily detected by software tools designed to identify duplicate images. In 2022, Gu and colleagues published an article titled AI-enabled image fraud in scientific publications, where they practically demonstrated that computer methods for detecting computer-generated images are currently no more effective than the discerning eye of a trained human. Both approaches, whether relying on software or human scrutiny, exhibit relatively limited reliability in detecting image fraud, regardless of the method used (editing genuine images, computer-generated original images, or computer-regenerated images based on a single genuine image) or the type of image (photograph, scanning electron micrograph, immunohistochemical samples, immunologically stained cell cultures, histopathological samples).
The tools listed on the website of the Humboldt-Elsevier Advanced Data and Text Centre (HEADT Centre) can help you detect fraud with pictorial material that was created before the mass adoption of generative artificial intelligence tools. The HEADT Centre is a research unit of the Humboldt University in Berlin dedicated to the research integrity of all types and formats of text and images and receives part of its funding from the scientific publisher Elsevier.
Last update: 27 July 2023