The road to meaningful AI-driven data curation


Artificial intelligence (AI) is fueling innovation throughout the healthcare industry. Whether healthcare facilities are using AI to detect diseases earlier, assist in clinical decision-making, or identify patients suited for clinical trials, AI’s impact is becoming increasingly evident.

However, some often-overlooked limitations highlight the importance of proceeding thoughtfully when exploring the use of AI in healthcare. Models that are improperly trained, fed low-quality data, or given tasks that exceed AI’s capabilities can lead to incorrect diagnoses and potentially harmful clinical decisions. During the COVID-19 pandemic, for example, errors in training or testing AI tools resulted in models that functioned improperly. In one instance, a model picked up on the fonts hospitals used to label scans and falsely treated them as predictors of COVID-19 risk. Diagnostic AI has also revealed biases, such as less accurate skin cancer diagnoses for patients with darker skin due to a lack of diversity in training datasets. Further, an evaluation of a sepsis prediction model found that the model failed to identify two-thirds of patients with sepsis, underscoring the dangers of relying on AI alone.

While AI holds massive potential to transform patient care, the industry is still in the early stages of exploring this technology. AI models cannot work effectively without high-quality clinical data. To develop accurate AI models, we must get the data right first.

This article discusses the value AI-driven data curation offers in hospital settings and clinical research. It explores common pitfalls associated with developing AI models for data curation, along with the factors necessary for successful AI implementation. Lastly, it shares why Q-Centrix is uniquely positioned to explore the use of AI to curate clinical data.

The potential of AI in healthcare

AI can support the processing of massive amounts of valuable yet unstructured information for a range of different purposes in healthcare, research, and beyond.

Unlock the value of clinical data

AI is poised to help hospitals derive greater insights from the vast amounts of data they have. Patients generate an average of 50 million gigabytes of data every year, and 97 percent of the clinical data hospitals possess go unused. The vast majority of these data are unstructured, often taking the form of doctors’ notes, image scans, and other formats that require interpretation. Making sense of this volume of data is an impossible undertaking for any person or team, but an algorithm can be trained to work through massive datasets quickly and extract valuable insights.

Improve clinical research

AI also has the potential to make a significant impact in addressing clinical research challenges. Currently, nine in 10 drugs that reach the clinical trial stage fail to receive FDA approval due to challenges in the clinical trial process. Insufficient patient enrollment remains one of the biggest hurdles in clinical trials—and it’s the reason why 20 percent of cancer clinical trials fail. Finding eligible patients for a study often involves combing through electronic medical records (EMRs) and other information systems not built for clinical research purposes, which is very time-consuming.

Even research teams that manage to identify and enroll enough patients for a clinical trial may find that their sample is not representative of the general population. Many racial and ethnic groups are underrepresented in clinical research, underscoring the need to improve diversity among clinical trial patients. AI-powered tools that sift through data dispersed across various information systems and find patients who meet trial criteria may help research teams conduct clinical research more efficiently. This can both greatly reduce the time and costs associated with patient recruitment and help increase diversity in clinical trials.

Support observational studies

In addition to improving clinical research processes, AI can support research that relies on existing patient data, such as observational studies. These studies can be conducted using the real-world data hospitals and health systems already have (such as data from electronic medical records, billing and claims data, and other sources of information).

Although unstructured data and inconsistent data preparation practices are common challenges associated with observational studies, using AI-enabled techniques to curate these data allows facilities to overcome these barriers and produce custom, high-quality, research-ready datasets. These datasets can be used for facilities’ internal research purposes or for funded opportunities in which healthcare facilities contribute data to retrospective studies for sponsors in the pharmaceutical and life sciences industries. As observational studies are less expensive to conduct than clinical trials and can be completed much more quickly, AI-driven data curation offers researchers a valuable, cost-effective, and efficient way to gather findings and advance medical research.

Considerations in using AI for data curation

While AI excels at repetitive, straightforward tasks, such as scheduling appointments, using AI for data curation is a much more complex undertaking. Clinical terminology is highly nuanced and requires specialized expertise to be interpreted properly, and it can be challenging to provide an AI model with the scale and scope of diverse, accurate data it needs in order to be effective.

Common pitfalls in AI model development

Developing and training an AI model is not an easy feat. An improperly trained model is likely to have deep flaws due to biases or shortcomings in the data used to train it. This, in turn, can have untold effects on the model’s performance, accuracy, and utility in a clinical setting.

Some factors that can hinder an AI model’s ability to learn and function properly include:

  • Poor data quality. The standards for clinical data quality are very high, and high-quality data are essential for training and refining AI models to ensure their accuracy and reliability. Many failures in AI tools have been linked to the poor quality of data researchers have used to develop these tools.

  • Outdated data. New drugs and treatment pathways are developed every year, changing how medicine is practiced. Models trained on data from even a year ago would miss crucial insights from the rapid pace of innovation.

  • Documentation practices. Documentation practices vary from physician to physician and, on a larger scale, from facility to facility. A model trained on one physician’s or one facility’s data may not be applicable elsewhere.

  • Distinct EMR setups. Drastically different data capture practices occur not just across hospitals that use different EMRs, but even among hospitals that use the same EMR. Customized configurations, differing data entry protocols, or unique workflow integrations greatly alter how patient data are entered and stored.

  • Differences across care settings. The setting of care, whether inpatient, outpatient, academic, or community-based, introduces additional layers of variability. Each care setting may have specific requirements, workflows, and priorities that dictate data capture practices.

Organizations must proceed cautiously when developing AI models for data curation. When AI models lack the rigorous training needed for safe and effective data curation, any number of repercussions can result. For example, training AI models on low-quality data can create data cascades, in which compounding events cause negative downstream effects, leading to unreliable results or harmful outcomes.

Elements for successful AI implementation

Automation alone is not enough to ensure accurate, efficient data curation. Automation is one component on a continuum of actions that drive efficiency incrementally. To that end, AI-powered clinical data curation hinges on aligning automation with the right combination of actions, processes, and expertise.

Key elements necessary for successfully developing and training an AI model for data curation include:

Access to large volumes of data.

Many AI models are trained on 10,000 or fewer examples, but significantly more examples are needed for training to be effective.

Ability to normalize large amounts of disparate data.

As clinical data are largely unstructured and are stored across multiple information systems, these data must first be transformed into a consistent and understandable format before AI systems can process, analyze, and extract insights from them.
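As a rough illustration of what this normalization step involves, the sketch below maps records from two differently configured sites into one common layout. Every field name, both sample records, and the target schema are hypothetical and chosen only for illustration; real EMR exports are far messier and typically require clinical review of each mapping.

```python
from datetime import date, datetime

# Hypothetical raw exports from two facilities whose EMRs are configured differently.
SITE_A = {"pt_id": "A-001", "dob": "1961-03-07", "sex": "F", "ef_pct": "55"}
SITE_B = {"PatientID": "B-17", "BirthDate": "03/07/1961", "Gender": "female", "LVEF": 0.55}

# Assumed mapping from each site's free-form sex values to a single code.
SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def parse_date(text: str) -> date:
    """Try each date layout the (assumed) sites are known to use."""
    for layout in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, layout).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def normalize_site_a(raw: dict) -> dict:
    """Map a Site A record onto the common (hypothetical) schema."""
    return {
        "patient_id": raw["pt_id"],
        "birth_date": parse_date(raw["dob"]),
        "sex": SEX_CODES[raw["sex"].lower()],
        "ejection_fraction_pct": float(raw["ef_pct"]),
    }


def normalize_site_b(raw: dict) -> dict:
    """Map a Site B record onto the same schema."""
    return {
        "patient_id": raw["PatientID"],
        "birth_date": parse_date(raw["BirthDate"]),
        "sex": SEX_CODES[raw["Gender"].lower()],
        # Site B stores the value as a fraction, so convert it to a percentage.
        "ejection_fraction_pct": round(float(raw["LVEF"]) * 100, 1),
    }


records = [normalize_site_a(SITE_A), normalize_site_b(SITE_B)]
for record in records:
    print(record)
```

Once both records share field names, date types, and units, downstream tools can compare them directly. A separate mapping function per source keeps each site's quirks isolated and easy to review.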

Consistently training and refining the AI model for accuracy.

With clinical data standards requiring a very high level of accuracy, AI models must be refined enough to meet these requirements—and no technology can yet solve the challenge of accurately processing unstructured data without human assistance. Software and data engineers must work in tandem with clinical experts to ensure the accuracy of the model’s output.

Thoughtfully integrating AI technology into the workflow and user dynamics.

This requires a substantial investment in software development, particularly in enhancing user interfaces and user experience capabilities to enable healthcare professionals to easily draw insights from their data through dashboards and user-friendly tools.

Consistent quality validation of the model to ensure continued accuracy.

Regular testing and quality checks ensure the AI model continues to meet its intended purpose while complying with evolving healthcare standards. Without a robust system in place for consistent quality validation, organizations may struggle to detect and rectify inaccuracies.
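One simple form such a quality check could take is comparing a model's output against expert-reviewed answers on a sample of cases and flagging any field whose agreement falls below a target. The sketch below assumes a 95 percent target, two made-up fields, and four made-up cases; none of these come from an actual registry or from Q-Centrix's process.

```python
# Hypothetical target; no specific accuracy figure is given in this article.
MIN_AGREEMENT = 0.95

# Made-up expert-abstracted answers for four cases.
expert_rows = [
    {"diabetes": True, "prior_pci": False},
    {"diabetes": False, "prior_pci": True},
    {"diabetes": True, "prior_pci": True},
    {"diabetes": False, "prior_pci": False},
]

# Made-up model output for the same four cases.
model_rows = [
    {"diabetes": True, "prior_pci": False},
    {"diabetes": False, "prior_pci": False},  # model missed a prior PCI
    {"diabetes": True, "prior_pci": True},
    {"diabetes": False, "prior_pci": True},   # model reported a PCI that did not occur
]


def field_agreement(model_rows, expert_rows):
    """Return, per field, the share of cases where the model matches the expert."""
    totals, matches = {}, {}
    for model, expert in zip(model_rows, expert_rows):
        for field, expected in expert.items():
            totals[field] = totals.get(field, 0) + 1
            if model.get(field) == expected:
                matches[field] = matches.get(field, 0) + 1
    return {field: matches.get(field, 0) / totals[field] for field in totals}


def fields_needing_review(agreement, threshold):
    """List the fields whose agreement rate falls below the threshold."""
    return sorted(field for field, rate in agreement.items() if rate < threshold)


agreement = field_agreement(model_rows, expert_rows)
flagged = fields_needing_review(agreement, MIN_AGREEMENT)
print(agreement)
print("Fields needing review:", flagged)
```

Running a check like this on a recurring sample, rather than once at launch, is what lets a team notice when accuracy drifts as documentation practices or standards change.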


Currently, using an AI model such as GPT-4 to abstract clinical data at an acceptably high level of accuracy for a registry such as CathPCI would cost approximately three times more than a traditional data abstraction approach that involves a combination of AI and human expertise. Automation technology has not yet advanced enough to replicate the combined capabilities of clinical data experts and technology in a cost-effective way.

Achieving success on all of these fronts is not easy to do—but Q-Centrix is well-positioned to take on this challenge. With the data access, resources, experience, and scale necessary to succeed, Q-Centrix will continue to advance its data automation efforts to further optimize AI-enabled data curation.

Q-Centrix’s unique role in leading AI-driven data curation

While the barriers to entry for building a data model may seem low, the barriers to achieving success in this endeavor are substantial, and the risks of failure can have harmful impacts on patient care. To pursue AI-powered data curation safely, healthcare facilities should partner with an established organization with a proven track record, longevity, and a trusted reputation.

Q-Centrix is uniquely positioned to explore the use of AI for data curation, for several reasons.

Combining technology with clinical experts. Experts in the technology sector don’t often have the medical knowledge necessary to interpret complex clinical datasets—a disconnect that ultimately led to the failure of many AI tools created during the pandemic. Because Q-Centrix relies on a combination of proprietary AI-powered software and clinical experts to curate data and perform quality checks, its approach bridges the gap between medical knowledge and technological expertise. Q-Centrix’s more than 1,300 clinical data experts have strong backgrounds in healthcare and abstract millions of cases each year.

Deep understanding of nuanced clinical terms. Clinical concepts can have different definitions depending on the context. For example, some registries define a family history of heart disease as having an immediate family member dying of a heart attack or stroke before age 60 in women and before age 55 in men. Off-the-shelf AI models may react to any mention of a family heart condition—regardless of its severity, the direct relationship, or age. This highlights the need for nuanced clinical context in data curation, which only clinical experts can provide.
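The gap between the two approaches in the family history example can be sketched in a few lines. The code below contrasts a naive keyword match with a check against the definition quoted above. It assumes the age cutoff applies to the sex of the relative, that "immediate family" means parents, siblings, and children, and that the relevant facts have already been extracted from the note into structured form; all names and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumptions for illustration only: which relatives count as immediate family,
# and that the age cutoff is applied to the sex of the relative.
IMMEDIATE_RELATIVES = {"mother", "father", "sister", "brother", "daughter", "son"}
QUALIFYING_EVENTS = {"heart attack", "stroke"}
AGE_CUTOFF_BY_SEX = {"F": 60, "M": 55}


@dataclass
class FamilyEvent:
    relative: str                # e.g. "father"
    relative_sex: str            # "F" or "M"
    event: str                   # e.g. "heart attack"
    age_at_death: Optional[int]  # None if the relative did not die of the event


def naive_match(note: str) -> bool:
    """Flag any note that mentions the word 'heart', as a crude model might."""
    return "heart" in note.lower()


def meets_definition(item: FamilyEvent) -> bool:
    """Apply the registry-style definition quoted in the text."""
    return (
        item.relative in IMMEDIATE_RELATIVES
        and item.event in QUALIFYING_EVENTS
        and item.age_at_death is not None
        and item.age_at_death < AGE_CUTOFF_BY_SEX[item.relative_sex]
    )


note = "Grandfather had a heart murmur; father died of a heart attack at 72."
events = [
    FamilyEvent("grandfather", "M", "heart murmur", None),
    FamilyEvent("father", "M", "heart attack", 72),
]

print("Keyword match:", naive_match(note))
print("Meets definition:", any(meets_definition(e) for e in events))
```

Here the keyword match flags the note, while neither event satisfies the definition: the grandfather is not an immediate relative, and the father's death occurred after the cutoff age. Turning the free-text note into the structured events is the hard part, and it is where clinical expertise is needed.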

Streamlined processes that prioritize data integrity. Q-Centrix is committed to maintaining the highest data integrity standards in the industry. Q-Centrix implements a series of quality checks throughout the data lifecycle, spending approximately 10,000 hours per month conducting quality-related checks on data.

User-friendly software. Q-Centrix’s offerings go beyond data curation to ensure that healthcare facilities have the tools they need to engage meaningfully with their data. Q-Centrix’s market-leading clinical data management software provides a comprehensive suite of analytics and reporting tools, empowering clinical and quality leaders to uncover valuable insights that drive clinical decision-making and quality improvements.

Experience. Q-Centrix has over a decade of experience managing clinical data for more than 1,200 hospital partners, making its AI-driven technology extremely well-trained in reviewing data to ensure data integrity.


Hospitals need to trust their data. When clinical data are the cornerstone of patient care, groundbreaking research, quality improvement, and so much more, ensuring the integrity of these data is paramount.

For AI-driven data curation to be meaningful and effective, it must be capable of maintaining the highest data quality standards—which today’s technologies can’t yet do alone. Due to the complexities of clinical data curation—and the risks inherent in low-quality data—a combination of clinical data experts, software, and optimized processes must be used alongside AI technologies to curate high-quality data effectively.

Q-Centrix is well positioned to lead in AI-driven data curation given our strong commitment to data quality, our experience curating clinical data for more than 1,200 hospital partners, our 1,300+ clinical data experts, and our investments in technology. Through our efforts, hospitals and health systems can improve data integrity and derive deeper meaning from their data. Moreover, life sciences organizations and research institutions can gain valuable assistance in identifying study patients and overcoming common research roadblocks.

As we move forward, we are excited to advance our use of AI technology while recognizing that, like all new technologies, it will require time, investment, and deliberate effort to ensure high data standards and continued progress in the healthcare industry.