Research Computational Science
Becoming a leading destination for computational and data science research in pediatrics.

Computational and Data Science Research at CHOC

Research Computational Science is one of the key service cores in CHOC’s Research Institute.

We are computational scientists, data scientists, and biostatisticians who find fulfillment in improving the health of children in our communities and across the United States.

We hold degrees across all academic levels and collaborate with academic and medical institutions across the US and abroad.

Our History

Terence Sanger, MD, PhD, and Phuong Dao, JD, founded the Research Computational Science program and developed it alongside Louis Ehwerhemuepha, PhD, in August 2021.

Dr. Ehwerhemuepha had led development of a CHOC Data Science program between June 2015 and July 2021. He supported Dr. Bill Feaster and Dr. Anthony Chang during that period and continues doing so with expanded resources from the Research Institute.

The RCS team is key to the overall Data Science initiatives at CHOC.

Our areas of work

We retrieve, analyze, and interpret data to provide doctors, nurses and administrators with information that improves decision-making and clinical outcomes.

We build computational and data science models to predict undesirable clinical events before they occur, before providers are certain they will occur, and with sufficient time for intervention.

We conduct world-class research and publish findings in top-tier journals in medicine, statistics and data science in general.

We assist other CHOC research groups and individuals and support graduate students from local universities with data and computational tools for analyses of all modalities of data at CHOC.


What happened? Why did it happen? How can we prevent it from happening again if undesirable?


We are supporting multiple providers within CHOC as well as collaborations with researchers across other institutions in the US including researchers from the CDC. Themes of research questions we address include:
  • Predisposition to severe disease
  • Prediction of post-acute sequelae of SARS-CoV-2 (PASC), also known as long COVID
  • Exacerbation of preexisting conditions post COVID-19
  • Prediction of organ dysfunction/failure
  • Multisystem inflammatory syndrome in children (MIS-C)
  • Special focus on selected at-risk populations (Asthma, Cardiology, Cystic Fibrosis and Sickle Cell Disease)

Adverse Childhood Experience

Supporting the CHOC-UCI research initiative in Public Health led by multiple specialties at CHOC and faculty from the UCI School of Public Health. The goal is to examine adverse childhood experiences (ACEs) in Southern California and risk factors of exacerbation as measured by healthcare utilization. The cohort includes patients assessed for ACEs in the Emergency Department and Primary Care Clinics at CHOC.

Epilepsy (predictors of intractable epilepsy)

To predict patients who will develop intractable epilepsy three months before its development. Knowing which patients are likely to develop intractable epilepsy will allow physicians and patients to develop an appropriate care plan. The model will be fitted using CHOC patient data (demographic, medical history and clinical notes).

Rare diseases

Juvenile Dermatomyositis: Juvenile Dermatomyositis (JDM) is a rare multifaceted autoimmune disease that usually presents with a characteristic rash and symmetrical proximal muscle weakness and may impact nailfold capillary end row loops (ERL). In this series of studies, we are identifying serological markers of disease activity as well as developing models for predicting disease course after initial clinic consultation.

Clinical nutrition

Dietary diversity: Supporting the Clinical Nutrition and Lactation team to assess dietary intake, oral supplement dependence, and dietary diversity among CHOC patients with pediatric feeding disorders. The findings from this study will be used to improve dietary recommendations for patients with feeding disorders and reduce their frequency of nutrient inadequacies. CHOC patient dietary and demographic data were modeled using quasi-Poisson regression to investigate how these characteristics affect nutrient inadequacies.
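
The quasi-Poisson approach mentioned above can be illustrated with a minimal sketch. It is not the team's analysis code: the data, the single predictor (a dietary diversity score), and all function and variable names are hypothetical. The sketch fits a log-link count model by iteratively reweighted least squares, then estimates a dispersion factor from the Pearson residuals. The coefficients match ordinary Poisson regression; only the standard errors are inflated by the dispersion.

```python
import math


def fit_quasi_poisson(x, y, iterations=50):
    """Fit log(E[y]) = b0 + b1 * x and return (b0, b1, dispersion, se_b1, mu)."""
    n = len(y)
    b0, b1 = math.log(sum(y) / n), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Working response for the log link; the IRLS weights equal mu.
        z = [b0 + b1 * xi + (yi - mi) / mi for xi, yi, mi in zip(x, y, mu)]
        sw = sum(mu)
        swx = sum(mi * xi for mi, xi in zip(mu, x))
        swxx = sum(mi * xi * xi for mi, xi in zip(mu, x))
        swz = sum(mi * zi for mi, zi in zip(mu, z))
        swxz = sum(mi * xi * zi for mi, xi, zi in zip(mu, x, z))
        det = sw * swxx - swx ** 2
        # Solve the 2x2 weighted least-squares normal equations.
        b0, b1 = (swxx * swz - swx * swxz) / det, (sw * swxz - swx * swz) / det

    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Dispersion: Pearson chi-square divided by residual degrees of freedom.
    dispersion = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu)) / (n - 2)
    sw = sum(mu)
    swx = sum(mi * xi for mi, xi in zip(mu, x))
    swxx = sum(mi * xi * xi for mi, xi in zip(mu, x))
    det = sw * swxx - swx ** 2
    # Quasi-Poisson standard error: Poisson variance scaled by the dispersion.
    se_b1 = math.sqrt(dispersion * sw / det)
    return b0, b1, dispersion, se_b1, mu


# Hypothetical data: diversity score and count of nutrient inadequacies.
diversity = [1, 2, 2, 3, 4, 4, 5, 6, 7, 8]
inadequacies = [9, 7, 8, 6, 6, 3, 4, 2, 3, 1]

intercept, slope, dispersion, slope_se, fitted = fit_quasi_poisson(
    diversity, inadequacies
)
print(f"slope={slope:.3f} se={slope_se:.3f} dispersion={dispersion:.3f}")
```

A dispersion estimate well above 1 would indicate overdispersion, which is the usual reason for preferring quasi-Poisson over plain Poisson standard errors.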

Healthcare utilization

Mental health rehospitalization: Developing models that use structured data to predict the risk of rehospitalization and drive outpatient mental health interventions.

Rising risk model: Predicting which patients are likely to become a rising risk for increased resource utilization in the following year. The results will assist clinicians in identifying which patients will need additional resources within the next year. The data will consist of demographics, historical diagnoses and mental health characteristics of CHOC patients.

ED return visits: Assessing novel improvements to the difficult task of predicting ED return visits, to optimize utilization and reduce waste.

Thyroid diseases and evaluation scales (individualized normal thresholds for TSH, T3 and T4)

Determining whether intra-subject variation of thyroid laboratory test levels is significantly smaller than the inter-subject variation. Findings from this study may be used to justify patient-specific ranges of appropriate laboratory test levels instead of population-wide ranges. This will be investigated through an analysis of variance components in a linear mixed model using parametric bootstrap.
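
The comparison of intra-subject and inter-subject variation can be sketched as follows. This is a simplified stand-in, not the study's method: it uses method-of-moments (ANOVA) estimators for a balanced random-intercept model rather than a fitted linear mixed model, and the TSH values, function names and bootstrap size are all hypothetical. The parametric bootstrap simulates new datasets from the estimated variance components and re-estimates the share of variance attributable to differences between patients.

```python
import random
import statistics


def variance_components(groups):
    """Return (within_var, between_var) for equal-sized groups of measurements."""
    k, n = len(groups), len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    msw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / (k * (n - 1))
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    # The between-subject estimate is truncated at zero, as is conventional.
    return msw, max((msb - msw) / n, 0.0)


def simulate(mu, within_var, between_var, k, n, rng):
    """Draw one dataset from the fitted random-intercept model."""
    groups = []
    for _ in range(k):
        subject_effect = rng.gauss(0.0, between_var ** 0.5)
        groups.append(
            [mu + subject_effect + rng.gauss(0.0, within_var ** 0.5) for _ in range(n)]
        )
    return groups


def bootstrap_between_share(groups, n_boot=2000, seed=0):
    """Percentile interval for between_var / (between_var + within_var)."""
    rng = random.Random(seed)
    k, n = len(groups), len(groups[0])
    mu = statistics.fmean([v for g in groups for v in g])
    within, between = variance_components(groups)
    shares = []
    for _ in range(n_boot):
        w, b = variance_components(simulate(mu, within, between, k, n, rng))
        shares.append(b / (b + w))
    shares.sort()
    return shares[int(0.025 * n_boot)], shares[int(0.975 * n_boot) - 1]


# Hypothetical repeated TSH measurements (mIU/L), one inner list per patient.
tsh = [
    [1.1, 1.3, 1.2, 1.0],
    [2.4, 2.6, 2.5, 2.3],
    [0.8, 0.9, 0.7, 0.9],
    [3.1, 2.9, 3.0, 3.2],
    [1.8, 1.7, 1.9, 1.6],
]

within_var, between_var = variance_components(tsh)
low, high = bootstrap_between_share(tsh)
print(f"within={within_var:.4f} between={between_var:.4f} interval=({low:.3f}, {high:.3f})")
```

If most of the variance lies between patients, as in this made-up example, a single population-wide reference range is wide relative to how much any one patient's values move, which is the argument for patient-specific ranges.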


Investigating whether Hispanic children with Duchenne muscular dystrophy treated with steroids have a higher body mass index than their non-Hispanic counterparts. This study aims to determine whether Hispanic patients should only be treated using medications with a lower risk of weight gain.

Neonatal readmission

What is the relationship between gestational age, day of life at NICU admission and risk of readmission? In this study, we assess how gestational age and day of life (DOL) at neonatal intensive care unit (NICU) admission modify the risk of 30-day unplanned hospital readmission.


To examine whether medical history (diagnoses, prescriptions, encounters, etc.) and emergency department encounter data (at triage) can be used to predict whether a patient will be admitted to the general floor, admitted to the ICU or discharged home during an ED visit for asthma exacerbation. Additionally, this study examines whether medical history and encounter data (from triage to discharge) can be used to predict whether patients will have a return visit within 14 days of being discharged home after an ED visit for asthma. This study aims to identify historical and at-encounter characteristics that may impact emergency department discharge disposition to improve patient outcomes.

Intensive care medicine

ICU late transfers and bounce backs: Assessing models to reduce late transfers to the ICU and premature discharge from the NICU. The goal is a real-time ICU model that informs providers of rapidly deteriorating patient conditions and the likelihood of relapse after ICU treatment.

How can we learn using structured tabular electronic health records from the past to predict the future in real time and improve clinical outcomes?

Hospital readmission

Rebuilding and redeploying the hospital’s readmission model to address the impact of COVID-19, distributional shifts and important sub-specialty populations, including neonatal and mental health utilization.

Rising risk model

Predicting which patients are likely to become a rising risk for increased resource utilization in the following year. Healthcare utilization is used as proxy for deterioration in health that may be preventable. Study involves analyses of confounding due to challenges with access to care.

Earliest warning system for sepsis

Updating and redeploying an ED triage model for predicting patients who are at risk for sepsis and may require early interventions.

Autism triage

Developing a model to determine whether a confirmatory or comprehensive autism evaluation is appropriate for a patient. This model will aid with triaging patients at the Thompson Autism Center.

Juvenile Dermatomyositis (JDM)

Patients undergoing treatment for JDM respond to treatment differently. In this study, we are developing models to predict early or late response to treatment to inform interventions that will increase the proportion of patients with early and sustained response to treatment.

CPAP Failure

Noninvasive oxygen therapy is preferred over more invasive procedures if it will suffice and for as long as it suffices. In this study, we are developing models to predict, as early as possible, the need to change oxygen therapy from CPAP to more invasive options. This includes determining that invasive therapy is appropriate from admission, or predicting the optimal time to change from CPAP to ECMO or mechanical ventilation. Previous research has shown that a late change in therapy can increase morbidity. However, unnecessary use of invasive oxygen therapy also increases morbidity.

How can we learn from unstructured data (images, clinical notes, movies) to predict the future in real time and improve clinical outcomes?

Mental health predictions

  • Disparity of care due to SES: Determining whether there are disparities in mental health outcomes by socioeconomic status using unstructured clinical/provider notes. Socioeconomic status will be inferred from the notes as well as health insurance payer type.
  • Disparity of care among LGBTQ+ patients: Examining health care disparities and non-healthcare factors among LGBTQ+ patients that may exacerbate the impact of COVID-19 on mental health. Deidentified copies of clinical notes will be analyzed to extract clinical entities relating to LGBTQ+ status, putative risk factors and other statistically identified factors that may impact outcomes or are confounders therein.

Steroid free remission (inflammatory bowel disease)

The primary objective is to predict steroid-free remission at 12 and 52 weeks following pathology-confirmed diagnosis of ulcerative colitis or Crohn’s disease, using the corresponding tissue pathology. In addition, multimodal networks will be trained on images of tissue pathology, structured data and clinical notes on clinic visits and pathology. Findings will help inform treatment decisions and clinical practice guidelines for these patients.

Focal cortical dysplasia (FCD)

This study encompasses the detection and segmentation of abnormal lesions associated with FCD using MR imaging. FCD is the most common cause of intractable focal epilepsy in children, in which neurocognitive dysfunction and behavioral problems may also be present. The objective of this study is to improve pre-surgical detection and diagnosis of FCD from MRI using deep learning techniques. Furthermore, segmentation of regions of interest may guide surgical interventions as well as timing of surgery.

Rare diseases (Juvenile Dermatomyositis Nailfold Capillaroscopy Analyses)

  • Predicting disease activity
  • Detecting density of end-row capillary loop

Peripherally Inserted Central Catheter (PICC) line complications

Can we predict PICC line complications (such as infections) using cellphone images of the insertion site? Children with chronic or critical illnesses may require insertion of central venous catheters (CVCs) for delivery of medications and other intravenous fluids. These CVCs increase the risk of complications, including central line-associated bloodstream infections (CLABSI), and may expose patients to increased morbidity and mortality. Research identifying risk factors of CVC complications, and the corresponding machine learning applications, have depended on structured electronic health records of hospitalized patients. Our multidisciplinary team of clinicians and data scientists aims to extend the type of data used for prediction of CVC complications to include images of PICC line insertion sites, which requires deep convolutional neural networks. Expected outcomes include significant improvements in predicting in-hospital CLABSI and novel models and applications for at-home monitoring.

Context-aware clinical notes summarization

Providers can learn a lot about a new patient’s history through conversations with the patient and family. In this study, we will be applying existing algorithms as well as developing new ones for context-aware clinical notes summarization, starting with sub-specialties and expanding to more general models. These summarizations will fill gaps inherent in verbal recollection of a patient’s history and ensure providers get information in the context most helpful for treatment.

How can we solve computational problems requiring new numerical solutions to difficult optimization models?

Small samples with high dimensions

Development of new statistical learning algorithms for conditions (such as rare diseases) in which sample sizes are small but the data, including genomic data, are high dimensional.

How can we build the computational and data science systems required to deploy real-time clinical models, integrate with EHR workflow, and develop corresponding intervention protocols?

  • Hospital readmission (proven to reduce readmission rates and reduce healthcare expenditure)
  • Rising risk (we predict who will need more care within 12 months)
  • Sepsis (we developed the earliest warning system for sepsis at ED triage and are expanding to real-time predictions over the ED and any corresponding hospital stay)
  • Autism (we are developing models to help with triaging patients requesting evaluation for autism)

Contact Us


Location: CHOC Commerce Tower
505 Main St. 10th Floor
Orange, CA 92868


Research Database List

This list of databases constitutes the major sources of research data at CHOC. Please reach out to us directly with specific questions and to clarify your research needs.

EMR data may be classified by their format. Structured data are tabular data that fit into an SQL (Structured Query Language) table. Unstructured data include clinical notes, other free-text data, images and movies.
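
As a minimal illustration of what "fits into an SQL table" means, the sketch below loads a few made-up laboratory results into an in-memory SQLite database and summarizes them per patient. The table name, column names and values are hypothetical and are not the schema of any CHOC database.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_result ("
    "patient_id INTEGER, test_name TEXT, value REAL, unit TEXT)"
)

# Each row is one discrete observation: this is what makes the data "structured".
conn.executemany(
    "INSERT INTO lab_result VALUES (?, ?, ?, ?)",
    [
        (1, "TSH", 1.0, "mIU/L"),
        (1, "TSH", 1.5, "mIU/L"),
        (2, "TSH", 3.0, "mIU/L"),
        (2, "T4", 8.0, "ug/dL"),
    ],
)

# Count and average the TSH results for each patient.
rows = conn.execute(
    "SELECT patient_id, COUNT(*), AVG(value) "
    "FROM lab_result WHERE test_name = ? "
    "GROUP BY patient_id ORDER BY patient_id",
    ("TSH",),
).fetchall()

for patient_id, n_tests, mean_value in rows:
    print(patient_id, n_tests, mean_value)

conn.close()
```

Clinical notes and images have no such fixed row-and-column shape, which is why they need separate processing before they can be queried this way.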

Structured (tabular) data

The RCS team provides access to this type of data using instances of CHOC’s HealtheIntent database. These data include diagnosis codes, medications, laboratory test values, vital signs and discrete elements captured from certain forms. There are other sources of data that the RCS team may use, as needed, to support your research.

Unstructured data

  • PACS images: please reach out to us if you have research questions that require medical imaging.
  • Clinical notes: please reach out to us if you have research questions that require automated natural language processing of clinical notes. We will let you know which types of research or information retrieval we can automate using natural language processing.
  • Feel free to inquire about any other needs for CHOC clinical data for research

This database consists of tabular data from more than 100 health systems in the United States. It does not include clinical notes, medical images or other forms of unstructured data. However, it contains data across all care settings on patient encounters, medications, diagnoses, laboratory tests, orders, procedures, allergies, vital signs and other clinical events, as well as information on the deidentified health systems contributing to the database. The database includes more than:

  • 100 million patients (of which 20 million are children less than 18 years)
  • 1.5 billion encounters
  • Tens of billions of clinical data points on these patients

The best approach for these data types is to reach out to us. We will connect with the Research Coordinators of each division to request access to the data required for your research (if we do not already have access). There are many disease- or division-specific registries and external data sources. Here are some examples for Trauma and Psychology.

Trauma


  • National Inpatient Sample 1988-2018 (PI must sign the national HCUP DUA; state databases are available but require separate applications to each state)
  • National Emergency Department Sample 2006-2018 (PI must sign the national HCUP DUA; state databases are available but require separate applications to each state)
  • National Trauma Data Bank 2007-2018 (National submission of protocol application/approval required by ACS COT)
  • National Emergency Medical Services Information System 2017-2020 (National submission of protocol application/approval required by NHTSA/EMS)
  • National Electronic Injury Surveillance System 2000-2020 (Publicly available by CPSC)
  • National Violent Death Reporting System 2002-2018 (National submission of protocol application/approval required by CDC)
  • Kids' Inpatient Database 2012, 2016 (PI must sign the national HCUP DUA; state databases are available but require separate applications to each state)

Psychology


  • National Survey on Drug Use and Health 2004-2011 (Publicly available by SAMHSA)
  • Mental Health Client-Level Data 2013-2019 (Publicly available by SAMHSA)
  • Treatment Episode Data Set: Admission 2001-2019 (Publicly available by SAMHSA)
  • Treatment Episode Data Set: Discharges 2006-2019 (Publicly available by SAMHSA)

Research Study Process and Standard Operating Procedures of the RCS Team

The RCS team is charged with supporting all members of the organization interested in research. As a result, each team member is often tasked with managing multiple projects as primary responsibility. Estimated timelines will vary by workload.

  1. Conception and ideation (2-8 weeks)
    1. An iterative process to reconcile the clinical and statistical significance of the study hypotheses or questions.
    2. If the investigator is uncertain of what to study, we engage in discussions and perform a literature review to determine what is important to the investigator and what is pertinent to improving the field.
    3. Once the investigator has chosen a question of interest, we work to refine ideas specific to the available data, and in consideration of statistical, machine learning or artificial intelligence (AI) model options.
    4. Typical timeline: Highly variable, 2 to 8 weeks.
    5. Completion of milestone: 1 to 2 sentences that clearly state a hypothesis or study question.
  2. Protocol development (4-8 weeks)
    1. Study design: The team will work with investigators to design the study, ensuring a balance of clinical and statistical significance.
    2. Protocol development: The RCS team will draft the Methods and Statistical Consideration sections of the IRB protocol for submission.
    3. Average timeline (without considering IRB review time): 4 to 8 weeks.
    4. Completion of milestone: IRB approval (or waiver).
  3. Data retrieval (Timeline is variable and depends on scope of study)
    1. Possibly the most time-intensive part.
    2. The team will develop a data retrieval plan in collaboration with the investigator.
    3. Data retrieval and preprocessing.
    4. Internal validation of data.
    5. Summary statistics and clinical review of data with the clinical team.
    6. Average timeline: Variable and difficult to predict. The timeline will depend on data sources and scope of variables as well as potential for unexpected events during data retrieval. The RCS team will provide a timeline on a case-by-case basis.
    7. Completion of milestone: Summary statistics as appropriate for study.
  4. Data analyses (Timeline is variable and depends on scope of study)
    1. This involves appropriate statistical, machine learning and AI modeling constrained by appropriate clinical consideration, approved IRB protocol and study design.
    2. Average timeline: Variable and dependent on the number of hours spent per week on the project. This may involve some iterations. The RCS team will provide a timeline on a case-by-case basis.
    3. Completion of milestone: Adoption of statistical, machine learning or AI model.
  5. Manuscript assistance (2-4 weeks)
    1. The team will draft the Methods and Results sections while the investigator is responsible for drafting the Introduction, Discussions and other sections of the paper.
    2. Average timeline: 2 to 4 weeks.
    3. Completion of milestone: Draft of Methods and Results sections.
  6. Peer review revisions
    1. The team will assist in peer review revisions that involve changes in models developed or study design as appropriate.
    2. Average timeline: Unpredictable.
  7. Closing of project or study
    1. This occurs after publication and/or presentation of results.