Strategies of harmonization
- Prospective harmonization
- Typically used in multi-center studies, this strategy imposes strict standards and protocols from the start: all cohort studies share the same study design, survey instruments, metadata, and so on. Some adaptations may be made for individual data collection sites, but the goal is to maintain comparability.
- Ex-ante retrospective harmonization
- This strategy combines data from cohort studies that were not specifically designed to be comparable but that used standard collection tools and standard operating procedures, permitting the data to be integrated easily.
- Ex-post retrospective harmonization
- This strategy combines data from cohort studies that were not specifically designed to be comparable and that, in general, used no standard formats or protocols. The data can nevertheless be assessed and edited to achieve commonality through data processing procedures.
Data processing methods
- Algorithmic
- Harmonize the same measures (continuous, categorical, or both) that have different but combinable ranges or categories.
- Calibration
- Harmonize measures to the same metric.
- Standardization
- Harmonize the same constructs measured using different scales with no known calibration method or bridging items.
- Latent variable model
- Harmonize the same constructs measured using different scales with no known calibration method but with bridging items present.
- Multiple imputation
- Harmonize datasets (rather than individual variables) so that they share the same set of variables, using bridging variables to impute those that are missing.
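As a minimal sketch of two of the methods above, the following Python snippet recodes study-specific smoking categories onto a common scheme (algorithmic harmonization) and converts scale scores to within-study z-scores (standardization). All variable names, category codes, and data values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch of two data processing methods on hypothetical study data.
# Study A codes smoking status as 1/2/3; study B uses free-text labels.

# --- Algorithmic harmonization: map study-specific codes to a common category set
STUDY_A_MAP = {1: "never", 2: "former", 3: "current"}
STUDY_B_MAP = {"no, never smoked": "never",
               "used to smoke": "former",
               "smokes now": "current"}

def recode(values, mapping):
    """Recode study-specific values onto the harmonized categories."""
    return [mapping[v] for v in values]

# --- Standardization: express scores from different scales as within-study z-scores
def standardize(scores):
    """Return z-scores so scales with different ranges become comparable."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in scores]

study_a_smoking = recode([1, 3, 2, 1], STUDY_A_MAP)
study_b_smoking = recode(["smokes now", "no, never smoked"], STUDY_B_MAP)
study_a_z = standardize([10, 14, 12, 16])  # e.g. scores on a hypothetical 0-20 scale
```

Note that standardization only makes scores comparable in relative terms (position within each study's distribution); it does not recover a shared metric the way calibration does.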
Types of infrastructure
- Data are centrally located
- Data from all studies are stored on the same server.
- Data are in different locations
- Data from each study are stored on that study's own local server, and each study imposes its own data restrictions.
- Some centrally, other locally
- Some studies share their datasets for storage on a common central server, while other studies keep their datasets on their own local servers.
Integrative data analysis
- Meta-analysis
- Combines the summary results of multiple studies addressing the same variable.
- Pooled-analysis
- Analyses carried out at the individual level after pooling the data from all studies.
- Federated analysis
- Centrally coordinated analysis in which the individual-level data remain on the studies' local servers.
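The contrast between meta-analysis and federated analysis can be sketched in Python: a fixed-effect (inverse-variance weighted) meta-analysis pools study-level estimates, while a federated computation derives a pooled mean from non-disclosive site summaries (sum and count) without moving individual-level data. The numbers below are hypothetical, and the federation step is greatly simplified compared with real tools such as DataSHIELD.

```python
# Hypothetical study-level results: (effect estimate, standard error) per study.
results = [(0.30, 0.10), (0.45, 0.15), (0.25, 0.08)]

def fixed_effect_meta(results):
    """Fixed-effect meta-analysis: inverse-variance weighted pooled estimate."""
    weights = [1 / se ** 2 for _, se in results]
    pooled = sum(w * est for (est, _), w in zip(results, weights)) / sum(weights)
    se_pooled = (1 / sum(weights)) ** 0.5
    return pooled, se_pooled

# Federated-style pooled mean: each site releases only (sum, n), never raw values.
def site_summary(values):
    """Non-disclosive summary a site would return to the coordinating center."""
    return sum(values), len(values)

def federated_mean(summaries):
    """Combine per-site (sum, n) summaries into a pooled mean."""
    total = sum(s for s, _ in summaries)
    n = sum(n for _, n in summaries)
    return total / n

site_summaries = [site_summary([1.0, 2.0]), site_summary([3.0, 4.0, 5.0])]
```

The design difference is what leaves each site: meta-analysis exchanges model estimates, the federated sketch exchanges sufficient statistics, and pooled analysis would require the raw individual-level records themselves.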
Software
- OBiBa (Opal/Mica)
- The OBiBa software suite (obiba.org), developed by Maelstrom Research (maelstrom-research.org) and Epigeny (epigeny.io), provides advanced software components for data harmonization and federation, allowing study networks to share data securely among their members.
- DataSHIELD
- DataSHIELD is a method that enables advanced statistical analysis of individual-level data from several sources without actually pooling the data from these sources together (datashield.ac.uk).
- Molgenis
- Molgenis is an open-source web application to collect, manage, analyze, visualize, and share large and complex biomedical datasets (molgenis.org).
- CharmStats
- CharmStats lets you work with your variables, document the harmonization process as you go, and even publish your completed harmonization electronically for review and citation (gesis.org/en/services/data-analysis/data-harmonization).
- R / R Markdown
- R is a free software environment for statistical computing and graphics (r-project.org). R Markdown turns R analyses into reproducible documents (rmarkdown.rstudio.com).
- Stata
- Stata is a statistical software for data management, statistical analysis, graphics, simulations, regression, and custom programming (stata.com).
- SAS
- SAS is a statistical software suite for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics (sas.com).
- SPSS
- SPSS is a software platform that offers advanced statistical analysis, a vast library of machine learning algorithms, text analysis, open-source extensibility, integration with big data and seamless deployment into applications (ibm.com/analytics/spss-statistics-software).