Welcome to CFDE Documentation’s documentation!

The Common Fund Data Ecosystem’s Crosscut Metadata Model (CFDE C2M2) Roadmap

This document introduces the Crosscut Metadata Model (C2M2), a flexible standard for describing biomedical experimental data. The Common Fund Data Ecosystem (CFDE) group is creating a new infrastructure, with C2M2 as its central concept, through which powerful cross-dataset searches, custom aggregation of experimental data, and scale-powered statistical analysis methods will be made possible for the biomedical research community at an unprecedented scope.

Using this new infrastructure, the data coordinating centers (DCCs) can share structured information (metadata) about their experimental resources with the research community, dramatically widening access to usable observational data and accelerating discovery.

DCC Metadata Submissions

The DCCs will collect and provide metadata to CFDE describing experimental resources within their purview. Each metadata submission will take the form of a collection of tab-separated value (TSV) files. Precise formatting requirements for these TSV collections will be specified by JSON schema documents implementing the Data Package meta-specification published by the Frictionless Data group. These schemas will be used by the CFDE software infrastructure to automatically validate submission format compliance and metadata integrity during the metadata ingestion process.

The CFDE will offer the DCCs multiple alternatives for metadata submission formats, all of which will be automatically interoperable with the C2M2 ecosystem. These alternative formats are arranged in levels, tiered according to increasing complexity and reflecting anticipated differences in the relative richness of metadata available to different DCCs at any time. The general expectation will be that the metadata submitted and managed by a DCC will be able to transition, over time, through increasingly rich modeling levels as the life cycle of DCC/CFDE technical interaction progresses, which will enable increasingly powerful downstream applications.

C2M2 Richness Levels

In its current form, C2M2 is an entity-relationship system that models common properties of resources fundamental to biomedical research like subjects, digital files, events, biospecimens, and project datasets. Essential relationships between these fundamental resources are also formally described, documenting, for example:

  • the samples that were processed to produce a particular data file
  • which subject a given sample was drawn from (possibly obfuscated to protect patient privacy)
  • when a particular blood pressure measurement was made

We know our current model is not yet rich enough to contain all the metadata that will be required by all DCCs; at the same time, it already has too many terms for most DCCs to use easily. Modeling and data wrangling are always difficult, even for experts. Requiring every DCC to model their metadata using all possible features of the current C2M2 model as a precondition for submitting metadata to CFDE would be impractical for several important reasons.

  1. Many of the DCCs host human data, and for all but very basic file information, the associated metadata is protected. Currently, the CFDE does not have protected data access or permission to make these metadata searchable. Requiring DCCs to supply these terms would leave most DCCs unable to comply legally.
  2. From an operational standpoint, the C2M2 model must remain as flexible as possible, especially during its developmental phases, to accommodate mutual learning between the DCCs and CFDE as the process of data ingestion develops. We do not want to lock ourselves in to our current model topology before we begin adding real data. It is far more expensive and error-prone to repeatedly change a complex model than it is to build one gradually from a simpler core concept that can stabilize before more specialized branches are added.
  3. Even if the issues with protected metadata were solved and the model was perfected, the complexity of the overall model would create avoidable and unnecessary onboarding delays for any new DCC.

With the design of C2M2, we are splitting the difference between the ease of evolution inherent in a simple model and the operational power provided to downstream applications by more complicated and difficult-to-maintain frameworks. DCCs with advanced, operationalized metadata modeling systems of their own should not encounter arbitrary barriers to CFDE support for more extensive relational modeling of their metadata if they want it; CFDE will maintain such support by iteratively refining the current C2M2 model according to needs identified while working with more operationally advanced DCCs.

Newer or smaller DCCs, by contrast, may not currently have enough readily available information to describe their experimental resources using the most complex C2M2 modeling level. The CFDE will support cases like these by offering simpler but still well-structured metadata levels, lowering some of the barriers to rapid entry into the data ecosystem. We expect this concept of levels to be useful even after all current DCCs are onboarded: when the Common Fund funds new programs, they will all have ramp-up phases in which their data is less rich than that of the more mature DCCs.

Simpler C2M2 metadata levels must be maintained by the CFDE to maximize interoperability with more complex C2M2 variants; the whole system should be structured to minimize the negative side effects of overall model changes. These considerations have led to the creation of C2M2 richness levels: concentric, canonical subsets of C2M2 that are benchmarked at increasing levels of model complexity and detail, wherein each successive modeling level is a value-added superset of all of the metadata encompassed by the previous (less complex) level.
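The superset relationship between levels can be illustrated with a small sketch. The entity sets below are hypothetical: they reuse entity names mentioned elsewhere in this document (file, bio_sample, subject, project, collection) and are not an official listing of what each level contains.

```python
# Hypothetical entity sets per richness level. The names follow entities
# mentioned in this document; they are NOT an official C2M2 listing.
LEVELS = {
    0: {"file"},
    1: {"file", "bio_sample", "subject", "project", "collection"},
}


def is_concentric(levels):
    """Return True if every level's entity set contains the previous level's."""
    ordered = [levels[key] for key in sorted(levels)]
    return all(lower <= higher for lower, higher in zip(ordered, ordered[1:]))


def added_by(levels, n):
    """Return the entities that level n adds on top of level n - 1."""
    return levels[n] - levels.get(n - 1, set())
```

Under these assumptions, a check like `is_concentric(LEVELS)` would fail as soon as a higher level dropped a term that a lower level defines, which is the kind of model change the richness levels are meant to guard against.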

Presently, the CFDE offers two less complex C2M2 variants in addition to the most complex, current C2M2 model: Level 0 (basic metadata describing a collection of digital files) and Level 1. Level 1 accomplishes the following:

  • Introduces terms for core experimental resources, like samples and subjects.
  • Provides a rudimentary set of search targets in the form of annotations, like the anatomical location of the source for a human tissue sample or taxonomic data describing sample source organisms and study subjects.
  • Supports arranging experimental resources into sub-collections based on a hierarchy of projects, studies, or other similar subdivisions of research ownership and responsibility.

Levels 0 and 1 consist entirely of unprotected metadata terms, and therefore should be achievable by all DCCs. These two levels are sufficient to support two minimal Use Cases. Using only the metadata from Levels 0 and 1, a researcher can find datasets from across the Common Fund that contain information about her tissue of interest, assayed using her method of choice. Similarly, staff at the NIH can compare the tissues, species, and assay types available across DCCs. More complex Use Cases, such as the ability for a researcher to find datasets for patients with a specific disease in a specific age range, will require the DCCs to submit Level 2, protected, metadata terms. However, this is limited by our access to protected data, as we cannot legally accept that data until certain security and administrative issues are solved.

C2M2 Development Roadmap


Level 0

C2M2 Level 0 defines a minimal valid C2M2 instance. Data submissions at this level of metadata richness will be the easiest to produce and will support the simplest available functionality implemented by downstream applications.

The full specification can be found in the accompanying Level 0 popout document.


Level 1

Level 1 introduces tables for core experimental resources like:

  • Samples and subjects
  • Search targets in the form of annotations, like the anatomical source for a given tissue sample
  • Host species taxonomy for samples and subjects
  • Basic support for arranging experimental resources into sub-collections based on a hierarchy of projects or studies

The full specification can be found in the accompanying Level 1 popout document.


(Level 2)

Level 2 introduces tables for protected subject-level metadata like:

  • Sex
  • Age
  • Disease
  • Health conditions
  • Vital statistics, e.g., height, weight, blood pressure

We have a draft specification of Level 2; preliminary documents for the schema can be found here. Finalization of Level 2 will be completed to achieve a CFDE portal demonstration in December 2020. Level 2 can be used to represent protected data, which will require the completion of several important administrative and policy milestones.


(Level 3)

Level 3 introduces tables for core experimental resources like:

  • Sequencing technology, e.g. Illumina, nanopore, 454
  • Geographic location
  • Race/ethnicity (human)/strain (mouse)
  • Sample collection date
  • SOP used for extraction or analysis process

The full specification is currently outlined, but not under active development. It will be formalized in the coming months and implemented in a later demo. As with Level 2, it is dependent on CFDE access to protected data.


(Level N)

Based on feedback from the DCCs, and the terms that are important for searching their data, we anticipate the need for further levels. Until we begin receiving metadata from the programs, we cannot predict what terms will be added or what constellations of terms will be required to make a compliant model. However, we do not expect the number of levels to exceed 5, as some new terms will best fit as amendments to previously defined levels.

The full specification is expected, but not outlined or under active development. It will be outlined once the DCCs begin to submit data and formalized in collaboration with our DCC partners. As with Level 2, implementation of these levels will be dependent on CFDE access to protected data.


Asset Manifest Specification

Introduction

Technical Specification for CFDE Data Assets and Instructions for Preparing Data Asset Manifests

This document reviews the technical details of the Common Fund Data Asset Specification, explains how to build formal Data Asset Manifests describing collections of experimental data files, and describes how to prepare and submit these manifests to the CFDE database. To understand this process, we first review the Crosscut Metadata Model (C2M2), which is used to describe experimental resources. C2M2 is divided into numbered “levels” that reflect increasing degrees of complexity of data and metadata descriptions. C2M2 Level 0 is the subject of this document and the level at which data assets and the data asset manifest are defined. It captures the minimum information needed to describe a basic inventory of all of a DCC’s digital files; higher C2M2 levels will support queries based on more complex experimental metadata via the CFDE portal.

Background

The CFDE Crosscut Metadata Model (C2M2)

The Common Fund Data Ecosystem group is creating a new software system centered around the Crosscut Metadata Model (C2M2), a flexible technical standard for modeling biomedical experimental resources and data at any of the several predefined levels of model complexity. This system is designed to support powerful cross-dataset and cross-institute searches, custom aggregation of experimental data, and scale-powered statistical analysis methods for the biomedical research community, all at an unprecedented scope.

Using the C2M2 system, Common Fund Data Coordinating Centers (DCCs) will be able to share structured information (metadata) about their experimental resources with the research community, widening and deepening access to usable observational data and accelerating discovery.

DCC Metadata Submissions

DCCs will collect and provide metadata to the CFDE describing experimental resources within their purview. Each metadata submission will take the form of a collection of tab-separated value (TSV) files. Precise formatting requirements for these TSV collections will be specified by JSON schema documents implementing the Data Package meta-specification published by the Frictionless Data group. These schemas will be used by the CFDE software infrastructure to automatically validate submission format compliance and metadata integrity during the metadata ingestion process.

The CFDE will offer the DCCs several alternative metadata submission formats, all of which will be automatically interoperable with the C2M2 system. These alternative formats are arranged in levels tiered by increasing complexity and reflecting anticipated differences in the relative richness of metadata producible by different DCCs at any time. The expectation will be that the metadata submitted and managed by a DCC will be able to transition, over time, through increasingly rich C2M2 modeling levels as the life cycle of the DCC/CFDE technical interaction progresses, which will enable increasingly powerful downstream applications.

Technical Specification

C2M2 Level 0: A Basic Metadata Manifest of Digital File Assets

This C2M2 Level 0 specification defines a minimal valid C2M2 instance. DCC metadata submissions at this level of model complexity will be the easiest to produce and will support the simplest available functionality implemented by downstream C2M2-driven applications.

Level 0 Submission Process Overview

Metadata submissions by the DCCs to the CFDE that are compliant with C2M2 Level 0 will consist of two TSV files:

  • file.tsv will be a manifest of digital file assets that a DCC wants to introduce into the C2M2 metadata ecosystem. The properties of the file entity in the C2M2 Level 0 model (see below for the model diagram and a list of property definitions) will serve as column headers for file.tsv and each TSV row will represent a single file. DCCs will prepare file.tsv using data describing digital files within their management purview.
  • namespace.tsv will serve as a formal structural placeholder for a namespace identifier, which will be assigned to each DCC by the CFDE. The CFDE will create and furnish a namespace.tsv file for each DCC to include with Level 0 submissions.
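As a rough illustration of how a DCC might assemble file.tsv, the sketch below builds one row per local file and writes the rows as tab-separated text. The column names are the file properties defined in the technical specification below; the helper names (`sha256_of`, `manifest_row`, `write_manifest`) and the column order are assumptions made for this example, not part of any CFDE tooling.

```python
import csv
import hashlib
import os

# Column names are the Level 0 file properties from this document.
# The order used here is an assumption for illustration.
COLUMNS = ["id_namespace", "id", "size_in_bytes", "sha256", "md5",
           "persistent_id", "filename"]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(path, id_namespace, file_id):
    """Build one file.tsv row for a local file; optional fields stay empty."""
    return {
        "id_namespace": id_namespace,   # assigned to the DCC by the CFDE
        "id": file_id,                  # assigned by the DCC
        "size_in_bytes": os.path.getsize(path),
        "sha256": sha256_of(path),
        "md5": "",
        "persistent_id": "",
        "filename": os.path.basename(path),  # no path information
    }


def write_manifest(rows, out_path):
    """Write rows to a tab-separated file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

The sketch fills in sha256 and leaves md5 empty, since the specification only requires one of the two checksums; a DCC that already has MD5 hashes could populate md5 instead.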

C2M2 Level 0 encodes the most basic file metadata; its use by downstream applications will be limited to informing the least specific level of data accounting, querying, and reporting.

[Figure: Level 0 model diagram]

Level 0 Technical Specification: Properties of the file Entity

Required: id_namespace, id, and at least one of sha256 or md5

| Property | Description |
|:---:|:---|
| id_namespace | String identifier assigned by the CFDE to the DCC managing this file. The value of this property will be used together with id (assigned to each file by the DCC that owns it) as a paired-key structure formally identifying Level 0 file entities within the total C2M2 data space. |
| id | Unrestricted-format string identifying this file, assigned by the DCC managing it. Can be any string as long as it uniquely identifies each file within the scope of a single Level 0 metadata submission. |
| size_in_bytes | The size of this file in bytes. This varies, even for “copies” of the same file, across differences in storage hardware and operating systems. The CFDE does not require any particular method of byte computation; file size integrity metadata will be provided in the form of checksum data in the sha256 and/or md5 properties. size_in_bytes will instead underpin automatic reporting of basic storage statistics across different C2M2 collections of DCC metadata. |
| sha256 | CFDE-preferred file checksum string: the output of the SHA-256 cryptographic hash function after being run on this file. One or both of sha256 and md5 is required. |
| md5 | Permitted file checksum string: the output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this file. One or both of sha256 and md5 is required. The CFDE recommends using the SHA-256 algorithm if feasible, but we recognize the nontrivial overhead involved in recomputing these hash values for large collections of files, so if MD5 hashes have already been generated, then the CFDE will accept them. |
| persistent_id | A persistent, resolvable URI generated by a DCC (e.g., using the CFDE minid server) and permanently attached to this file. It serves as a permanent address to which landing pages, which summarize metadata associated with this file, and other relevant annotations and functions can eventually be attached, including (optionally) resolution to a network location from which the file can be downloaded. Actual network locations must not be embedded directly within this identifier; one level of indirection is required to allow network addresses to change over time as files are moved around. |
| filename | A filename with no prepended PATH information. |
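The rules in the property table above can be checked before submission with a short script. The following is a minimal sketch, not the CFDE validator: it checks the required fields, uniqueness of the (id_namespace, id) pair, and that filename carries no path information. The hexadecimal length checks (64 characters for SHA-256, 32 for MD5) assume checksums are given as hex digests, which this document does not state explicitly.

```python
import csv
import io
import re

# Assumption: checksums are hex digests (64 chars for SHA-256, 32 for MD5).
SHA256_RE = re.compile(r"^[0-9a-fA-F]{64}$")
MD5_RE = re.compile(r"^[0-9a-fA-F]{32}$")


def validate_file_rows(rows):
    """Return a list of human-readable problems found in Level 0 file rows."""
    problems = []
    seen = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        namespace = row.get("id_namespace") or ""
        file_id = row.get("id") or ""
        if not namespace or not file_id:
            problems.append(f"line {line_no}: id_namespace and id are required")
        elif (namespace, file_id) in seen:
            problems.append(f"line {line_no}: duplicate id {file_id!r}")
        seen.add((namespace, file_id))

        sha256 = row.get("sha256") or ""
        md5 = row.get("md5") or ""
        if not sha256 and not md5:
            problems.append(f"line {line_no}: one of sha256 or md5 is required")
        if sha256 and not SHA256_RE.match(sha256):
            problems.append(f"line {line_no}: sha256 is not a 64-char hex digest")
        if md5 and not MD5_RE.match(md5):
            problems.append(f"line {line_no}: md5 is not a 32-char hex digest")

        filename = row.get("filename") or ""
        if "/" in filename or "\\" in filename:
            problems.append(f"line {line_no}: filename must not include a path")
    return problems


def validate_tsv_text(text):
    """Parse tab-separated text with a header line and validate each row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return validate_file_rows(reader)
```

An empty list means no problems were found by these particular checks; it does not guarantee that a submission passes the official schema validation.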

Schema

Level 0 Metadata Submission: frictionless.io datapackage.json Schema Specification

The JSON schema document formally specifying all data constraints on Level 0 TSVs is located here, and an example Level-0-compliant TSV submission can be found here, as just the file.tsv portion or as a full BDBag archive.
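For readers unfamiliar with the Data Package format, the sketch below builds a descriptor of the general shape such a schema takes. It is illustrative only and is not the CFDE Level 0 schema linked above: the package name, field types, and use of `primaryKey` are assumptions based on the property descriptions in this document and on the Frictionless Table Schema conventions.

```python
import json


def level0_like_descriptor():
    """Build an illustrative Data Package descriptor for file.tsv.

    This is NOT the official CFDE schema; types and constraints are assumed.
    """
    fields = [
        {"name": "id_namespace", "type": "string",
         "constraints": {"required": True}},
        {"name": "id", "type": "string", "constraints": {"required": True}},
        {"name": "size_in_bytes", "type": "integer"},
        {"name": "sha256", "type": "string"},
        {"name": "md5", "type": "string"},
        {"name": "persistent_id", "type": "string"},
        {"name": "filename", "type": "string"},
    ]
    return {
        "name": "example-level-0-submission",  # hypothetical package name
        "resources": [
            {
                "name": "file",
                "path": "file.tsv",
                "dialect": {"delimiter": "\t"},  # tab-separated values
                "schema": {
                    "fields": fields,
                    # id_namespace + id act as the paired key described above
                    "primaryKey": ["id_namespace", "id"],
                },
            }
        ],
    }


print(json.dumps(level0_like_descriptor(), indent=2))
```

This sketch does not attempt to express the rule that at least one of sha256 and md5 must be present; how the actual schema enforces that rule should be taken from the linked document.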

Proposed Technical Specification

Level 1

The specifications for this level are provided for informational use only; they are still under internal development and are not yet finalized. Do not use this version for creating CFDE manifests.

Level 1 introduces models for core experimental resources like:

  • samples and subjects
  • search targets in the form of annotations like the anatomical source for a given tissue sample
  • host species taxonomy for samples and subjects
  • basic support for arranging experimental resources into sub-collections based on a hierarchy of projects or studies

Level 1 also introduces two containers for aggregating experimental resources and metadata:

  • project describes the administrative hierarchy (funding, contracts, and so on) that governs ownership of, management of, and responsibility for subcollections of experimental resources and metadata
  • collection allows any non-cyclic grouping to be assigned to subcollections of experimental resources and metadata, independently of the contract, funding, ownership, or accountability and reporting structures encoded by project. It is similar in concept to “dataset” but does not imply the existence of a formally prepared, publication-level data package; any coherent and meaningful grouping can be encoded here
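The non-cyclic requirement on collection groupings can be checked mechanically. The sketch below assumes a hypothetical in-memory representation in which each collection id maps to the ids of the collections nested inside it; the draft specification does not yet define how nesting is encoded, so both the representation and the function name are illustrative.

```python
def find_cycle(members):
    """Return a list of collection ids forming a cycle, or None if acyclic.

    `members` maps a collection id to the ids of collections nested in it
    (a hypothetical representation, not the C2M2 encoding).
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for child in members.get(node, ()):
            status = state.get(child, unvisited)
            if status == in_progress:
                # The child is already on the current path: a cycle.
                return path[path.index(child):] + [child]
            if status == unvisited:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[node] = done
        return None

    for start in list(members):
        if state.get(start, unvisited) == unvisited:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

The depth-first search is recursive for brevity; very deep nesting would call for an iterative version.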

[Figure: Level 1 model diagram]

Level 1 technical specification: the file entity, revisited

added properties

Level 1 technical specification: introducing the bio_sample entity

added entity: list and define properties

Level 1 technical specification: introducing the subject entity

added entity: list and define properties

Level 1 technical specification: using the project table

describe the project table

Level 1 technical specification: using the collection table

describe the collection table

Level 1 technical specification: using association tables to encode inter-entity relationships

describe TSV encoding of bio_sample<->subject<->file<->bio_sample association pairs

Level 1 technical specification: using terms from controlled vocabularies: usage tables

enumerate CVs; describe usage tables and outline plan for addressing versioning; discuss parser script, to be executed somewhere in bdbag-preparation stage, which will inflate bare CV terms cited in entity fields into corresponding CV usage tables, loading term-decorator data from relevant CV OBO reference files

Level 1 metadata submission examples: schema and example TSVs

A JSON Schema document specifying the Level 1 TSV collection is here; an example Level-1-compliant TSV submission can be found here (as a collection of TSV files) and here (as a packaged BDBag archive).

C2M2 Specification with Levels Glossary

Biospecimen

A material collected from an organism, a cell culture, or a material containing organisms, such as an environmental material.

CFDE Asset Manifest

A collection of Assets described by the CFDE Asset Specification. The ecosystem will support the concept of a manifest that describes a collection of files. The manifests enable bundling lists of CFDE data assets into a machine-readable file using a common format. Manifests will also be used to publish the complete inventories of data from each DCC, and will enable uniform collection of asset metadata to support indexing of the assets in the CFDE portal.

CFDE Asset Specification

Defines the set of attributes used to characterize an Asset. The specification simplifies the discovery of assets hosted at the DCCs with a minimal set of descriptors for each of these files. The types of files that are referenced (e.g., genomic sequence, metagenomic, RNA-Seq, physiological and metabolic data) are flexible and contain a small number of essential elements such as a GUID, originating institution (e.g., Broad Institute), assay type (e.g., whole genome/exome, transcriptome, epigenome), file type (e.g., fastq, alignment, vcf, counts), and tissue source and species name for the sample.

C2M2

Crosscut Metadata Model. How we describe the interrelationships of metadata terms. It specifies both the metadata terms and how those terms are semantically related to all the other terms in the model. For example, we specify that each Biospecimen must come from a Subject.

Dataset

A collection of data, published or curated by a single agent, and available for access or download in one or more formats.

DCC

Data Coordinating/Resource Center.

Digital File Assets

Digital objects that each of the DCCs host, such as genomic sequence, metagenomic, RNA-Seq, physiological, descriptive, and metabolic data.

Entity Relationship Model

A way to describe the interrelationships of terms. It specifies both the term and how that term is semantically related to all the other terms in the model.

Event

A specific instance of data gathering for a specific patient, such as a particular surgery or appointment.

Metadata

A type of information entity usually defined as data about the data, understood as descriptors that give the context of a dataset. For example, metadata about a FASTQ file may be file size or file creator. Metadata is often classified into descriptive metadata, structural metadata, administrative metadata, and provenance metadata, all of which provide context to the actual data/dataset.

Metadata Ingest

Assigning identifiers to the objects and then extracting or creating metadata for these objects.

Richness Levels

Concentric, canonical subsets of C2M2 that are benchmarked at increasing levels of model complexity and detail, wherein each successive modeling level is a value-added superset of all of the metadata encompassed by the previous (less complex) level.

Subject

A study participant (human, animal) from which samples may be obtained.

C2M2 JSON Schema datapackage specs

  • C2M2_Level_0.datapackage.json
  • C2M2_Level_1.datapackage.json
  • full_C2M2_datapackage_spec.json

C2M2_ER_diagrams

Graphics depicting C2M2 entity-relationship models (or just an entity model, in the simplest case), illustrating metadata richness levels.

Full C2M2 ER Model

[Figure: Full model diagram]

Level 0 C2M2 ER Model

[Figure: Level 0 model diagram]

Level 1 C2M2 ER Model

[Figure: Level 1 model diagram]

Level 2 C2M2 ER Model

[Figure: Level 2 model diagram]