Prepare genomic data

Genomic data ingest consists of storing file metadata and pointers to genomic files within the htsget_app micro-service. Your CanDIG instance needs to be connected to the place where your genomic data is stored - CanDIG does not copy these files into its own storage.

A. Configure connection to Genomic Data source/s

Depending on where your genomic files are stored, you may need one or both of the configurations below:

Configuring s3 authorisation

Pre-requisites:

All files in s3-compatible storage bucket
Access and secret keys for the s3 storage bucket - it is recommended that these keys have read-only access
Be a site admin user and be able to get an authorization token

Configuring credentials

Add the s3 credentials to CanDIG by POSTing to the ingest/s3-credential endpoint with a JSON body following the template below:

{
  "endpoint": "<s3 compatible storage url>", # url of the s3 compatible storage
  "bucket": "the-name-of-the-bucket", # name of the bucket within that storage
  "access_key": "the-read-only-access-key",
  "secret_key": "the-S3-secret-key"
}

Template curl command:

curl --request POST \
  --url $CANDIG_URL'/ingest/s3-credential' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer '$TOKEN \
  -d '{"endpoint": "<s3 compatible storage url>", "bucket": "the-name-of-the-bucket", "access_key": "the-read-only-access-key", "secret_key": "the-S3-secret-key"}'

The Bearer $TOKEN must be obtained by a site admin user.

Once this configuration is completed, you can refer to files in the configured s3 bucket by their s3 url in your genomic json file.
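The same request can be prepared from a script. The following is a minimal sketch using only the Python standard library; the function name `build_s3_credential_request` and the example URL are hypothetical, while the endpoint path, headers, and body fields are taken from the curl template above. It only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_s3_credential_request(candig_url, token, endpoint, bucket, access_key, secret_key):
    """Build (but do not send) the POST request for the ingest/s3-credential endpoint."""
    body = {
        "endpoint": endpoint,      # url of the s3 compatible storage
        "bucket": bucket,          # name of the bucket within that storage
        "access_key": access_key,
        "secret_key": secret_key,
    }
    return urllib.request.Request(
        url=candig_url.rstrip("/") + "/ingest/s3-credential",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # token obtained by a site admin user
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_s3_credential_request(
    "https://candig.example.org", "my-token",
    "https://s3.example.org", "the-name-of-the-bucket",
    "the-read-only-access-key", "the-S3-secret-key",
)

# To send it (requires network access and a valid site admin token):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```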

Configuring an NFS mount

To configure an NFS mount you will need to:

Get permissions and assistance from system administrators to mount the volumes to the server running the CanDIG stack
Add the relevant volumes to the htsget_app docker-compose file
Ensure the relevant user inside the docker container has the right permissions to access the mounts and databases

The volumes section in the docker-compose file should look something like the following, which gives read-only access to each listed folder:

volumes:
  - htsget-data:/data
  - /source/folder/path/dir1:/data/dir1:ro
  - /source/folder/path/dir2:/data/dir2:ro
  - /source/folder/path/dir3:/data/dir3:ro

After making this change you will need to recompose the htsget container.

To test that the htsget_app instance can see the mount as local files, run:

docker exec -it candigv2_htsget_1 ls /data/dir1

If volumes are mounted this way and the files are inside the mounted directories, you will need to specify the access method for each file in the Genomic JSON file.

The paths will be something like (note the three slashes ///):

file:///data/dir1/my_bam_file.bam

Once you have confirmed the location of all genomic files, you can move on to creating the linking file, where these file paths will be used to populate the access_method fields in the Genomic JSON file.
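If you have a list of file paths as they appear on the host, translating them into container paths by hand is error-prone. The sketch below shows one possible way to do the translation; the function name `to_container_url` and the `MOUNTS` table are hypothetical and simply mirror the volumes example above.

```python
from pathlib import PurePosixPath

# Host folder -> folder inside the htsget container, copied from the
# volumes example above (assumed; adjust to your own compose file).
MOUNTS = {
    "/source/folder/path/dir1": "/data/dir1",
    "/source/folder/path/dir2": "/data/dir2",
    "/source/folder/path/dir3": "/data/dir3",
}


def to_container_url(host_path, mounts=MOUNTS):
    """Translate a path on the host into the file:/// URL seen by htsget."""
    host = PurePosixPath(host_path)
    for source, target in mounts.items():
        try:
            relative = host.relative_to(source)
        except ValueError:
            continue  # host_path is not under this mount, try the next one
        # "file://" plus an absolute path yields the three slashes.
        return "file://" + str(PurePosixPath(target) / relative)
    raise ValueError(f"{host_path} is not inside any mounted folder")


print(to_container_url("/source/folder/path/dir1/my_bam_file.bam"))
```

A path outside every mounted folder raises a `ValueError`, which surfaces files that htsget would not be able to see.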

B. Create Genomic JSON file

To link genomic files to the Sample Registration objects they were derived from, and to show htsget_app where the files are located, a JSON file must be created that specifies these relationships.

Metadata about each genomic file should be specified in a JSON file.

The file should contain an `experiments` array and an `analyses` array of dictionaries, where each item in `analyses` represents a single genomic file. Each dictionary specifies important information about the genomic file and how it links to the ingested clinical data. The structure is specified in the SequencingIngest openapi schema. An example file exists within the test files, and a commented example follows:

Example genomic json file

{
    "experiments": [
        {
            "program_id": "SYNTHETIC-2",      # The name of the program
            "experiment_id": "SEQ_0001",      # the name given by the sequencing centre
            "submitter_sample_id": "SAMPLE_REGISTRATION_1", # The submitter_sample_id to link to
            "metadata": {
                "library_strategy": "WGS"     # type of data sequenced (whole genome or whole transcriptome), allowed values: [WGS, WTS]
            }
        },
        {
            "program_id": "SYNTHETIC-2",
            "experiment_id": "SEQ_0002",
            "submitter_sample_id": "SAMPLE_REGISTRATION_4",
            "metadata": {
                "library_strategy": "WGS"
            }
        },
        {
            "program_id": "SYNTHETIC-2",
            "experiment_id": "SEQ_0003",
            "submitter_sample_id": "SPECIMEN_5",
            "metadata": {
                "library_strategy": "WGS"
            }
        }
    ],
    "analyses": [
        {   ## Example linking genomic and index files in s3 storage to a single sample
            "program_id": "SYNTHETIC-2",      # The name of the program
            "analysis_id": "HG00096.cnv",     # The identifier used to identify the genomic file, usually the filename, minus extensions
            "main": {                         # location and name of the main genomic file, bam/cram/vcf
                "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz?public=true",
                "name": "HG00096.cnv.vcf.gz"
            },
            "index": {                        # location and name of the index for the main genomic file, bai/crai/tbi
                "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi?public=true",
                "name": "HG00096.cnv.vcf.gz.tbi"
            },
            "metadata": {                     # Metadata about the file
                "analysis_type": "sequence_variation",   # type of data represented
                "analysis_attribute": {
                    "subtype": "mutation"
                },
                "reference": "hg37"           # which reference genome was used for alignment, allowed values: [hg37, hg38]
            },
            "samples": [                      # Linkage to one or more samples that the genomic file was derived from
                {
                    "analysis_sample_id": "HG00096",  # The name of the sample in the genomic file
                    "experiment_id": "SEQ_0001"       # The experiment ID to link to
                }
            ]
        },
        {  ## Example linking genomic and index files in local storage to multiple samples
            "program_id": "SYNTHETIC-2",
            "analysis_id": "multisample",
            "main": {
                "access_method": "file:///app/htsget_server/data/files/multisample_1.vcf.gz",
                "name": "multisample_1.vcf.gz"
            },
            "index": {
                "access_method": "file:///app/htsget_server/data/files/multisample_1.vcf.gz.tbi",
                "name": "multisample_1.vcf.gz.tbi"
            },
            "metadata": {
                "analysis_type": "sequence_variation",   # type of data represented
                "analysis_attribute": {
                    "subtype": "mutation"
                },
                "reference": "hg37"
            },
            "samples": [
                {
                    "analysis_sample_id": "TUMOR",
                    "experiment_id": "SEQ_0002"
                },
                {
                    "analysis_sample_id": "NORMAL",
                    "experiment_id": "SEQ_0003"
                }
            ]
        }
    ]
}
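Note that the `#` comments in the example are annotations only; a real file must omit them to be valid JSON. Since each sample under `analyses` points at an `experiment_id`, it can help to check these links before ingest. The sketch below is one possible check and assumes that every referenced experiment is declared in the same file; the function name `find_unlinked_samples` is hypothetical.

```python
def find_unlinked_samples(genomic):
    """Return (analysis_id, experiment_id) pairs whose experiment_id is not
    declared in the "experiments" array of the same document."""
    known = {exp["experiment_id"] for exp in genomic.get("experiments", [])}
    problems = []
    for analysis in genomic.get("analyses", []):
        for sample in analysis.get("samples", []):
            if sample["experiment_id"] not in known:
                problems.append((analysis["analysis_id"], sample["experiment_id"]))
    return problems


# Cut-down version of the example above, with one deliberately broken link.
genomic = {
    "experiments": [
        {"program_id": "SYNTHETIC-2", "experiment_id": "SEQ_0001",
         "submitter_sample_id": "SAMPLE_REGISTRATION_1",
         "metadata": {"library_strategy": "WGS"}},
    ],
    "analyses": [
        {"program_id": "SYNTHETIC-2", "analysis_id": "multisample",
         "samples": [
             {"analysis_sample_id": "TUMOR", "experiment_id": "SEQ_0001"},
             {"analysis_sample_id": "NORMAL", "experiment_id": "SEQ_0003"},
         ]},
    ],
}

print(find_unlinked_samples(genomic))
```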

Tips for creating the Genomic JSON file

analysis_id is the identifier for the genomic file, usually the filename minus extensions (e.g. HG00096.cnv for HG00096.cnv.vcf.gz)
Access methods can either be of the format s3://[endpoint]/[bucket name]/[path to file] or file:///[path relative to root on htsget container].
submitter_sample_id(s) are the (mandatory) links to the Sample Registration objects uploaded during clinical data ingest.
index is the file location and name of the index file; for instance a tabix (tbi) or cram index (crai)
If an S3 bucket access method is provided, assuming you have properly added the S3 credentials to vault (see above), the service will scan the S3 bucket to ensure the relevant files are present.
There is no validation that the genomic files exist locally or are mounted to htsget. If the local (file:///) method is used it is important to check all files are present before proceeding with ingest.
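Because local files are not validated, a small pre-ingest check can be useful. The sketch below is an illustration only: the function name `missing_local_files` is hypothetical, and it must be run somewhere the container paths are visible (for example inside the htsget container), since it tests the paths exactly as written in the file.

```python
from pathlib import Path
from urllib.parse import urlparse


def missing_local_files(genomic):
    """Return every file:// access_method in "analyses" that does not point
    at an existing file. s3:// access methods are skipped."""
    missing = []
    for analysis in genomic.get("analyses", []):
        for role in ("main", "index"):
            url = analysis.get(role, {}).get("access_method", "")
            parsed = urlparse(url)
            if parsed.scheme == "file" and not Path(parsed.path).is_file():
                missing.append(url)
    return missing


# Usage, assuming the Genomic JSON file has been saved as genomic.json:
# import json
# with open("genomic.json") as handle:
#     print(missing_local_files(json.load(handle)))
```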

Genomic file types

For each genomic data file that is referenced, there must also be the appropriate index file for the file type specified. The current files supported and expected values in the genomic linking JSON file are:

| File type | Expected file extension | Expected index file | Expected `analysis_type` |
|---|---|---|---|
| Variant Call Format (VCF) | `.vcf`, `.vcf.gz` | `.tbi`, `.tbi.gz` | `sequence_variation` |
| Binary Alignment Map (BAM) | `.bam` | `.bai` | `reference_alignment` |
| Compressed Reference-oriented Alignment Map (CRAM) | `.cram` | `.crai` | `reference_alignment` |
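The table above can be encoded as a lookup so that a script can fill in `analysis_type` and catch mismatched index files. This is a sketch only: the names `FILE_TYPES`, `expected_analysis_type`, and `index_matches` are hypothetical, and the values are copied from the table rather than from the htsget_app code.

```python
# main file extension -> (analysis_type, allowed index extensions)
FILE_TYPES = {
    ".vcf": ("sequence_variation", (".tbi", ".tbi.gz")),
    ".vcf.gz": ("sequence_variation", (".tbi", ".tbi.gz")),
    ".bam": ("reference_alignment", (".bai",)),
    ".cram": ("reference_alignment", (".crai",)),
}


def expected_analysis_type(main_name):
    """Return the analysis_type for a main file name, or None if unsupported."""
    for extension, (analysis_type, _) in FILE_TYPES.items():
        if main_name.endswith(extension):
            return analysis_type
    return None


def index_matches(main_name, index_name):
    """True if index_name has an index extension allowed for main_name."""
    for extension, (_, index_extensions) in FILE_TYPES.items():
        if main_name.endswith(extension):
            return index_name.endswith(index_extensions)
    return False


print(expected_analysis_type("HG00096.cnv.vcf.gz"))
print(index_matches("HG00096.cnv.vcf.gz", "HG00096.cnv.vcf.gz.tbi"))
```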

It is recommended to write a simple script to automate the creation of the Genomic JSON file. JSON validation of this file will occur before genomic data ingest proceeds. We are also working on tools to help create this file, so if you need help, reach out to the CanDIG team.
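As a starting point for such a script, the sketch below assembles the structure shown in the example file. The helper names `build_analysis` and `build_genomic_json` are hypothetical, the BAM paths are made up for illustration, and optional metadata such as `analysis_attribute` is left out, so check the result against the SequencingIngest schema before ingest.

```python
import json


def build_analysis(program_id, analysis_id, main, index, analysis_type, reference, samples):
    """Return one entry for the "analyses" array.

    main and index are (access_method, name) pairs; samples is a list of
    (analysis_sample_id, experiment_id) pairs.
    """
    return {
        "program_id": program_id,
        "analysis_id": analysis_id,
        "main": {"access_method": main[0], "name": main[1]},
        "index": {"access_method": index[0], "name": index[1]},
        "metadata": {"analysis_type": analysis_type, "reference": reference},
        "samples": [
            {"analysis_sample_id": sample_id, "experiment_id": experiment_id}
            for sample_id, experiment_id in samples
        ],
    }


def build_genomic_json(experiments, analyses):
    """Serialise the two arrays into the text of a Genomic JSON file."""
    return json.dumps({"experiments": experiments, "analyses": analyses}, indent=4)


experiments = [{
    "program_id": "SYNTHETIC-2",
    "experiment_id": "SEQ_0001",
    "submitter_sample_id": "SAMPLE_REGISTRATION_1",
    "metadata": {"library_strategy": "WGS"},
}]
analyses = [build_analysis(
    "SYNTHETIC-2", "my_bam_file",
    ("file:///data/dir1/my_bam_file.bam", "my_bam_file.bam"),
    ("file:///data/dir1/my_bam_file.bam.bai", "my_bam_file.bam.bai"),
    "reference_alignment", "hg38",
    [("SAMPLE_1", "SEQ_0001")],
)]
print(build_genomic_json(experiments, analyses))
```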