Prepare genomic data
Genomic data ingest consists of storing file metadata and pointers to genomic files within the htsget_app micro-service. Your CanDIG instance needs to be connected to the place where your genomic data is stored; CanDIG does not copy these files into its own internal storage.
A. Configure connection to Genomic Data source/s
Depending on where your genomic files are stored, your system may need one or both of the configurations described below:
Configuring s3 authorisation
Pre-requisites:
- All files stored in an s3-compatible storage bucket
- Access and secret keys for the s3 storage bucket (read-only access is recommended)
- A site admin user account that can obtain an authorization token
Configuring credentials
Add s3 credentials to CanDIG by POSTing to the ingest/s3-credential endpoint with a JSON body following the template below:
{
  "endpoint": "<s3 compatible storage url>",  # url of the s3 compatible storage
  "bucket": "the-name-of-the-bucket",  # name of the bucket within that storage
  "access_key": "the-read-only-access-key",
  "secret_key": "the-S3-secret-key"
}
Template curl command:
curl --request POST \
  --url $CANDIG_URL'/ingest/s3-credential' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer '$TOKEN \
  -d '{"endpoint": "<s3 compatible storage url>", "bucket": "the-name-of-the-bucket", "access_key": "the-read-only-access-key", "secret_key": "the-S3-secret-key"}'
The Bearer $TOKEN must be obtained by a site admin user.
Once this configuration is completed, you can refer to files in the configured s3 bucket by their s3 url in your genomic json file.
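For sites that script their setup, the curl template above can be mirrored in Python. The following is a minimal sketch using only the standard library; the function name `build_s3_credential_request` and the example values are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_s3_credential_request(candig_url, token, endpoint, bucket, access_key, secret_key):
    """Build (but do not send) the POST request for the ingest/s3-credential endpoint."""
    body = {
        "endpoint": endpoint,      # url of the s3 compatible storage
        "bucket": bucket,          # name of the bucket within that storage
        "access_key": access_key,  # ideally a read-only key
        "secret_key": secret_key,
    }
    return urllib.request.Request(
        url=candig_url.rstrip("/") + "/ingest/s3-credential",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # token obtained by a site admin user
        },
        method="POST",
    )


# Sending the request (not executed here):
# request = build_s3_credential_request(candig_url, token, endpoint, bucket, access_key, secret_key)
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and body before anything is sent.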
Configuring an NFS mount
To configure an NFS mount you will need to:
- Get permissions and assistance from system administrators to mount the volumes to the server running the CanDIG stack
- Add the relevant volumes to the htsget_app docker-compose file
- Ensure the relevant user inside the docker container has the right permissions to access the mounts and databases
The volumes section in the docker compose file should look something like the following, which gives read-only access to each listed folder:
volumes:
  - htsget-data:/data
  - /source/folder/path/dir1:/data/dir1:ro
  - /source/folder/path/dir2:/data/dir2:ro
  - /source/folder/path/dir3:/data/dir3:ro
After making this change you will need to recompose the htsget container.
To test that the htsget_app instance can see the mount as local files, run:
docker exec -it candigv2_htsget_1 ls /data/dir1
If volumes are mounted this way and the files are inside the provided directories, you will need to specify the access method for each file in the Genomic JSON file.
The paths will be something like the following (note the three slashes ///):
file:///data/dir1/my_bam_file.bam
Once you have confirmed the location of all genomic files, you can move on to creating the linking file, where these file paths are used to populate the access_methods in the genomic JSON file.
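When files sit on an NFS mount, the path on the host differs from the path htsget_app sees inside the container. A minimal sketch of that translation is shown here; the `VOLUME_MOUNTS` mapping and the function name `to_access_method` are hypothetical and simply mirror the example volumes section above, so they should be adjusted to match your own docker compose file.

```python
from pathlib import PurePosixPath

# Hypothetical mapping of host folders to container mount points,
# mirroring the example volumes section (adjust to your own compose file).
VOLUME_MOUNTS = {
    "/source/folder/path/dir1": "/data/dir1",
    "/source/folder/path/dir2": "/data/dir2",
    "/source/folder/path/dir3": "/data/dir3",
}


def to_access_method(host_path, mounts=VOLUME_MOUNTS):
    """Translate a host file path into a file:/// URL as seen inside the container."""
    path = PurePosixPath(host_path)
    for host_dir, container_dir in mounts.items():
        try:
            relative = path.relative_to(host_dir)
        except ValueError:
            continue  # not under this mount, try the next one
        # container_dir is absolute, so "file://" + "/data/..." yields three slashes
        return "file://" + str(PurePosixPath(container_dir) / relative)
    raise ValueError(f"{host_path} is not inside any mounted folder")


print(to_access_method("/source/folder/path/dir1/my_bam_file.bam"))
```

A path outside every mounted folder raises `ValueError`, which flags files the container would not be able to see.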
B. Create Genomic JSON file
To link genomic files to the Sample Registration objects they were derived from, and to show htsget_app where the files are located, a JSON file must be created that specifies these relationships.
Metadata about each genomic file should be specified in a JSON file.
The file should contain an array of dictionaries, where each item represents a single file. Each dictionary specifies important information about the genomic file and how it links to the ingested clinical data. The structure of this dictionary is specified in the ingest openapi schema. An example file exists within the test files, and a commented example follows:
Example genomic json file
[
  {
    ## Example linking to genomic and index files in s3 storage to a single sample
    "program_id": "SYNTHETIC-2",  # The name of the program
    "genomic_file_id": "HG00096.cnv",  # The identifier used to identify the genomic file, usually the filename, minus extensions
    "main": {  # location and name of the main genomic file, bam/cram/vcf
      "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz?public=true",
      "name": "HG00096.cnv.vcf.gz"
    },
    "index": {  # location and name of the index for the main genomic file, bai/crai/
      "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz?public=true",
      "name": "HG00096.cnv.vcf.gz.tbi"
    },
    "metadata": {  # Metadata about the file
      "sequence_type": "wgs",  # type of data sequenced (whole genome or whole transcriptome), allowed values: [wgs, wts]
      "data_type": "variant",  # type of data represented, allowed values: [variant, read]
      "reference": "hg37"  # which reference genome was used for alignment, allowed values: [hg37, hg38]
    },
    "samples": [  # Linkage to one or more samples that the genomic file was derived from
      {
        "genomic_file_sample_id": "HG00096",  # The name of the sample in the genomic file
        "submitter_sample_id": "SAMPLE_REGISTRATION_1"  # The submitter_sample_id to link to
      }
    ]
  },
  {
    ## Example linking genomic and index files in local storage to multiple samples
    "program_id": "SYNTHETIC-2",
    "genomic_file_id": "multisample",
    "main": {
      "access_method": "file:////app/htsget_server/data/files/multisample_1.vcf.gz",
      "name": "multisample_1.vcf.gz"
    },
    "index": {
      "access_method": "file:////app/htsget_server/data/files/multisample_1.vcf.gz.tbi",
      "name": "multisample_1.vcf.gz.tbi"
    },
    "metadata": {
      "sequence_type": "wgs",
      "data_type": "variant",
      "reference": "hg37"
    },
    "samples": [
      {
        "genomic_file_sample_id": "TUMOR",
        "submitter_sample_id": "SAMPLE_REGISTRATION_4"
      },
      {
        "genomic_file_sample_id": "NORMAL",
        "submitter_sample_id": "SPECIMEN_5"
      }
    ]
  }
]
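Before submitting the file, a quick local check can catch obvious mistakes. The sketch below is not the official validation: it treats every key shown in the example as required (an assumption) and uses the allowed values listed in the example comments. The ingest openapi schema remains the authority. The function names `check_entry` and `check_file` are hypothetical, and the file being checked must be plain JSON, without the explanatory `#` comments shown in the example.

```python
import json

# Assumption: every top-level key shown in the example is required.
REQUIRED_KEYS = {"program_id", "genomic_file_id", "main", "index", "metadata", "samples"}
# Allowed values as listed in the comments of the example file.
ALLOWED_METADATA = {
    "sequence_type": {"wgs", "wts"},
    "data_type": {"variant", "read"},
    "reference": {"hg37", "hg38"},
}


def check_entry(entry):
    """Return a list of human-readable problems found in one genomic file entry."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    for part in ("main", "index"):
        for field in ("access_method", "name"):
            if not entry.get(part, {}).get(field):
                problems.append(f"{part}.{field} is empty or missing")
    for field, allowed in ALLOWED_METADATA.items():
        value = entry.get("metadata", {}).get(field)
        if value not in allowed:
            problems.append(f"metadata.{field}={value!r} not in {sorted(allowed)}")
    samples = entry.get("samples") or []
    if not samples:
        problems.append("no samples listed")
    for sample in samples:
        for field in ("genomic_file_sample_id", "submitter_sample_id"):
            if not sample.get(field):
                problems.append(f"samples entry lacks {field}")
    return problems


def check_file(path):
    """Map each genomic_file_id in a Genomic JSON file to its list of problems."""
    with open(path) as handle:
        entries = json.load(handle)
    return {e.get("genomic_file_id", "<unknown>"): check_entry(e) for e in entries}
```

An empty list for an entry means no problems were found by this check; it does not guarantee that ingest will accept the entry.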
Tips for creating the Genomic JSON file
- genomic_file_id is the filename of the variation file (e.g. HG00096.vcf.gz, HG00096.bam)
- Access methods can either be of the format s3://[endpoint]/[bucket name] or file:///[directory relative to root on htsget container].
- submitter_sample_id(s) are the (mandatory) links to the Sample Registration objects uploaded during clinical data ingest.
- index is the file location and name of the index file; for instance a tabix (tbi) or cram index (crai)
- If an S3 bucket access method is provided, assuming you have properly added the S3 credentials to vault (see above), the service will scan the S3 bucket to ensure the relevant files are present.
- There is no validation that the genomic files exist locally or are mounted to htsget. If the local (file:///) method is used it is important to check all files are present before proceeding with ingest.
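Because local files are not validated, it can help to check them yourself before ingest. The following is a minimal sketch with hypothetical function names (`local_path`, `missing_local_files`). It assumes it is run somewhere the container paths resolve, for example inside the htsget container, and it skips any access method that does not use the file scheme.

```python
import os


def local_path(access_method):
    """Return the filesystem path for a file:// access method, or None for other schemes."""
    prefix = "file://"
    if not access_method.startswith(prefix):
        return None
    # Collapse extra leading slashes so file:///x and file:////x both map to /x.
    return "/" + access_method[len(prefix):].lstrip("/")


def missing_local_files(entries, exists=os.path.exists):
    """List the local main and index paths in the entries that do not exist."""
    missing = []
    for entry in entries:
        for part in ("main", "index"):
            path = local_path(entry[part]["access_method"])
            if path is not None and not exists(path):
                missing.append(path)
    return missing
```

The `exists` parameter defaults to `os.path.exists` and can be swapped for another callable, which keeps the check easy to try out without touching the real filesystem.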
Genomic file types
Each referenced genomic data file must have an index file appropriate to its file type. The currently supported file types and the values expected in the genomic linking JSON file are:
filetype | expected file extension | expected index file | expected data_type |
---|---|---|---|
Variant Call Format (VCF) | .vcf .vcf.gz | .tbi .tbi.gz | variant |
Binary Alignment Map (BAM) | .bam | .bai | read |
Compressed Reference-oriented Alignment Map (CRAM) | .cram | .crai | read |
It is recommended to write a simple script to automate the creation of the Genomic JSON file. JSON validation of this file will occur before genomic data ingest proceeds. We are also working on tools to help create this file, so if you need help, reach out to the CanDIG team.
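As a starting point for such a script, the sketch below builds one entry from a main file's access method. It rests on several assumptions that may not hold for your data: the index sits next to the main file and is named by appending the index extension, the genomic_file_id is the filename minus its extension, and the function name `build_entry` and its parameters are hypothetical. The extension-to-data_type mapping follows the table above.

```python
import json
import posixpath

# (file extension, index extension appended to the filename, data_type),
# following the supported file types table.
FILE_TYPES = [
    (".vcf.gz", ".tbi", "variant"),
    (".vcf", ".tbi", "variant"),
    (".bam", ".bai", "read"),
    (".cram", ".crai", "read"),
]


def build_entry(program_id, main_access_method, samples, sequence_type, reference):
    """Build one Genomic JSON entry; samples is a list of
    (genomic_file_sample_id, submitter_sample_id) pairs."""
    base, _, query = main_access_method.partition("?")
    name = posixpath.basename(base)
    for extension, index_extension, data_type in FILE_TYPES:
        if name.endswith(extension):
            break
    else:
        raise ValueError(f"unsupported file type: {name}")
    # Assumption: the index lives beside the main file with the index extension appended.
    index_access = base + index_extension + ("?" + query if query else "")
    return {
        "program_id": program_id,
        "genomic_file_id": name[: -len(extension)],
        "main": {"access_method": main_access_method, "name": name},
        "index": {"access_method": index_access, "name": name + index_extension},
        "metadata": {
            "sequence_type": sequence_type,
            "data_type": data_type,
            "reference": reference,
        },
        "samples": [
            {"genomic_file_sample_id": file_id, "submitter_sample_id": submitter_id}
            for file_id, submitter_id in samples
        ],
    }


entries = [
    build_entry(
        "SYNTHETIC-2",
        "file:///data/dir1/my_bam_file.bam",
        [("HG00096", "SAMPLE_REGISTRATION_1")],
        sequence_type="wgs",
        reference="hg38",
    )
]
print(json.dumps(entries, indent=2))
```

Writing the result with `json.dumps` produces plain JSON that can then go through the validation step that precedes genomic data ingest.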