Prepare genomic data
Genomic data ingest consists of storing file metadata and pointers to genomic files within the htsget_app micro-service. Your CanDIG instance needs to be connected to the location where your genomic data is stored; CanDIG does not copy these files into separate internal storage.
A. Configure connection to Genomic Data source(s)
Depending on where your genomic files are stored, you may need the information in one or both of the sections below:
Configuring s3 authorisation
Pre-requisites:
- All files are in an s3-compatible storage bucket
- Access and secret keys for the s3 storage bucket (read-only access is recommended)
- You are a site admin user and are able to get an authorization token
Configuring credentials
Add s3 credentials to CanDIG by POSTing to the ingest/s3-credential endpoint with a JSON body following the template below:
Template curl command:
The Bearer $TOKEN must be obtained by a site admin user.
Once this configuration is completed, you can refer to files in the configured s3 bucket by their s3 URL in your genomic JSON file.
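As a rough illustration of the credential POST described above, the following Python sketch builds the request without sending it. The JSON field names (`endpoint`, `bucket`, `access_key`, `secret_key`), the base URL, and the function name are assumptions made for illustration only; use the template from your CanDIG documentation for the real body.

```python
import json
import urllib.request


def build_s3_credential_request(candig_url, token, s3_endpoint, bucket,
                                access_key, secret_key):
    """Build (but do not send) a POST to the ingest/s3-credential endpoint.

    The body field names are hypothetical placeholders; check them against
    the official template before use.
    """
    body = {
        "endpoint": s3_endpoint,   # assumed field name
        "bucket": bucket,          # assumed field name
        "access_key": access_key,  # assumed field name
        "secret_key": secret_key,  # assumed field name
    }
    return urllib.request.Request(
        url=candig_url.rstrip("/") + "/ingest/s3-credential",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # The token must belong to a site admin user.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

To actually submit the credentials, the returned object could be passed to `urllib.request.urlopen`; keeping the build step separate makes it easy to inspect the request first.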
Configuring an NFS mount
To configure an NFS mount you will need to:
- Get permissions and assistance from system administrators to mount the volumes to the server running the CanDIG stack
- Add the relevant volumes to the htsget_app docker-compose file
- Ensure the relevant user inside the docker container has the right permissions to access the mounts and databases
The volumes section in the docker compose file should look something like the following, which gives read-only access to each listed folder:
After making this change you will need to recompose the htsget container.
To test that the htsget_app instance can see the mount as local files, run:
If volumes are mounted this way, and the files are inside the provided directory, you will need to specify the access methods for each file in the Genomic JSON file.
The paths will be something like (note the three slashes ///):
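A minimal Python sketch of where the three slashes come from: a `file://` scheme with an empty host, followed by an absolute path that itself starts with `/`. The helper name and the example path are hypothetical; the path must be the one visible inside the htsget container, not on the host.

```python
from urllib.parse import quote


def file_access_method(container_path):
    """Turn an absolute in-container path into a file:/// style URL."""
    if not container_path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + container_path)
    # "file://" + empty host + "/absolute/path" yields three slashes in a row.
    return "file://" + quote(container_path)


# Hypothetical mounted location, for illustration only.
print(file_access_method("/data/genomic/sample1.bam"))
```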
Once you have confirmed the location of all genomic files, you can move on to creating the linking file, where these file paths will be used to populate the access_methods in the genomic JSON file.
B. Create Genomic JSON file
To link genomic files to the Sample registration objects they were derived from, and to show htsget_app where the files are located, a JSON file must be created that specifies these relationships. The ingest README provides instructions and examples to demonstrate how the file should be structured.
Each genomic data file that is referenced must also have the appropriate index file for its file type. The currently supported file types and the expected values in the genomic linking JSON file are:
filetype | expected file extension | expected index file | expected data_type |
---|---|---|---|
Variant Call Format (VCF) | .vcf .vcf.gz | .tbi .tbi.gz | variant |
Binary Alignment Map (BAM) | .bam | .bai | read |
Compressed Reference-oriented Alignment Map (CRAM) | .cram | .crai | read |
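The table above can be encoded as a small lookup to check data/index pairs before building the linking file. This is only a sketch: the function and constant names are made up for illustration, and the extensions are copied from the table as written.

```python
# (data file extensions, index file extensions, data_type), from the table.
SUPPORTED_TYPES = [
    ((".vcf", ".vcf.gz"), (".tbi", ".tbi.gz"), "variant"),
    ((".bam",), (".bai",), "read"),
    ((".cram",), (".crai",), "read"),
]


def expected_data_type(data_file, index_file):
    """Return the data_type for a data/index pair, or raise ValueError."""
    data_name = data_file.lower()
    index_name = index_file.lower()
    for data_exts, index_exts, data_type in SUPPORTED_TYPES:
        if data_name.endswith(data_exts):
            if not index_name.endswith(index_exts):
                raise ValueError(
                    f"{index_file} is not an expected index for {data_file}"
                )
            return data_type
    raise ValueError(f"unsupported genomic file type: {data_file}")
```

For example, `expected_data_type("s1.bam", "s1.bam.bai")` returns `"read"`, while pairing a `.bam` file with a `.tbi` index raises `ValueError`.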
It is recommended to write a simple script to automate the creation of the Genomic JSON file. JSON validation of this file will occur before genomic data ingest proceeds. We are also working on tools to help create this file, so if you need help, reach out to the CanDIG team.
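A script of the kind suggested above might look like the following sketch. Only `access_method` and `data_type` are names taken from this page; every other key (`genomic_file_id`, `main`, `index`, `name`, `metadata`, `samples`, `submitter_sample_id`) and both function names are hypothetical placeholders, so follow the ingest README for the actual structure.

```python
import json


def build_entry(file_id, main_url, index_url, data_type, sample_id):
    """Build one linking record; key names are illustrative assumptions."""
    return {
        "genomic_file_id": file_id,  # hypothetical key
        "main": {
            "access_method": main_url,
            "name": main_url.rsplit("/", 1)[-1],
        },
        "index": {
            "access_method": index_url,
            "name": index_url.rsplit("/", 1)[-1],
        },
        "metadata": {"data_type": data_type},
        "samples": [{"submitter_sample_id": sample_id}],  # hypothetical key
    }


def build_linking_json(rows):
    """rows: iterable of (file_id, main_url, index_url, data_type, sample_id)."""
    return json.dumps([build_entry(*row) for row in rows], indent=4)


# Made-up example values for illustration.
rows = [
    ("sample1", "file:///data/sample1.bam", "file:///data/sample1.bam.bai",
     "read", "SAMPLE_1"),
]
print(build_linking_json(rows))
```

Generating the file from a table of rows keeps the file paths, index paths, and sample identifiers in one place, which makes it easier to regenerate the JSON if validation fails.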