Ingest genomic data

Once you have a genomic JSON file that is ready for ingest, POST it as the request body to the /ingest/ingest endpoint.

The format of the ingest request is specified by the SequencingIngest schema in candigv2-ingest. The schema specifies two types of objects:

  • Experiments link the sequencing centre’s names with the MoHCCN clinical data’s Sample Registration identifiers
  • Analyses describe the particular downstream analysis performed, along with the files associated with the analysis.

For example, a request that ingests a sequence variation analysis looks like the one below. It links the clinical Sample Registration sample_registration_id_1 with a sequencing centre Experiment, and associates the analysis with pointers to a gzipped VCF file and its tabix index file, both located in S3 storage:

{
  "experiments": [
    {
      "program_id": "string",
      "experiment_id": "SEQ_0001",
      "submitter_sample_id": "LOCAL-sample_registration_id_1",
      "metadata": {
        "library_strategy": "WGS"
      }
    }
  ],
  "analyses": [
    {
      "program_id": "string",
      "analysis_id": "HG00096",
      "metadata": {
        "analysis_type": "sequence_variation",
        "reference": "hg38"
      },
      "main": {
        "name": "string",
        "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/HG00096.vcf.gz"
      },
      "index": {
        "name": "string",
        "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/HG00096.tbi"
      },
      "samples": [
        {
          "experiment_id": "SEQ_0001",
          "analysis_sample_id": "HG00096"
        }
      ]
    }
  ]
}
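
The cross-references inside a request like this can be checked locally before it is sent. The following is a minimal sketch, not part of candigv2-ingest: `find_unlinked_samples` is a hypothetical helper that assumes only the field names shown in the example above, and reports analysis samples whose experiment_id is not declared in the same request. An experiment may have been ingested in an earlier request, so treat any result as a warning rather than a definite error.

```python
import json


def find_unlinked_samples(request_body: dict) -> list[tuple[str, str]]:
    """Return (analysis_id, experiment_id) pairs whose experiment_id is not
    declared in the request's own "experiments" list.

    Hypothetical pre-flight check; the server performs its own validation.
    """
    declared = {exp["experiment_id"] for exp in request_body.get("experiments", [])}
    missing = []
    for analysis in request_body.get("analyses", []):
        for sample in analysis.get("samples", []):
            if sample["experiment_id"] not in declared:
                missing.append((analysis["analysis_id"], sample["experiment_id"]))
    return missing


# Trimmed version of the example request shown above.
example = json.loads("""
{
  "experiments": [
    {"program_id": "string", "experiment_id": "SEQ_0001",
     "submitter_sample_id": "LOCAL-sample_registration_id_1"}
  ],
  "analyses": [
    {"program_id": "string", "analysis_id": "HG00096",
     "samples": [{"experiment_id": "SEQ_0001", "analysis_sample_id": "HG00096"}]}
  ]
}
""")
```

Calling `find_unlinked_samples(example)` returns an empty list because the only sample refers to SEQ_0001, which the request declares.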

Then send a POST request that specifies the path to the file containing this genomic file information, for example:

Terminal window
curl -X 'POST' \
  "$CANDIG_URL/ingest/ingest" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d '@/absolute/path/to/genomic.json'
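
For scripted ingests, the same request can be built with the Python standard library. This is a sketch under assumptions: the function names `build_ingest_request` and `send` are illustrative, and only the endpoint path and headers are taken from the curl command above.

```python
import json
import urllib.request


def build_ingest_request(candig_url: str, token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request to /ingest/ingest."""
    return urllib.request.Request(
        url=candig_url.rstrip("/") + "/ingest/ingest",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes it possible to inspect the URL, headers, and body before anything is sent.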

The POST request should return a queue ID that can be used to check the status of the ingest.

Terminal window
"queue_id": "bd36048e-8661-11ef-99d4-0242ac12000f",

You can check the status of ingest by using the ingest status endpoint:

Terminal window
curl -X 'GET' \
  "$CANDIG_URL/ingest/status/<your_queue_id>" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $TOKEN"

While ingest is processing, you will see a message "status": "still in queue".

If you get a message such as "no such sample": "sample SAMPLE_0600 does not exist in clinical data", this means the sample that the file links to cannot be found in the currently ingested clinical data. Please ensure you have ingested all samples referred to in the genomic JSON file before attempting ingest.
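
Checking the status repeatedly can be automated with a small polling loop. The sketch below assumes only what the text states, namely that a pending ingest reports "status": "still in queue"; the shape of the final response is not assumed, so it is returned unchanged. `fetch_status` is a hypothetical callable that performs the GET request to the status endpoint and returns the decoded JSON.

```python
import time
from typing import Callable


def wait_for_ingest(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the response no longer reports "still in queue".

    Returns the first response that is not the queued message, which may be
    a success result or an error such as "no such sample".
    """
    for _ in range(max_polls):
        result = fetch_status()
        if result.get("status") != "still in queue":
            return result
        sleep(poll_seconds)
    raise TimeoutError("ingest was still in queue when the polling limit was reached")
```

The `sleep` parameter exists so the loop can be exercised without real delays; the interval and limit defaults are arbitrary illustrative values.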

Ingesting the linkage between the clinical data and the genomic/transcriptomic files completes fairly quickly (~7K files in about 30 minutes). However, to support searching, variant files must also be indexed. Indexing creates hundreds of millions of database rows and is a fairly slow process (~7K files in about 6 days, or roughly 1K files per day).
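
The indexing rate quoted above gives a rough way to plan for larger or smaller batches. The helper below is a back-of-the-envelope sketch: `estimate_indexing_days` is a hypothetical name, the default rate is simply 7,000 files divided by 6 days, and actual throughput will depend on file sizes and hardware.

```python
def estimate_indexing_days(n_variant_files: int, files_per_day: float = 7000 / 6) -> float:
    """Estimate indexing time in days from the rate quoted in the text.

    The default of roughly 1,167 files per day is an assumption derived from
    "~7K files in about 6 days"; it is not a measured guarantee.
    """
    if n_variant_files < 0:
        raise ValueError("n_variant_files must be non-negative")
    if files_per_day <= 0:
        raise ValueError("files_per_day must be positive")
    return n_variant_files / files_per_day
```

Pass a different `files_per_day` once you have observed the indexing rate on your own deployment.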

There are several ways you can monitor the indexing queue and ensure that indexing is proceeding as expected.

1. Checking the htsget logs

When you are logged into the virtual machine where the CanDIG stack is running, you can inspect the htsget logs with the following command:

Terminal window
docker logs candigv2_htsget_1 --since 30m

You should be able to find messages relating to indexing that are similar to the lines below:

Terminal window
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test starting indexing'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test indexed 71 headers'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test indexed 2 samples in file'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test writing 6 entries to db'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test indexing done'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': {'message': 'Indexing complete for variantfile local-test'}}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'indexing local-SYNTH_01~local-multisample_2, 4 files left in indexing queue. For full list of files to index, run: `docker exec candigv2_htsget_1 ls /home/candig/tmp/indexer`'}
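
The last log line above reports how many files remain in the indexing queue, so it can be parsed to track progress over time. The following sketch assumes the message wording stays exactly as in the sample line; `parse_queue_progress` is a hypothetical helper, and the pattern would need updating if the log format changes.

```python
import re
from typing import Optional

# Matches messages such as:
#   'indexing <name>, 4 files left in indexing queue.'
QUEUE_PATTERN = re.compile(
    r"indexing (?P<name>\S+), (?P<left>\d+) files left in indexing queue"
)


def parse_queue_progress(log_line: str) -> Optional[tuple[str, int]]:
    """Return (file currently being indexed, files left), or None if the
    line is not a queue progress message."""
    match = QUEUE_PATTERN.search(log_line)
    if match is None:
        return None
    return match.group("name"), int(match.group("left"))
```

Lines such as "starting indexing" or "indexing done" do not match the pattern and yield None, so the function can be applied to every line of the log output.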

2. Checking the file queue in the htsget container

The indexer works through a queue of text files that are stored in the htsget container. If this folder is empty, it either means that indexing is complete, or something went wrong and no files were added to the queue.

To check the queue, run the command:

Terminal window
docker exec candigv2_htsget_1 ls /home/candig/tmp/indexer

Or, to simply count the files in the folder:

Terminal window
docker exec candigv2_htsget_1 ls /home/candig/tmp/indexer | wc -l
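
The same check can be wrapped in a script, for example to record the queue length at intervals. This is a sketch only: `list_index_queue` is a hypothetical helper, the container name comes from the commands above, and the absolute queue path is the one printed in the htsget log message. It must run on the machine where the CanDIG stack is running.

```python
import subprocess


def list_index_queue(
    container: str = "candigv2_htsget_1",
    queue_dir: str = "/home/candig/tmp/indexer",
    run=subprocess.run,
) -> list[str]:
    """Return the names of the files waiting in the indexing queue.

    Runs `docker exec <container> ls <queue_dir>` and splits the output
    into one entry per non-empty line.
    """
    completed = run(
        ["docker", "exec", container, "ls", queue_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in completed.stdout.splitlines() if line.strip()]
```

`len(list_index_queue())` gives the same number as the `wc -l` variant. As noted above, an empty result does not distinguish between finished indexing and files that never reached the queue.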

3. Checking indexing status by program

You can check the overall progress of all files on a program-by-program basis using the following call:

Terminal window
curl --request GET \
  --url "$CANDIG_URL/drs/ga4gh/drs/v1/programs/$PROGRAM_ID/status" \
  --header "Authorization: Bearer $TOKEN"

You will get back a response like the following:

Terminal window
curl --request GET \
  --url http://candig.docker.internal:5080/drs/ga4gh/drs/v1/programs/local-SYNTH_01/status \
  --header 'Authorization: Bearer TOKEN'
{
  "index_complete": [
    "drs://candig.docker.internal:5080/drs/local-test",
    "drs://candig.docker.internal:5080/drs/local-multisample_2",
    "drs://candig.docker.internal:5080/drs/local-HG02102-all"
  ],
  "index_errored": [],
  "index_in_progress": []
}
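
A response of this shape is easy to summarise in a script. The sketch below assumes only the three keys shown in the example response; `count_by_status` and `indexing_settled` are hypothetical helper names. Whether files that are still waiting in the queue appear in any of these lists is not stated here, so an empty "index_in_progress" list should be read together with the queue checks above.

```python
STATUS_KEYS = ("index_complete", "index_errored", "index_in_progress")


def count_by_status(status: dict) -> dict[str, int]:
    """Count the files listed under each status key (missing keys count as 0)."""
    return {key: len(status.get(key, [])) for key in STATUS_KEYS}


def indexing_settled(status: dict) -> bool:
    """True when the response lists no files as in progress."""
    return not status.get("index_in_progress")


# The example response shown above.
example_status = {
    "index_complete": [
        "drs://candig.docker.internal:5080/drs/local-test",
        "drs://candig.docker.internal:5080/drs/local-multisample_2",
        "drs://candig.docker.internal:5080/drs/local-HG02102-all",
    ],
    "index_errored": [],
    "index_in_progress": [],
}
```

For the example response, `count_by_status(example_status)` reports three completed files and none errored or in progress.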