Ingest genomic data
Now that you have a genomic JSON file that is ready for ingest, you can POST it as the request body to the /ingest/ingest endpoint.
The format of the ingest request is specified by the SequencingIngest schema in candigv2-ingest. The schema specifies two types of objects:
- Experiments link the sequencing centre’s names with the MoHCCN clinical data’s Sample Registration identifiers.
- Analyses describe the particular downstream analysis performed, along with the files associated with the analysis.
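The relationship between the two object types can be sketched in Python. This is a minimal illustration, not part of candigv2-ingest: the field names are copied from the example request in this section, the helper `unlinked_samples` is hypothetical, and the rule that every analysis sample should point at an experiment declared in the same request is an assumption drawn from that example rather than from the schema itself.

```python
def unlinked_samples(payload):
    """Return (analysis_id, experiment_id) pairs whose experiment_id is not
    declared in the payload's "experiments" list."""
    known = {exp["experiment_id"] for exp in payload.get("experiments", [])}
    missing = []
    for analysis in payload.get("analyses", []):
        for sample in analysis.get("samples", []):
            if sample["experiment_id"] not in known:
                missing.append((analysis["analysis_id"], sample["experiment_id"]))
    return missing


# Trimmed-down version of the example request (metadata and file pointers omitted).
payload = {
    "experiments": [
        {
            "program_id": "string",
            "experiment_id": "SEQ_0001",
            "submitter_sample_id": "LOCAL-sample_registration_id_1",
        }
    ],
    "analyses": [
        {
            "program_id": "string",
            "analysis_id": "HG00096",
            "samples": [
                {"experiment_id": "SEQ_0001", "analysis_sample_id": "HG00096"}
            ],
        }
    ],
}

print(unlinked_samples(payload))  # prints []
```

Running a check like this before sending the request may catch typos in `experiment_id` values early, although the ingest service remains the authority on what is valid.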
For example, a request that ingests a sequence variation analysis looks like the one below. It links the clinical Sample Registration sample_registration_id_1 with a sequencing centre Experiment, and associates the analysis with pointers to a gzipped VCF file and its tabix index file, both located in S3 storage:
{
  "experiments": [
    {
      "program_id": "string",
      "experiment_id": "SEQ_0001",
      "submitter_sample_id": "LOCAL-sample_registration_id_1",
      "metadata": {
        "library_strategy": "WGS"
      }
    }
  ],
  "analyses": [
    {
      "program_id": "string",
      "analysis_id": "HG00096",
      "metadata": {
        "analysis_type": "sequence_variation",
        "reference": "hg38"
      },
      "main": {
        "name": "string",
        "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/HG00096.vcf.gz"
      },
      "index": {
        "name": "string",
        "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/HG00096.tbi"
      },
      "samples": [
        {
          "experiment_id": "SEQ_0001",
          "analysis_sample_id": "HG00096"
        }
      ]
    }
  ]
}

You then send a POST request, specifying the path to a file that contains the genomic file information, e.g.
curl -X 'POST' \
  $CANDIG_URL'/ingest/ingest' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer '$TOKEN \
  -d '@/absolute/path/to/genomic.json'

The POST request should return a queue ID that can be used to check the status of the ingest:

"queue_id": "bd36048e-8661-11ef-99d4-0242ac12000f",

You can check the status of the ingest by using the ingest status endpoint:
curl -X 'GET' \
  $CANDIG_URL'/ingest/status/<your_queue_id>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer '$TOKEN

While ingest is processing, you will see a message "status": "still in queue".
If you get a message such as "no such sample": "sample SAMPLE_0600 does not exist in clinical data", this means the sample that the file links to cannot be found in the currently ingested clinical data. Please ensure you have ingested all samples referred to in the genomic JSON file before attempting ingest.
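Checking the status endpoint repeatedly can be automated. The following Python sketch polls until the reply is no longer the queued message. It rests on several assumptions: that the status endpoint returns a JSON object, that the queued reply contains exactly "status": "still in queue" as quoted above, and that a 30-second interval is reasonable for your deployment. The function names `fetch_status` and `wait_for_ingest` are hypothetical and not part of CanDIG.

```python
import json
import time
import urllib.request


def fetch_status(candig_url, queue_id, token):
    """GET <candig_url>/ingest/status/<queue_id> and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{candig_url}/ingest/status/{queue_id}",
        headers={"accept": "application/json", "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_ingest(fetch, interval_seconds=30, max_polls=120):
    """Call fetch() until the reply is no longer the queued message.

    Returns the first reply that is not queued; raises TimeoutError if the
    job is still queued after max_polls attempts.
    """
    for _ in range(max_polls):
        reply = fetch()
        # Assumption: a queued job replies with {"status": "still in queue"}.
        queued = isinstance(reply, dict) and reply.get("status") == "still in queue"
        if not queued:
            return reply
        time.sleep(interval_seconds)
    raise TimeoutError("ingest was still queued after the last poll")


# Example call (needs a running CanDIG stack, so it is left commented out):
# result = wait_for_ingest(lambda: fetch_status(candig_url, queue_id, token))
```

Passing the fetch step in as a callable keeps the polling loop independent of the HTTP client, so it can be exercised with canned replies. The returned reply still needs to be inspected for errors such as the "no such sample" message.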
Checking file indexing progress
Ingest of the linkage between the clinical data and the genomic/transcriptomic files completes fairly quickly (~7K files in about 30 minutes). However, to facilitate searching, variant files also need to be indexed. Indexing creates hundreds of millions of database rows and is a fairly slow process (~7K files in about 6 days, or roughly 1K files per day).
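As a rough planning aid, the throughput figures above can be turned into a time estimate. The sketch below assumes that indexing time scales linearly with the number of files, which the quoted figures suggest but do not guarantee; the function name and the default rate are illustrative only.

```python
def estimate_indexing_days(file_count, files_per_day=1000):
    """Estimate indexing time in days, assuming a constant indexing rate."""
    if files_per_day <= 0:
        raise ValueError("files_per_day must be positive")
    return file_count / files_per_day


# At the rounded rate of 1K files per day:
print(estimate_indexing_days(7000))  # 7.0

# At the rate implied by "7K files in about 6 days":
print(round(estimate_indexing_days(7000, 7000 / 6), 1))  # 6.0
```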
There are several ways you can monitor the indexing queue and ensure that indexing is proceeding as expected.
1. Looking at the htsget logs
While logged into the virtual machine where the CanDIG stack is running, you can inspect the htsget logs with the following command:
docker logs candigv2_htsget_1 --since 30m

You should be able to find messages relating to indexing that are similar to the lines below:
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test starting indexing'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test indexed 71 headers'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test indexed 2 samples in file'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test writing 6 entries to db'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'local-test indexing done'}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': {'message': 'Indexing complete for variantfile local-test'}}
level: INFO, file: /app/htsget_server/htsget_server/indexing.py, log: {'message': 'indexing local-SYNTH_01~local-multisample_2, 4 files left in indexing queue. For full list of files to index, run: `docker exec candigv2_htsget_1 ls /home/candig/tmp/indexer`'}

2. Checking the file queue in the htsget container
The indexer works through a queue of text files that are stored in the htsget container. If this folder is empty, it either means that indexing is complete, or something went wrong and no files were added to the queue.
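One way to tell these two cases apart is to look back at the htsget log messages from the previous subsection. The Python sketch below is an illustration only: it assumes the log wording matches the sample lines shown there (which may differ between htsget versions), and `summarize_indexing_log` is a hypothetical helper, not part of CanDIG.

```python
import re

# Patterns based on the sample log lines in this guide; treat them as assumptions.
COMPLETE = re.compile(r"Indexing complete for variantfile ([^\s']+)")
QUEUE_LEFT = re.compile(r"(\d+) files left in indexing queue")


def summarize_indexing_log(log_text):
    """Return (ids of files reported complete, last reported queue length or None)."""
    completed = COMPLETE.findall(log_text)
    queue_counts = QUEUE_LEFT.findall(log_text)
    remaining = int(queue_counts[-1]) if queue_counts else None
    return completed, remaining


sample = (
    "level: INFO, file: /app/htsget_server/htsget_server/indexing.py, "
    "log: {'message': {'message': 'Indexing complete for variantfile local-test'}}\n"
    "level: INFO, file: /app/htsget_server/htsget_server/indexing.py, "
    "log: {'message': 'indexing local-SYNTH_01~local-multisample_2, "
    "4 files left in indexing queue.'}\n"
)

print(summarize_indexing_log(sample))  # (['local-test'], 4)
```

If the queue folder is empty and the recent logs contain no completion messages at all, that may point to files never having been queued rather than to indexing having finished.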
To check the queue, run the command:
docker exec candigv2_htsget_1 ls ../../home/candig/tmp/indexer

or to simply count the files in the folder:
docker exec candigv2_htsget_1 ls ../../home/candig/tmp/indexer | wc -l

3. Checking a program’s indexing status
You can check the overall progress of all files on a program-by-program basis using the following call:
curl --request GET \
  --url $CANDIG_URL'/drs/ga4gh/drs/v1/programs/'$PROGRAM_ID'/status' \
  --header 'Authorization: Bearer '$TOKEN

And you will get a response back like:
curl --request GET \
  --url http://candig.docker.internal:5080/drs/ga4gh/drs/v1/programs/local-SYNTH_01/status \
  --header 'Authorization: Bearer TOKEN'

{
  "index_complete": [
    "drs://candig.docker.internal:5080/drs/local-test",
    "drs://candig.docker.internal:5080/drs/local-multisample_2",
    "drs://candig.docker.internal:5080/drs/local-HG02102-all"
  ],
  "index_errored": [],
  "index_in_progress": []
}
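A response like this can be condensed into per-state counts. The Python sketch below assumes the response always uses the three keys shown in the example. The helper name `summarize_program_status` and the `all_clear` flag are hypothetical, and treating "nothing errored and nothing in progress" as done is an assumption, since files still waiting in the queue may not appear in this response.

```python
def summarize_program_status(status):
    """Count DRS objects per indexing state and flag whether nothing is pending or errored."""
    counts = {
        key: len(status.get(key, []))
        for key in ("index_complete", "index_errored", "index_in_progress")
    }
    # Assumption: no errors and nothing in progress means indexing has settled.
    counts["all_clear"] = (
        counts["index_errored"] == 0 and counts["index_in_progress"] == 0
    )
    return counts


response = {
    "index_complete": [
        "drs://candig.docker.internal:5080/drs/local-test",
        "drs://candig.docker.internal:5080/drs/local-multisample_2",
        "drs://candig.docker.internal:5080/drs/local-HG02102-all",
    ],
    "index_errored": [],
    "index_in_progress": [],
}

print(summarize_program_status(response))
# {'index_complete': 3, 'index_errored': 0, 'index_in_progress': 0, 'all_clear': True}
```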