Ingest genomic data

The preferred way to ingest genomic data is to use the API: POST the genomic JSON as the request body to the /ingest/ingest endpoint.

The format of the ingest request is specified by the SequencingIngest schema in candigv2-ingest. The schema specifies two types of objects:

Experiments link the sequencing centre’s names with the Sample Registration identifiers in the MoHCCN clinical data.
Analyses describe the particular downstream analysis performed, along with the files associated with the analysis.

For example, the request below ingests a sequence variation analysis. It links the clinical Sample Registration sample_registration_id_1 with a sequencing centre Experiment, and associates the analysis with pointers to a gzipped VCF file and its tabix index file, both located in S3 storage:

{
  "experiments": [
    {
      "program_id": "string",
      "experiment_id": "SEQ_0001",
      "submitter_sample_id": "LOCAL-sample_registration_id_1",
      "metadata": {
        "library_strategy": "WGS"
      }
    }
  ],
  "analyses": [
    {
      "program_id": "string",
      "analysis_id": "HG00096",
      "metadata": {
        "analysis_type": "sequence_variation",
        "reference": "hg38"
      },
      "main": {
        "name": "string",
        "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/HG00096.vcf.gz"
      },
      "index": {
        "name": "string",
        "access_method": "s3://s3.us-east-1.amazonaws.com/1000genomes/HG00096.tbi"
      },
      "samples": [
        {
          "experiment_id": "SEQ_0001",
          "analysis_sample_id": "HG00096"
        }
      ]
    }
  ]
}
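Each entry under samples points at an experiment by its experiment_id, so a quick local consistency check before submitting can catch typos. The following is a minimal Python sketch of such a check, not the SequencingIngest schema validation itself. The function name unlinked_samples is hypothetical, and the sketch assumes the referenced experiments are defined in the same request; experiments ingested in an earlier request would be reported as unlinked.

```python
import json


def unlinked_samples(body):
    """Return (analysis_id, experiment_id) pairs whose experiment_id
    is not defined under "experiments" in the same request body."""
    known = {exp["experiment_id"] for exp in body.get("experiments", [])}
    missing = []
    for analysis in body.get("analyses", []):
        for sample in analysis.get("samples", []):
            if sample["experiment_id"] not in known:
                missing.append((analysis["analysis_id"], sample["experiment_id"]))
    return missing


# Trimmed-down version of the example request, keeping only the linking fields.
example = json.loads("""
{
  "experiments": [
    {"program_id": "string", "experiment_id": "SEQ_0001",
     "submitter_sample_id": "LOCAL-sample_registration_id_1"}
  ],
  "analyses": [
    {"program_id": "string", "analysis_id": "HG00096",
     "samples": [{"experiment_id": "SEQ_0001", "analysis_sample_id": "HG00096"}]}
  ]
}
""")

print(unlinked_samples(example))  # prints []
```

An empty list means every analysis sample refers to an experiment in the request; any pair returned names the analysis and the experiment_id that could not be matched.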

Then send a POST request that points to a file containing this genomic JSON, for example:

curl -X 'POST' \
  $CANDIG_URL'/ingest/ingest' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer '$TOKEN \
  -d '@/absolute/path/to/genomic.json'
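If you would rather submit from a script, the same request can be sketched in Python using only the standard library. This is an illustration under assumptions: build_ingest_request and send are hypothetical helper names, and candig_url and token stand for the same values as $CANDIG_URL and $TOKEN in the curl command.

```python
import json
import urllib.request


def build_ingest_request(candig_url, token, json_path):
    """Build (but do not send) the POST request to /ingest/ingest."""
    with open(json_path, "rb") as handle:
        payload = handle.read()
    json.loads(payload)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        candig_url.rstrip("/") + "/ingest/ingest",
        data=payload,
        method="POST",
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a reachable CanDIG instance and a valid token):
# request = build_ingest_request(candig_url, token, "/absolute/path/to/genomic.json")
# print(send(request))
```

Keeping the request construction separate from sending makes it possible to inspect the URL, headers, and body before anything goes over the network.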

The POST request should return a queue ID that can be used to check the status of the ingest:

"queue_id": "bd36048e-8661-11ef-99d4-0242ac12000f",

You can check the status of ingest by using the ingest status endpoint:

curl -X 'GET' \
  $CANDIG_URL'/ingest/status/<your_queue_id>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer '$TOKEN

While ingest is processing, you will see a message "status": "still in queue".
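Checking the status repeatedly can be automated with a small polling loop. The sketch below is one possible approach, not part of the CanDIG tooling: wait_for_ingest is a hypothetical helper, the interval and attempt limits are arbitrary, and it assumes the status response is a JSON object whose "status" field reads "still in queue" while processing. The HTTP call is passed in as a function so the loop can be tried without a server; the final response in the stand-in data is made up for illustration.

```python
import time


def wait_for_ingest(fetch_status, interval_seconds=10, max_attempts=30):
    """Call fetch_status() until it stops reporting "still in queue".

    fetch_status is any callable returning the decoded JSON body of
    GET /ingest/status/<your_queue_id> as a dict.
    """
    for attempt in range(max_attempts):
        result = fetch_status()
        if result.get("status") != "still in queue":
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError("ingest still in queue after %d checks" % max_attempts)


# Stand-in for the real HTTP call so the sketch runs without a server.
# The last response is a placeholder, not a real CanDIG message.
responses = iter([
    {"status": "still in queue"},
    {"status": "still in queue"},
    {"status": "hypothetical final result"},
])
print(wait_for_ingest(lambda: next(responses), interval_seconds=0))
```

In real use, fetch_status would perform the GET request shown above and return the parsed JSON.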

If you get a message such as "no such sample": "sample SAMPLE_0600 does not exist in clinical data", the sample that the file links to cannot be found in the currently ingested clinical data. Ensure that you have ingested all samples referred to in the genomic JSON file before attempting the ingest.
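To see in advance which samples a genomic JSON file refers to, you can list them before submitting. The sketch below is a hypothetical helper (referenced_sample_ids is not part of CanDIG) and only collects the submitter_sample_id values from the experiments. Comparing them against the ingested clinical data, and how these values correspond to the clinical identifiers in your deployment, is left to you.

```python
def referenced_sample_ids(body):
    """Collect the submitter_sample_id of every experiment in the request."""
    return sorted({exp["submitter_sample_id"] for exp in body.get("experiments", [])})


# Experiment taken from the example request in this section.
body = {
    "experiments": [
        {
            "program_id": "string",
            "experiment_id": "SEQ_0001",
            "submitter_sample_id": "LOCAL-sample_registration_id_1",
        }
    ]
}
print(referenced_sample_ids(body))  # prints ['LOCAL-sample_registration_id_1']
```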