De novo sequencing of SARS-CoV-2, the virus causing the acute respiratory infection COVID-19, using the PacBio Sequel next-generation sequencing system
To carry out this task, the Institute of Biotechnology collaborated with the Pasteur Institute in Ho Chi Minh City and the National Institute of Hygiene and Epidemiology (NIHE) to develop a technical process for sequencing the entire genome of SARS-CoV-2 using PacBio's long-read sequencing technology. The task was completed after one year with the successful construction of a six-step RNA virus genome sequencing process: (1) culturing the virus and isolating viral RNA; (2) synthesis of double-stranded cDNA from viral RNA; (3) preparation of the DNA library for sequencing; (4) sequencing of the entire SARS-CoV-2 genome; (5) de novo assembly of the viral genome; (6) annotation and analysis of the viral genome. The first two steps were carried out at the Pasteur Institute in Ho Chi Minh City and NIHE under biosafety level 3 clean-room conditions. The final four steps were performed at the Institute of Biotechnology and took about 48 hours.
The project sequenced the entire genomes of four strains of SARS-CoV-2, each over 29,500 nucleotides long, and successfully annotated 14 viral ORFs. The assembled genomes contained no read errors or gaps. Sequencing quality reached Q40 (equivalent to 99.99% accuracy). The analysis showed that the virus strain isolated by the Pasteur Institute in Ho Chi Minh City contained 10 mutations in the genes encoding Nsp2, Nsp3, RNA primase, helicase, the S protein, and the N protein. This strain was isolated from a Vietnamese patient who left Pennsylvania, USA, on March 15, 2020, and landed in Ho Chi Minh City on March 17, 2020. The remaining three virus samples, provided by NIHE, all originated from the outbreak at Bach Mai Hospital and were collected on March 25 and 28, 2020. These strains shared the same 5 mutations, and one strain contained 6. The mutations involved the genes encoding Nsp3, RNA primase, and the S and N proteins. All four strains contained the D614G mutation in the S protein.
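The Q40 figure above follows from the standard Phred quality definition, Q = -10 · log10(p), where p is the probability that a base call is wrong. The short Python sketch below illustrates that conversion only; the function names are illustrative and are not part of the project's pipeline.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy (1 - error probability) for a Phred score q."""
    return 1.0 - phred_to_error_probability(q)


def error_probability_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q40 corresponds to one expected error per 10,000 bases.
    for q in (20, 30, 40):
        print(f"Q{q}: accuracy = {phred_to_accuracy(q):.4%}")
```

At Q40 the expected error rate is one base in 10,000, so a genome of roughly 29,500 nucleotides would be expected to contain about three miscalled bases at that per-base accuracy.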
The genomes of the four virus strains sequenced by the Institute of Biotechnology were compared with the sequences of virus samples produced by other units in the country, including the Oxford University Clinical Research Unit (OUCRU), the Pasteur Institute in Ho Chi Minh City, and NIHE. Analysis of the sequences uploaded to the GISAID database up to August 25, 2020 (75 sequences in total) revealed a distinct segregation of strains by time and location, as well as the presence of six GISAID clades (L, S, V, G, GR, and GH) in Vietnam in 2020.
The distribution of virus clades in Vietnam was strongly influenced by the strains circulating worldwide. Strains belonging to clades S, L, and V accounted for most cases among people who returned from China or were linked to Asian countries that traded heavily with China in January and February 2020. Clade GH was strongly associated with cases returning from North America, while clade GR was associated with cases from Europe.
In particular, the strains sequenced in this study all had mutations similar to strains of European and American origin that had been circulating since March 2020, the time when clade G of SARS-CoV-2 (carrying the D614G mutation) began to spread worldwide. The analysis also showed that the strain provided by the Pasteur Institute in Ho Chi Minh City belongs to clade GH, which circulated mainly in North America, while the three strains provided by NIHE from the Bach Mai outbreak belong to clade GR, indicating a European origin and possible transmission from the wave of people entering Hanoi in early March 2020.
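A label such as D614G reads as "reference amino acid, position, new amino acid": aspartic acid (D) at position 614 of the S protein is replaced by glycine (G). The Python sketch below shows one way to parse this notation and check a protein sequence for the substitution. The helper names and the toy peptide are hypothetical and for illustration only; this is not the analysis code used in the study.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    ref: str       # reference amino acid (one-letter code)
    position: int  # 1-based position in the protein
    alt: str       # substituted amino acid (one-letter code)


_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'D614G' into (ref, position, alt)."""
    match = _SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a simple substitution label: {label!r}")
    return Substitution(match.group(1), int(match.group(2)), match.group(3))


def carries(protein: str, sub: Substitution) -> bool:
    """True if the protein has the substituted residue at that position."""
    if sub.position < 1 or sub.position > len(protein):
        return False
    return protein[sub.position - 1] == sub.alt


if __name__ == "__main__":
    print(parse_substitution("D614G"))
    # Toy 5-residue peptides (NOT real spike sequences), checked at position 3.
    toy = parse_substitution("D3G")
    print(carries("MKGLV", toy))  # residue 3 is G
    print(carries("MKDLV", toy))  # residue 3 is still D
```

Real analyses would first align each sample's S protein to the reference so that positions are comparable; the sketch assumes the sequence is already in reference coordinates.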
Figure 1: Phylogenetic tree of SARS-CoV-2 strains collected in Vietnam up to April 1, 2021. The NCBI reference MN908947.3 is included for comparison (black), and the SARS-CoV-2 genomes sequenced and provided by the Institute of Biotechnology are boxed in red. Sequences are colored according to the GISAID classification. The phylogenetic tree shows the evolution of SARS-CoV-2 as well as the order in which the strains entered Vietnam.
A comparison of the genome sequences of the virus strains circulating in Vietnam up to April 1, 2021 (Figure 1) shows that eight SARS-CoV-2 clades under the GISAID classification (S, L, V, G, GR, GH, GV, and GRY) were present in Vietnam, with dozens of different variants.
The successful application of PacBio's long-read genome sequencing technique to SARS-CoV-2 opens the possibility of rapid and accurate viral genome sequencing without relying on international reference genome sequences. This allows Vietnamese scientists to sequence new viral pathogens in the future without needing a reference genome. Genome sequencing data help determine the origin of the virus and the number of sources of infection (F0) in outbreaks, and provide a scientific basis and important information for developing strategies and plans to effectively prevent and control the spread of the virus in the community.
Having mastered this technological process, and with its existing capacity and facilities, the Vietnam Academy of Science and Technology is ready to cooperate with health sector units in sequencing the SARS-CoV-2 genome on a large scale in urgent cases.
Translated by Phuong Huyen