APPLICATION OF SYSTEMS BIOLOGY APPROACH FOR INVESTIGATION OF PAN GENOME AND PHYLOGENETICS IN VARIOUS STRAINS OF ANAPLASMA PHAGOCYTOPHILUM

Anaplasma phagocytophilum, has been observed as an emerging human pathogen of public health importance and is commonly transmitted to humans by tick bites. The bacterium Anaplasma phagocytophilum has been known from decades to cause the disease, tick-borne fever (TBF) in domestic ruminants and cattles in various areas in northern Europe, China, Russian border and United Kingdom. In recent years, outbreak of A.phagocytophilum infection has enhanced multifold and is widely reported in I. persulcatus and engorged as D. silvarum ticks in north eastern regions of China. However, few genome sequences have been completed so far, thus observations on biological, ecological, and pathological differences between genotypes of this bacterium,are yet to be elucidated by molecular and experimental infection studies.
In our current work, We have investigated 4 completely sequenced Genomic strains of A.phagocytophilum using various insilico tools SPINE and AGENT to characterize the percentage of Core Genome, Accessory genome and Pan Genome of these species. Further, we have tried to characterize the serotype and find the resistance genes observed in these four strains using MLST and ResFinder tools available at centre of Genetic epidemiology, Denmark Technical University server. By application of ClustAGE tool, we have made a comparative assessment of accessory genes across these strains. Heatmaps using expression map of these 4 genomes was constructed to infer the conserved genes and variable genes across these strains. Our study led to conclusion that core genome across these strains varies from 1.43 Mbps to 1.47 Mbps and accessory genomes varies from 0.0410 Mbps to 0.0734 Mbps. Comparison of the Gene clusters led to conclusion that gene clusters led to core genome value of 318 and Pan Genome value of 1035. Our analysis characterizes the dominance of accessory genes during evolution of Anaplasma phagocytophilum and lesser conservation of genes as there is a phylogenetic variation observed.


INTRODUCTION
Anaplasma phagocytophilum is a causative agent for human granulocytic anaplasmosis (HGA), a significant tick-borne zoonosis rising in the United States and other portions of the world (Xiong et al, 2019). Anaplasma phagocytophilum, influences a few types of wild and tamed warm-blooded animals. The specialty for A. Phagocytophilum, the neutrophil, demonstrates that the pathogen has special adjustments and pathogenetic mechanisms. HGA is progressively perceived as a significant and successive reason for fever after a tick bite in the Upper Midwest, New England, parts of the Mid-Atlantic States, and many parts of Europe, all territories where Ixodes ticks bite people (Mghirbi, Youmna ,2012 ). A. phagocytophilum genomes are currently available, of which just four are complete (Dunning Hotopp JC, Lin M, Madupu R et al, 2006).Apart from Norway Variant 2, obtained from a Norwegian sheep, all genomes correspond to North American strains: human strains HZ, HZ2, and HGE1, Dog2 dog strain, MRK horse strain, JM rodent strain, and the tick (Ixodes scapularis) strains CRT38 and CRT35. The number of anaplasmosis cases reported to CDC has increased steadily since the disease became reportable, from 348 cases in 2000, to a peak of 5,762 in 2017 (CDC Reports, 2017). However, cases reported in 2018 were substantially lower. The case fatality rate (i.e., the proportion of anaplasmosis patients that reportedly died as a result of infection) has remained low, at less than 1%.Molecular modelling and computational chemistry approaches are applied to model the proteins. (Adejoro I.A et al, 2012). In silico prediction of biological activity using PASS in relation to the chemical structure of a compound is now a commonly used technique in drug discovery and development to predict the biological activity spectrum for a compound on the basis of its structural formula. (Valli G., Ramu K., Mareeswari P.,2012) In South Korea, Ixodes spp. ticks are uncommon.( Park SW, Song BG, Shin EH, Yun SM,2014) However, A. phagocytophilum has been demonstrated in Haemaphysalis longicornis ticks, which are the most abundant species in South Korea (Kim CM, Kim MS, Park MS, Park JH, Chae JS, 2003) and this has led to growing concern about the possible emergence of HGA in South Korea. Actually, recent seroprevalence studies have shown that 1.8% of serum samples from febrile patients were positive for A. phagocytophilum in an immunofluorescence assay (IFA) test in 2002, and in 2003 the percentage was 8.9% from patients with symptoms of high fever suspected mainly scrub typhus. Many of the Thiohydantoine derivatives which are known to possess antitumorous activity have been tested against strains of Anaplasma. (Nillohit Mitra Ray, Rahul Singh, et al 2020) Comparative genomic analyses lend insight into structural features such as variations related to genomic rearrangements, changes in the gene repertory, identification of horizontal gene transfer elements and prophage-related sequences, and hence expose particularities on the evolution in species. ( Daniel Castillo et al, 2016) In this work, we have used various insilico tools to investigate Core genome and Pan Genome of Anaplasma species and further applied ClustAGE to identify conserved accessory elements across these species.

OBTAINING GENOMIC DATA OF STRAINS
Complete genomic sequences of 4 strains , Anaplasma phagocytophilum str HZ, RefSeq NC_007797.1,Genomic assembly GCA_000013125.1; Anaplasma phagocytophilum str HZ2, RefSeq NC_021879,Genomic assembly GCA_000439755.1; Anaplasma phagocytophilum str JM, RefSeq NC_021880.1,Genomic assembly GCA_000439775.1; Anaplasma phagocytophilum str. Norway variant2, RefSeq NZ_CP015376.1,Genomic assembly GCA_000689635.2 was downloaded from FTP site of National centre for Biotechnology information. Feature count and feature details of these strains were saved for analysis using opensource online servers. Genomic size of HZ Anaplasma phagocytophilum, has been observed as an emerging human pathogen of public health importance and is commonly transmitted to humans by tick bites. The bacterium Anaplasma phagocytophilum has been known from decades to cause the disease, tick-borne fever (TBF) in domestic ruminants and cattles in various areas in northern Europe, China, Russian border and United Kingdom. In recent years, outbreak of A.phagocytophilum infection has enhanced multifold and is widely reported in I. persulcatus and engorged as D. silvarum ticks in north eastern regions of China. However, few genome sequences have been completed so far, thus observations on biological, ecological, and pathological differences between genotypes of this bacterium,are yet to be elucidated by molecular and experimental infection studies. In our current work, We have investigated 4 completely sequenced Genomic strains of A.phagocytophilum using various insilico tools SPINE and AGENT to characterize the percentage of Core Genome, Accessory genome and Pan Genome of these species. Further, we have tried to characterize the serotype and find the resistance genes observed in these four strains using MLST and ResFinder tools available at centre of Genetic epidemiology, Denmark Technical University server. By application of ClustAGE tool, we have made a comparative assessment of accessory genes across these strains. Heatmaps using expression map of these 4 genomes was constructed to infer the conserved genes and variable genes across these strains. Our study led to conclusion that core genome across these strains varies from 1.43 Mbps to 1.47 Mbps and accessory genomes varies from 0.0410 Mbps to 0.0734 Mbps. Comparison of the Gene clusters led to conclusion that gene clusters led to core genome value of 318 and Pan Genome value of 1035. Our analysis characterizes the dominance of accessory genes during evolution of Anaplasma phagocytophilum and lesser conservation of genes as there is a phylogenetic variation observed. strain is 1.471 Mbps, consists of 1,155 coding genes and has GC content of 41.6 percent, Genomic size of HZ2 strain is 1.477 Mbps, consists of 1,154 coding genes and has GC content of 41.6 percent ; Genomic size of JM strain is 1.481 Mbps, consists of 1,162 coding genes and has GC content of 41.6 percent while Genomic size of Norway variant2 strain (NV) is 1.545 Mbps, consists of 1,179 coding genes and has GC content of 41.7 percent.

SEROTYPING AND RESISTANCE GENES PROFILING
Batch upload of assembled genomic sequences of these 4 strains of A.phagocytophilum in Fasta format was done at Bacterial analysis pipeline (available at https://cge.cbs.dtu.dk/services/cge/) tool available at Centre for Genomic epidemiology. At cut off percentage of 80%, resistance genes present in these strains and serotyping of these strains with known bacterial strains was done.ResFinder is a prominent tool to find known existing resistance genes in microbial species and MLST (Kleinheinz KA, Joensen KG, Larsen MV, 2014)is used to detect for typing the species based on local alignment of query sequence with known target Kmers stored in the database of CGE. Based on maximum HSP length and percentage similarity, Species typing results are obtained.

ANALYSIS OF CORE GENOME, ACCESSORY GENOME AND PAN GENOME
Genomic sequences of all the four strains of A.phagocytophilum were submitted to Spine tool available at http://vfsmspineagent.fsm.northwestern.edu. in order to identify the core genome and accessory genome of each individual strain and generation of Pan Genome data as a whole. SPINE is a script written in Perl to identify core genome of various from genomic DNA sequences. It uses NUCmer parameters (Ozer EA, Allen JP, and Hauser AR, 2014). Further, AGEnt tool was used to find out accessory genome of each sequence. CORE genome data obtained as an output of SPINE was used as an input in AGENT. Accessory genome data generated by AGENT was further processed as an input file for ClustAGE application (Ozer, E.A, 2018). ClustAGE application is freely available at this site http://vfsmspineagent.fsm.northwestern.edu/cgibin/clustage.cgi to construct the map of accessory genes of various strains. We need to use the output generated by AGENT for accurate generation of ClustAGE map.Further, there is limitation of maximum of 15 accessory genomes for web server of ClustAGE. We had run ClustAGE on accessory genome dataset of 4 strains in fasta format using the default settings of a minimum of 80% identity in nucleotide sequence and threshold of hit length of 100 bp at minimum. EDGAR application was used to find out relationship between orthologous gene clusters of these strains. Standard Decay function with fitted equation 1034.555. * x^0.144 (α=0.586) was applied for Pan Genome analysis.

RESULTS AND DISCUSSION
Genomic data availability at NCBI FTP site showed 30 sequenced samples available of A.phagocytophilum and it included four whole genome sequences available for sequences under study. Dendrogram of Anaplasma strains ( Figure  1) depicts close similarity and phylogenetic relationship between HZ,HZ2 and JM strains. However Norway Variant2 is most distant homologue as inferred from maximum phylogenetic distance in the lineage map. Bacterial analysis pipeline results of CGE server show absence of any antibiotic resistance gene at threshold of 80% and 90% similarity in known resistance genes and A.phagocytophilum genomic sequences. However,HZ and HZ2 strains have shown species typing matched to ST161, JM strain to ST-64 and NV strain to ST-82. (Figure 2).  (Table 1). Smallest output segment of core genome varies on an average from 77,748 bases to largest output segment of 7,62,043 bases across these species while Accessory genome varies from 10 bases to 2561 bases. (Figure 3).Maximum variation is observed across the A. phagocytophilum strain Norway variant2. Pangenome development plot (Figure 4) obtained from EDGAR shows approximately 1034 new genes are added to each new species and core genome plot ( Figure 5) shows that as new species are added ,319 genes are conserved in these species showing substantial phylogenetically consensus patterns to be observed during addition of new genomes. ClustAGE plot generated 153 bin gene clusters varying across these genomes and reveals that variable genes of size 5 kbp to 70 kbp are existing across strains HZ, HZ2,JM and NV variants.

CONCLUSION
Identification and analysis of Core Genome and accessory genomic elements in Anaplasma phagocytophilum variants is critical to understand the evolutionary relationship,niche adaptation, virulence factors and infectious potential. Our analysis of sequences NC_007797.1, NC_021879, NC_021880.1 and NZ_CP015376.1 has revealed that only 5% of genomes of this species contains accessory genetic elements and majority of genes are conserved during evolution. Core Genome of Anaplasma phagocytophilum is composed of 1. Variant2 respectively. On analysis by AGENT, 179 accessory genetic elements were obtained between these species varying in size from 10 basepairs to 814 basepairs. Pan Genome analysis shows that on every new genome addition, 1035 genes will be added which is not a very huge number considering the average genome clusters to be 739 in these strains. Further, Species typing has led to interesting insights on matching of substantial portion of genome of this pathogen to Salmonella strains in CGE server. These insights are crucial for developing structure based drugs against Anaplasma species and targeting the infectious potential of this pathogen. High genetic lineage and similarity is a substantial benefit to be explored in developing therapies against this emerging pathogen.