Differential gene expression analysis

On Tuesday, October 25, the StatOmique working group, supported by the GdR BioInformatique Moléculaire, held a session on differential gene expression analysis at the new Agro Paris-Saclay campus, 22 place de l’Agronomie, Palaiseau (GPS coordinates: 48.71401, 2.1964).

Directions to the campus are available on the school’s website.

Program

Abstract

Inter-species RNA-Seq datasets are increasingly common, and have the potential to answer new questions about gene expression patterns across evolution. Single-species differential expression analysis is now a well-studied problem that benefits from sound statistical methods. Extensive reviews on biological or synthetic datasets have given the community a clear picture of the relative performance of the available tools in various settings. Such benchmarks are still missing in the inter-species gene expression context. In this work, we take a first step in this direction by developing and implementing a new simulation framework. This tool builds on both the RNA-Seq and the Phylogenetic Comparative Methods literature to generate realistic count datasets, while taking into account the phylogenetic relationships between the samples. We illustrate the features of this new framework through a targeted simulation study, which reveals some of the strengths and weaknesses of both the classical and phylogenetic approaches for inter-species differential expression analysis. The tool has been integrated into the R package compcodeR, freely available on Bioconductor.

References

Bastide, P., Soneson, C., Lespinet, O., Gallopin, M. Benchmark of Differential Gene Expression Analysis Methods for Inter-species RNA-Seq Data using a Phylogenetic Simulation Framework. bioRxiv:2022.01.21.476612

Soneson, C. compcodeR - an R package for benchmarking differential expression methods for RNA-seq data. Bioinformatics, 30(17), 2517-2518 (2014).

Abstract

Current advances in omics technologies are paving the way for the diagnosis of rare diseases by offering a complementary assay to identify the responsible gene. The use of transcriptomic data to identify aberrant gene expression (AGE) has been shown to reveal potentially pathogenic events. However, popular approaches for AGE identification are limited by their reliance on statistical tests, which require choosing an arbitrary significance cut-off and having several replicates, something not always possible in clinical contexts. Machine learning methods based on neural networks, including autoencoders (AEs) and variational autoencoders (VAEs), have shown promising performance in medical fields. Here, we describe ABEILLE (ABerrant Expression Identification empLoying machine LEarning from sequencing data), a novel method for identifying AGE from RNA-seq data without the need for replicates or a control group, using a flexible model obtained after testing several parameters. ABEILLE combines a VAE, which can model any data without specific assumptions about their distribution, with a decision tree that classifies genes as AGE or non-AGE. An anomaly score is assigned to each AGE in order to stratify them by severity of aberration. We compare the performance of ABEILLE with state-of-the-art alternatives on semi-synthetic data and a real dataset, demonstrating the importance of the flexibility of the VAE configuration for identifying potential pathogenic candidates.

References

Justine Labory, Gwendal Le Bideau, David Pratella, Jean-Elisée Yao, Samira Ait-El-Mkadem Saadi, Sylvie Bannwarth, Loubna El-Hami, Véronique Paquis-Fluckinger, Silvia Bottini, ABEILLE: a novel method for ABerrant Expression Identification empLoying machine Learning from RNA-sequencing data, Bioinformatics, 2022, btac603.

ABEILLE source code is freely available at https://github.com/UCA-MSI/ABEILLE.

Abstract

CRISPR-Cas9 screens make it possible to interrogate gene function at large scale. Genetic perturbations are introduced into pools of cells, which are then subjected to a selective pressure (cell competition, treatment, viral infection, etc.). The effects of these perturbations are then assessed by sequencing the guide RNAs specific to each gene. Analyzing this type of data raises several questions because of the characteristics of the experiment (notably the presence of several guides per gene). After presenting the technology and its goals, we will go through the analysis workflow that was set up: counting, guide RNA-level analysis, and gene-level aggregation.
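The three steps named in this abstract (counting, guide-level analysis, gene-level aggregation) can be illustrated with a minimal Python sketch. It is not the workflow presented in the talk: the function names, the exact-match counting, the counts-per-million normalization, the pseudocount and the median aggregation are all assumptions chosen for illustration, and the sequences and gene names are toy values.

```python
import math
from collections import Counter, defaultdict
from statistics import median


def count_guides(reads, seq_to_guide):
    """Count reads that exactly match a known guide sequence (assumed matching rule)."""
    counts = Counter({guide: 0 for guide in seq_to_guide.values()})
    for read in reads:
        guide = seq_to_guide.get(read)
        if guide is not None:  # reads matching no guide are ignored
            counts[guide] += 1
    return dict(counts)


def normalize(counts):
    """Scale guide counts to counts per million so samples are comparable."""
    total = sum(counts.values())
    return {guide: c * 1e6 / total for guide, c in counts.items()}


def guide_log_fold_changes(before, after, pseudocount=1.0):
    """Guide-level analysis: log2 fold change after vs. before selection."""
    nb, na = normalize(before), normalize(after)
    return {
        guide: math.log2((na[guide] + pseudocount) / (nb[guide] + pseudocount))
        for guide in before
    }


def aggregate_by_gene(guide_lfc, guide_to_gene):
    """Gene-level aggregation: median of the guide-level log fold changes."""
    per_gene = defaultdict(list)
    for guide, lfc in guide_lfc.items():
        per_gene[guide_to_gene[guide]].append(lfc)
    return {gene: median(values) for gene, values in per_gene.items()}


# Toy library: two guides per gene (hypothetical sequences and names).
seq_to_guide = {"AAAA": "g1", "CCCC": "g2", "GGGG": "g3", "TTTT": "g4"}
guide_to_gene = {"g1": "GENE_A", "g2": "GENE_A", "g3": "GENE_B", "g4": "GENE_B"}

before_reads = ["AAAA"] * 100 + ["CCCC"] * 120 + ["GGGG"] * 110 + ["TTTT"] * 90 + ["ACGT"] * 5
after_reads = ["AAAA"] * 20 + ["CCCC"] * 30 + ["GGGG"] * 190 + ["TTTT"] * 180

before = count_guides(before_reads, seq_to_guide)
after = count_guides(after_reads, seq_to_guide)
gene_scores = aggregate_by_gene(guide_log_fold_changes(before, after), guide_to_gene)
```

In this toy example the guides targeting GENE_A are depleted after selection and those targeting GENE_B are enriched, so the gene-level scores come out negative and positive respectively. Dedicated screen-analysis tools use statistical models rather than a plain median; the sketch only shows how several guides per gene are collapsed into one score.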

Abstract

By reproducing the differential expression analysis simulation results presented by Li et al., we identified a caveat in the data generation process. Data that were not truly generated under the null hypothesis led to incorrect comparisons of the benchmarked methods. We provide corrected simulation results that demonstrate the good performance of dearseq and argue against the superiority of the Wilcoxon rank-sum test suggested by Li et al.
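As a generic illustration of what "generated under the null hypothesis" means for semi-synthetic data, the Python sketch below shuffles condition labels across whole samples. It is not the procedure used by Li et al., nor the correction proposed in the preprint; the function names and toy counts are assumptions made for the example.

```python
import random


def permute_labels(labels, seed=0):
    """Return a shuffled copy of the condition labels (whole samples are relabeled)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def library_sizes(count_matrix):
    """Total count per sample; count_matrix is a list of gene rows of per-sample counts."""
    return [sum(column) for column in zip(*count_matrix)]


# Toy count matrix: 3 genes (rows) x 4 samples (columns), hypothetical values.
counts = [
    [10, 12, 30, 28],
    [5, 7, 6, 4],
    [100, 90, 95, 110],
]
labels = ["A", "A", "B", "B"]

# Only the labels change: the counts, and hence each sample's library size,
# are left untouched, so no gene is tied to the new labels by construction.
null_labels = permute_labels(labels, seed=1)
sizes = library_sizes(counts)
```

Because each sample keeps its own counts, any normalization computed from library sizes gives the same factors before and after relabeling, which is one simple property to check when building a null dataset.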

References

Hejblum, B., Ba, K., Thiébaut, R., Agniel, D. Neglecting normalization impact in semi-synthetic RNA-seq data simulation generates artificial false positives. bioRxiv:2022.05.10.490529

Li, Y., Ge, X., Peng, F. et al. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology, 23, 79 (2022). doi.org/10.1186/s13059-022-02648-4

  • 16:30 Discussion: exchange on practices in differential gene expression analysis

  • 17:00 Close of the day