Bioinformatics · Data Science · Pipelines · Automation

Salvatore Barbagallo

I combine 7+ years in clinical genomics with computational engineering to build reproducible NGS pipelines, ML models, and data workflows that turn biological data into decisions.

Open to opportunities · London · Remote
Right to work · UK & EU
// impact
7+ · Years in clinical genomics
500+ · NGS reports/week
11 · Clinical trials
30% · Error reduction via automation
// core skills
Python · R · SQL · Bash · Nextflow · Docker · RNA-seq · WGS · DESeq2 · SnpEff · scikit-learn · PyTorch · Tableau · BigQuery · Galaxy
// languages
Italian Native
English Fluent
Portuguese Fluent
Spanish Advanced
// education
MSc Bioinformatics (ongoing)
Atlantic Technological University
MSc Cell & Gene Therapy
University College London
BSc Biomedical Science
University of Catania

About

I combine clinical and genomics laboratory experience with data analysis and process automation to build reliable pipelines, reporting tools, and QC workflows that turn complex biological data into usable results. My background spans regulated clinical environments where I built Python-driven tools, automated QC pipelines, and worked with NGS from library prep through to clinical reporting.

I am currently pursuing an MSc in Bioinformatics at Atlantic Technological University to strengthen my skills in computational analysis, reproducible pipeline development, and automation.

Pipelines & Infrastructure
Nextflow · Docker · Galaxy · Git · Workflow automation
NGS / Omics
RNA-seq · WGS · BWA-MEM2 · STAR · bcftools · SnpEff · DESeq2 · IGV
ML / Data
scikit-learn · XGBoost · PyTorch · Keras · Tableau · BigQuery
Featured Projects
QA Bot for Documents using RAG
Built a Retrieval-Augmented Generation system for PDF question answering using LangChain, Chroma, IBM watsonx, FastAPI, and Gradio.
  • Objective: Build a document question-answering system that retrieves relevant PDF content before generating answers.
  • Approach: Implemented a full RAG pipeline covering document loading, chunking, embeddings, vector storage, retrieval, and answer generation.
  • Outcome: Produced a modular, deployment-ready document QA system that demonstrates practical LLM application design.
LLM · RAG · LangChain · Chroma · FastAPI · Gradio · watsonx
View on GitHub
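The retrieval half of the RAG pipeline described above (chunking, embedding, retrieval, prompt assembly) can be illustrated with a minimal standard-library sketch. This is not the project's code: the bag-of-words `embed` function is a toy stand-in for a real embedding model and vector store such as Chroma, and all function names, chunk sizes, and the prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=12, overlap=4):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real system the retrieved chunks would come from a vector store and the prompt would be passed to an LLM; here `build_prompt(question, retrieve(question, chunks))` simply shows how the two stages connect.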
Deep Learning Experiments (Keras vs PyTorch, CNN, ViT)
Comparative deep learning experiments using Keras and PyTorch across CNNs, Vision Transformers, and transfer learning for image classification tasks.
  • Objective: Compare deep learning approaches and frameworks across multiple computer vision tasks.
  • Approach: Implemented CNNs, transfer learning models, and Vision Transformers in both Keras and PyTorch, evaluating performance and training behaviour.
  • Outcome: Demonstrated differences in performance, flexibility, and workflow between frameworks and architectures.
Deep Learning · PyTorch · Keras · CNN · Vision Transformer · Transfer Learning · Computer Vision
View on GitHub
Customer Segmentation (Unsupervised Learning)
Segmented 7,043 telecom customers using K-Means, hierarchical clustering, and DBSCAN, identifying a high-risk segment with ~42% churn compared to ~14% in the most stable group.
  • Objective: Identify meaningful customer segments to enable targeted retention strategies instead of treating churn as a uniform problem.
  • Approach: Applied clustering methods on standardised features and selected the final model using silhouette score.
  • Outcome: Produced interpretable segments that can guide retention campaigns, onboarding improvements, and pricing strategies.
PCA projection of customer clusters (K-Means, k=3)
Machine Learning · Unsupervised Learning · Clustering · K-Means · Scikit-learn · Pandas · Customer Segmentation
View on GitHub
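Selecting a clustering by silhouette score, as in the segmentation project above, can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's code (which used scikit-learn): the points and candidate labellings are made-up toy data, and the helper names `mean_silhouette` and `pick_best` are hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient for one labelling (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: score is defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def pick_best(points, candidates):
    """Return the name of the candidate labelling with the highest score."""
    return max(candidates, key=lambda name: mean_silhouette(points, candidates[name]))


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {
    "two_clusters": [0, 0, 0, 1, 1, 1],
    "three_clusters": [0, 0, 2, 1, 1, 1],
}
best = pick_best(points, candidates)
```

A score near 1 means points sit much closer to their own cluster than to the nearest other one, so the labelling with the highest mean is preferred. On real data the candidate labellings would come from fitted models such as K-Means, hierarchical clustering, or DBSCAN.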
M. tuberculosis WGS Variant Analysis Workflow (Galaxy)
Built an end-to-end WGS workflow for Mycobacterium tuberculosis drug resistance analysis, standardising variant detection and interpretation across QC, alignment, annotation, and read-level validation.
  • Problem: Resistance interpretation from WGS data requires a long multi-step workflow with multiple failure points and high risk of inconsistent analysis.
  • Action: Built and executed a Galaxy-based workflow covering QC, trimming, alignment, coverage analysis, variant calling, annotation, and IGV-based validation.
  • Impact: Produced a structured and reproducible resistance analysis workflow supporting clearer interpretation of clinically relevant loci.
Framed Galaxy workflow for M. tuberculosis WGS variant analysis
Bioinformatics · WGS · Galaxy · BWA-MEM2 · bcftools · SnpEff · IGV
View on GitHub
RNA-seq Pipeline (Nextflow + Docker)
Eliminated inconsistent and error-prone RNA-seq analyses by building a containerised workflow that standardises processing from raw FASTQ to differential expression outputs.
  • Problem: RNA-seq analyses are often manually assembled, environment-dependent, and difficult to reproduce across machines.
  • Action: Built a containerised Nextflow pipeline integrating FastQC, Cutadapt, STAR, featureCounts, MultiQC, and DESeq2 into a one-command workflow.
  • Impact: Reduced setup friction, improved reproducibility, and produced standardised outputs suitable for scalable downstream expression analysis.
Proof run: Nextflow test execution (Docker profile)
Bioinformatics · Pipelines & Automation · Nextflow · Docker · RNA-seq · STAR · DESeq2
View on GitHub
Additional Projects
Bellabeat - Wearable Health Data Analysis
Transformed fragmented wearable data into a structured analytics dataset to uncover behavioural patterns across activity, sleep, heart rate, and weight for 35 users.
  • Problem: Raw Fitbit data was split across multiple files and not directly usable for behavioural analysis or product-facing insight generation.
  • Action: Cleaned, merged, and summarised wearable datasets using R, dplyr, and SQLite, then created a Tableau-ready user summary with 15 behavioural and wellness variables.
  • Impact: Enabled clearer identification of activity, sleep, and health trends to support data-driven product and marketing recommendations.
Bellabeat Tableau Dashboard
Data Analytics · R · SQL · dplyr · Tableau · Wearable Data
View on GitHub
Google Fiber - Contact Centre Analytics & BI Dashboard
Identified repeat-contact patterns across 85,179 customer interactions by consolidating fragmented regional datasets into a unified analytics view, enabling clearer visibility into service inefficiencies.
  • Problem: Customer contact data was split across regional datasets, making it difficult to understand repeat-contact behaviour and pinpoint service pain points.
  • Action: Consolidated 3 datasets into a unified analysis table covering 1,350 records and 85,179 contact events across 5 contact types, then built a Tableau dashboard to surface repeat-contact patterns and operational friction.
  • Impact: Turned fragmented customer-support data into a decision-ready BI view that can support targeted service improvements and reduction of avoidable support load.
Google Fiber Tableau Dashboard
Business Intelligence · Customer Analytics · Tableau · SQL · KPI Design · Telecom Analytics
View on GitHub
Salifort Motors - Employee Attrition Prediction
Built predictive models on 14,999 employee records to identify the drivers of 23.8% turnover and translate model outputs into retention-focused recommendations.
  • Problem: The business faced substantial employee turnover, but the main drivers of attrition were not clearly understood.
  • Action: Performed exploratory analysis, engineered features, and trained classification models using Python, scikit-learn, and XGBoost to analyse attrition patterns across 10 departments.
  • Impact: Produced evidence-based recommendations to support workforce retention strategy and prioritise the factors most associated with employee loss.
Data Science · Machine Learning · Python · scikit-learn · XGBoost · HR Analytics
View on GitHub
TikTok - Social Media Engagement Analysis
Analysed 19,382 posts to uncover highly skewed engagement patterns and identify content features associated with stronger performance.
  • Problem: Engagement was unevenly distributed, making it difficult to understand which content characteristics were linked to stronger performance.
  • Action: Used Python, pandas, and visual analytics to explore engagement metrics, inspect distribution patterns, and examine relationships between content features and post performance.
  • Impact: Generated evidence-based recommendations to support content optimisation and stronger audience engagement strategy.
Data Analytics · Exploratory Data Analysis · Python · Pandas · Matplotlib · Plotly
View on GitHub
AWS Managed Services - Cloud Migration Architecture
Designed a high-level AWS architecture for migrating on-premises workloads to a cloud-native, fully managed solution, with a focus on scalability, fault tolerance, and operational efficiency.
  • Objective: Migrate two on-prem workloads, a three-tier web application and a Hadoop-based analytics environment, into a modern AWS environment with managed services.
  • Approach: Designed an end-to-end cloud solution using AWS managed services including CloudFront, S3, ECS on Fargate, ALB, Aurora MySQL, ElastiCache, SQS, EMR, Glue, Athena, Redshift, and QuickSight.
  • Outcome: Produced a decoupled, fault-tolerant, multi-AZ architecture that modernises both workloads while reducing operational overhead through managed services.
AWS Architecture Diagram - Migration Solution
Cloud Architecture · AWS · Data Infrastructure · ECS Fargate · Aurora MySQL · Redshift
View on GitHub

Experience

UCL Hospitals
Sep 2021 – Dec 2025
London, UK
Specialist Biomedical Scientist · Stem Cell Laboratory
  • Delivered processing and cryopreservation of PBSCs, bone marrow, DLI, and CD34+ enriched products in a regulated clinical environment.
  • Generated flow cytometry data on CD3+ and CD34+ populations for time-sensitive clinical decision-making.
  • Supported 11 active clinical and ATMP trials, including CAR-T therapies, with responsibility for traceability, documentation quality, and audit-ready data handling in a high-stakes regulated clinical setting.
  • Designed Python/Excel tracking tools to replace manual reconciliation, reducing inventory errors by ~30% and saving ~10 hours/week.
  • Led digitisation of SOPs and QA documentation, standardising data handling across workflows.
CooperGenomics
Jul 2019 – Sep 2021
London, UK
Laboratory Scientist · Clinical Genomics
  • Processed embryo samples for PGT-A, PGT-SR, and PGT-M testing within a high-throughput clinical genomics pipeline.
  • Delivered NGS library preparation and QC across 96–192 samples/run; produced 500+ clinical reports per week.
  • Programmed and validated Mosquito HV and Dragonfly liquid handlers, improving workflow scalability and reproducibility.
  • Contributed to SOP writing and review, strengthening ISO-compliant laboratory practice.
Leicester Royal Infirmary
Dec 2018 – Jun 2019
Leicester, UK
Biomedical Laboratory Assistant · Cytology
  • Managed sample reception and prepared specimens for Papanicolaou staining.
  • Maintained reagents and ensured sample integrity end-to-end.

Education

2014 – 2017
BSc, Biomedical Science
University of Catania
Catania, Italy
Cytotoxicity assays using SIRC, ARPE-19, and HRPE cells
2021 – 2023
MSc, Cell & Gene Therapy
University College London
London, UK
Expansion and Preservation of Haematopoietic Potential in Human Amniotic Fluid Stem Cells for Therapeutic Applications
Sep 2025 – Present
MSc, Bioinformatics
Atlantic Technological University
Letterkenny, Ireland · Remote
Planned dissertation: Two-sample Mendelian randomisation and Bayesian colocalisation for causal inference using GWAS and eQTL data

Certifications

IBM
Machine Learning · AI Engineering
Google
Data Analytics · Advanced Data Analytics · IT Automation with Python · Project Management · Business Intelligence
Google Cloud
Architecting with Google Kubernetes Engine
AWS
Cloud Practitioner Essentials · Cloud Solutions Architect
Johns Hopkins
Genomic Data Science Specialization
SAS
SAS Programming 1: Essentials · SAS Programming 2: Data Manipulation Techniques
Wellcome
Bioinformatics for Biologists: Linux, Bash, R · Analysing Genomics Datasets
freeCodeCamp
Data Analysis with Python · Relational Databases · Scientific Computing
DE<code>LIFE
Genomes, Networks & Pathways · Data Science & Machine Learning
Coursera
Access Bioinformatics Databases with Biopython
Le Wagon
Data Visualization with Tableau

Contact

Let's Connect

Open to opportunities in bioinformatics, data science, and data pipeline automation. Based in London and remote-friendly.