Bioinformatics · Data Science · Pipelines · Automation

Salvatore Barbagallo

I combine 7+ years in clinical genomics with computational engineering to build reproducible NGS pipelines, ML models, and data workflows that turn biological data into decisions.

Open to opportunities · London · Remote

Right to work · UK & EU

// impact

7+

Years in clinical genomics

500+

NGS reports/week

11

Clinical trials

30%

Error reduction via automation

// education

MSc Bioinformatics (ongoing)

Atlantic Technological University

MSc Cell & Gene Therapy

University College London

BSc Biomedical Science

University of Catania

About

I combine clinical and genomics laboratory experience with data analysis and process automation to build reliable pipelines, reporting tools, and QC workflows that turn complex biological data into usable results. My background spans regulated clinical environments where I built Python-driven tools, automated QC pipelines, and worked with NGS from library prep through to clinical reporting.

I am currently pursuing an MSc in Bioinformatics at Atlantic Technological University to strengthen my skills in computational analysis, reproducible pipeline development, and automation.

Pipelines & Infrastructure

Nextflow · Docker · Galaxy · Git · Workflow automation

NGS / Omics

RNA-seq · WGS · BWA-MEM2 · STAR · bcftools · SnpEff · DESeq2 · IGV

ML / Data

scikit-learn · XGBoost · PyTorch · Keras · Tableau · BigQuery

Projects

View all on GitHub ↗

Featured Projects

RegulonML - Real MPRA Regulatory Variant-Effect Modelling

Built a real-data MPRA modelling workflow using public saturation-mutagenesis reporter-assay data to prepare regulatory variant features, train interpretable regression models, and evaluate variant-effect prediction with leakage-aware element-level splitting.

Problem: Regulatory variant-effect modelling requires careful handling of MPRA count data, assay metadata, and train/test leakage between variants from the same regulatory element.
Action: Added scripts to download public Kircher saturation-mutagenesis MPRA data, filter promoter elements, engineer variant/count features, and train Elastic Net and Random Forest regression models.
Impact: Created a defensible real-data regulatory-genomics workflow with documented provenance, element-level group splitting, permutation importance, and clear limitations around full promoter-design claims.
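The element-level group splitting mentioned above can be sketched as follows. This is a minimal standard-library illustration, not the project's code: the record fields (`element`, `position`, `effect`), the helper name `group_train_test_split`, and the toy data are all hypothetical. scikit-learn provides the same idea through its group-aware splitters such as `GroupShuffleSplit`.

```python
import random
from collections import defaultdict


def group_train_test_split(records, group_key, test_fraction=0.25, seed=0):
    """Split records so that every group lands entirely in train or in test."""
    by_group = defaultdict(list)
    for record in records:
        by_group[record[group_key]].append(record)

    groups = sorted(by_group)             # sort first for a reproducible order
    random.Random(seed).shuffle(groups)   # seeded shuffle of whole groups

    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [r for g in groups if g not in test_groups for r in by_group[g]]
    test = [r for g in groups if g in test_groups for r in by_group[g]]
    return train, test


# Hypothetical toy data: 4 regulatory elements with 5 variants each.
variants = [
    {"element": f"promoter_{e}", "position": p, "effect": 0.1 * e - 0.01 * p}
    for e in range(4)
    for p in range(5)
]

train, test = group_train_test_split(variants, group_key="element")
train_elements = {v["element"] for v in train}
test_elements = {v["element"] for v in test}

# No element contributes variants to both sides of the split.
print(train_elements.isdisjoint(test_elements), len(train), len(test))
```

Splitting by element rather than by variant means a model is never scored on variants from a regulatory element it has already seen during training, which is the leakage the Problem statement describes.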

Bioinformatics · MPRA · Regulatory Genomics · Python · scikit-learn · Elastic Net · Random Forest

QA Bot for Documents using RAG

Built a Retrieval-Augmented Generation system for PDF question answering using LangChain, Chroma, IBM watsonx, FastAPI, and Gradio.

Objective: Build a document question-answering system that retrieves relevant PDF content before generating answers.
Approach: Implemented a full RAG pipeline covering document loading, chunking, embeddings, vector storage, retrieval, and answer generation.
Outcome: Produced a modular, deployment-ready document QA system that demonstrates practical LLM application design.
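The retrieval half of the pipeline described above (chunking, embedding, retrieval, prompt assembly) can be sketched in a few lines. This is a toy illustration using only the Python standard library: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and every function name and the sample text are hypothetical. It does not use or imitate the LangChain, Chroma, or watsonx APIs, and the generation step is left out.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into overlapping chunks of up to `chunk_size` words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text):
    """Toy embedding: term counts (an assumption, not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical document text standing in for extracted PDF content.
document = (
    "Nextflow runs each pipeline step inside a Docker container. "
    "STAR aligns RNA-seq reads to the reference genome. "
    "DESeq2 tests genes for differential expression between conditions."
)

chunks = chunk_text(document)
question = "Which tool aligns RNA-seq reads?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Retrieving before generating keeps the prompt small and grounds the answer in text that actually appears in the document.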

LLM · RAG · LangChain · Chroma · FastAPI · Gradio · watsonx

SoftCart Data Engineering Platform

Built an end-to-end e-commerce data engineering project covering raw data ingestion, SQL/NoSQL querying, dimensional warehouse design, ETL automation, PostgreSQL analytics, BI reporting, and Spark workflow practice.

Objective: Build a structured e-commerce data platform that connects operational sales data, NoSQL product data, warehouse modelling, reporting outputs, and scalable analytics practice.
Approach: Reorganised the IBM Data Engineering capstone into a clean portfolio repository, fixed warehouse SQL inconsistencies, loaded dimensional tables into PostgreSQL, and generated validated analytical CSV outputs.
Outcome: Produced a reproducible data engineering artefact with 300,000 fact rows loaded, warehouse validation notes, sales aggregations by country/category, rollup and cube outputs, ETL scripts, MongoDB commands, Spark notebook, and BI report.

Figure: SoftCart data platform architecture, showing the flow from customer-facing web activity into MySQL and MongoDB, through PostgreSQL staging and DB2 warehousing, with Cognos Analytics for BI reporting and Spark for large-scale analytics.

Data Engineering · PostgreSQL · SQL · MongoDB · ETL · Data Warehouse · Spark · BI Reporting

M. tuberculosis WGS Variant Analysis Workflow (Galaxy)

Built an end-to-end WGS workflow for Mycobacterium tuberculosis drug resistance analysis, standardising variant detection and interpretation across QC, alignment, annotation, and read-level validation.

Problem: Resistance interpretation from WGS data requires a long multi-step workflow with multiple failure points and high risk of inconsistent analysis.
Action: Built and executed a Galaxy-based workflow covering QC, trimming, alignment, coverage analysis, variant calling, annotation, and IGV-based validation.
Impact: Produced a structured and reproducible resistance analysis workflow supporting clearer interpretation of clinically relevant loci.

Figure: Galaxy workflow for M. tuberculosis WGS variant analysis

Bioinformatics · WGS · Galaxy · BWA-MEM2 · bcftools · SnpEff · IGV

RNA-seq Pipeline (Nextflow + Docker)

Eliminated inconsistent and error-prone RNA-seq analyses by building a containerised workflow that standardises processing from raw FASTQ to differential expression outputs.

Problem: RNA-seq analyses are often manually assembled, environment-dependent, and difficult to reproduce across machines.
Action: Built a containerised Nextflow pipeline integrating FastQC, Cutadapt, STAR, featureCounts, MultiQC, and DESeq2 into a one-command workflow.
Impact: Reduced setup friction, improved reproducibility, and produced standardised outputs suitable for scalable downstream expression analysis.

Proof run: Nextflow test execution (Docker profile)

Bioinformatics · Pipelines & Automation · Nextflow · Docker · RNA-seq · STAR · DESeq2

Additional Projects

Deep Learning Experiments (Keras vs PyTorch, CNN, ViT)

Comparative deep learning experiments using Keras and PyTorch across CNNs, Vision Transformers, and transfer learning for image classification tasks.

Objective: Compare deep learning approaches and frameworks across multiple computer vision tasks.
Approach: Implemented CNNs, transfer learning models, and Vision Transformers in both Keras and PyTorch, evaluating performance and training behaviour.
Outcome: Demonstrated differences in performance, flexibility, and workflow between frameworks and architectures.

Deep Learning · PyTorch · Keras · CNN · Vision Transformer · Transfer Learning · Computer Vision

Customer Segmentation (Unsupervised Learning)

Segmented 7,043 telecom customers using K-Means, hierarchical clustering, and DBSCAN, identifying a high-risk segment with ~42% churn compared to ~14% in the most stable group.

Objective: Identify meaningful customer segments to enable targeted retention strategies instead of treating churn as a uniform problem.
Approach: Applied clustering methods on standardised features and selected the final model using silhouette score.
Outcome: Produced interpretable segments that can guide retention campaigns, onboarding improvements, and pricing strategies.
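Selecting a clustering by silhouette score, as described in the Approach, can be sketched as below. This is a standard-library illustration under stated assumptions: the points are hypothetical two-feature toy data, and the candidate labelings are written by hand rather than produced by K-Means, hierarchical clustering, or DBSCAN. In a scikit-learn workflow the same metric is available as `sklearn.metrics.silhouette_score`.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical standardised features: three well-separated groups of customers.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
    (0.0, 5.0), (0.1, 5.2), (0.2, 5.1),
]

# Hand-written candidate labelings standing in for fitted clustering models.
candidates = {
    "k=2": [0, 0, 0, 1, 1, 1, 1, 1, 1],
    "k=3": [0, 0, 0, 1, 1, 1, 2, 2, 2],
}

scores = {name: silhouette_score(points, labels) for name, labels in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, {name: round(score, 2) for name, score in scores.items()})
```

The silhouette coefficient rewards clusters that are tight internally and far from their nearest neighbour, so the labeling that merges two distinct groups scores lower than the one that keeps them apart.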

Figure: PCA projection of customer clusters (K-Means, k=3)

Machine Learning · Unsupervised Learning · Clustering · K-Means · Scikit-learn · Pandas · Customer Segmentation

Bellabeat - Wearable Health Data Analysis

Transformed fragmented wearable data into a structured analytics dataset to uncover behavioural patterns across activity, sleep, heart rate, and weight for 35 users.

Problem: Raw Fitbit data was split across multiple files and not directly usable for behavioural analysis or product-facing insight generation.
Action: Cleaned, merged, and summarised wearable datasets using R, dplyr, and SQLite, then created a Tableau-ready user summary with 15 behavioural and wellness variables.
Impact: Enabled clearer identification of activity, sleep, and health trends to support data-driven product and marketing recommendations.

View interactive Tableau dashboard ↗

Data Analytics · R · SQL · dplyr · Tableau · Wearable Data

Google Fiber - Contact Centre Analytics & BI Dashboard

Identified repeat-contact patterns across 85,179 customer interactions by consolidating fragmented regional datasets into a unified analytics view, enabling clearer visibility into service inefficiencies.

Problem: Customer contact data was split across regional datasets, making it difficult to understand repeat-contact behaviour and pinpoint service pain points.
Action: Consolidated 3 datasets into a unified analysis table covering 1,350 records and 85,179 contact events across 5 contact types, then built a Tableau dashboard to surface repeat-contact patterns and operational friction.
Impact: Turned fragmented customer-support data into a decision-ready BI view that can support targeted service improvements and reduction of avoidable support load.

View interactive Tableau dashboard ↗

Business Intelligence · Customer Analytics · Tableau · SQL · KPI Design · Telecom Analytics

Salifort Motors - Employee Attrition Prediction

Built predictive models on 14,999 employee records to identify the drivers of 23.8% turnover and translate model outputs into retention-focused recommendations.

Problem: The business faced substantial employee turnover, but the main drivers of attrition were not clearly understood.
Action: Performed exploratory analysis, engineered features, and trained classification models using Python, scikit-learn, and XGBoost to analyse attrition patterns across 10 departments.
Impact: Produced evidence-based recommendations to support workforce retention strategy and prioritise the factors most associated with employee loss.

Data Science · Machine Learning · Python · scikit-learn · XGBoost · HR Analytics

TikTok - Social Media Engagement Analysis

Analysed 19,382 posts to uncover highly skewed engagement patterns and identify content features associated with stronger performance.

Problem: Engagement was unevenly distributed, making it difficult to understand which content characteristics were linked to stronger performance.
Action: Used Python, pandas, and visual analytics to explore engagement metrics, inspect distribution patterns, and examine relationships between content features and post performance.
Impact: Generated evidence-based recommendations to support content optimisation and stronger audience engagement strategy.

Data Analytics · Exploratory Data Analysis · Python · Pandas · Matplotlib · Plotly

AWS Managed Services - Cloud Migration Architecture

Designed a high-level AWS architecture for migrating on-premises workloads to a cloud-native, fully managed solution, with an emphasis on scalability, fault tolerance, and operational efficiency.

Objective: Migrate two on-prem workloads, a three-tier web application and a Hadoop-based analytics environment, into a modern AWS environment with managed services.
Approach: Designed an end-to-end cloud solution using AWS managed services including CloudFront, S3, ECS on Fargate, ALB, Aurora MySQL, ElastiCache, SQS, EMR, Glue, Athena, Redshift, and QuickSight.
Outcome: Produced a decoupled, fault-tolerant, multi-AZ architecture that modernises both workloads while reducing operational overhead through managed services.

Figure: AWS Architecture Diagram - Migration Solution

Cloud Architecture · AWS · Data Infrastructure · ECS Fargate · Aurora MySQL · Redshift

Experience

UCL Hospitals

Sep 2021 – Dec 2025

London, UK

Specialist Biomedical Scientist · Stem Cell Laboratory

Delivered processing and cryopreservation of PBSCs, bone marrow, DLI, and CD34+ enriched products in a regulated clinical environment.
Generated flow cytometry data on CD3+ and CD34+ populations for time-sensitive clinical decision-making.
Supported 11 active clinical and ATMP trials, including CAR-T therapies, with responsibility for traceability, documentation quality, and audit-ready data handling in a high-stakes regulated clinical setting.
Designed Python/Excel tracking tools to replace manual reconciliation, reducing inventory errors by ~30% and saving ~10 hours/week.
Led digitisation of SOPs and QA documentation, standardising data handling across workflows.

CooperGenomics

Jul 2019 – Sep 2021

London, UK

Laboratory Scientist · Clinical Genomics

Processed embryo samples for PGT-A, PGT-SR, and PGT-M testing within a high-throughput clinical genomics pipeline.
Delivered NGS library preparation and QC across 96–192 samples/run; produced 500+ clinical reports per week.
Programmed and validated Mosquito HV and Dragonfly liquid handlers, improving workflow scalability and reproducibility.
Contributed to SOP writing and review, strengthening ISO-compliant laboratory practice.

Leicester Royal Infirmary

Dec 2018 – Jun 2019

Leicester, UK

Biomedical Laboratory Assistant · Cytology

Managed sample reception and prepared specimens for Papanicolaou staining.
Maintained reagents and ensured sample integrity end-to-end.

Education

2014 – 2017

BSc, Biomedical Science

University of Catania

Catania, Italy

Cytotoxicity assays using SIRC, ARPE-19, and HRPE cells

2021 – 2023

MSc, Cell & Gene Therapy

University College London

London, UK

Expansion and Preservation of Haematopoietic Potential in Human Amniotic Fluid Stem Cells for Therapeutic Applications

Sep 2025 – Present

MSc, Bioinformatics

Atlantic Technological University

Letterkenny, Ireland · Remote

Planned dissertation: Two-sample Mendelian randomisation and Bayesian colocalisation for causal inference using GWAS and eQTL data

Certifications

IBM

Machine Learning · AI Engineering · Data Engineering

Google

Data Analytics · Advanced Data Analytics · IT Automation with Python · Project Management · Business Intelligence

Google Cloud

Architecting with Google Kubernetes Engine

Amazon Web Services (AWS)

Cloud Practitioner Essentials · Cloud Solutions Architect

Johns Hopkins University

Genomic Data Science Specialization

SAS

SAS Programming 1: Essentials · SAS Programming 2: Data Manipulation Techniques

Wellcome

Bioinformatics for Biologists: Linux, Bash, R · Analysing Genomics Datasets

Coursera

Access Bioinformatics Databases with Biopython

freeCodeCamp

Data Analysis with Python · Relational Databases · Scientific Computing

Le Wagon

Data Visualization with Tableau

DE<code>LIFE

Genomes, Networks & Pathways · Data Science & Machine Learning

Contact

Let's Connect

Open to opportunities in bioinformatics, data science, data pipelines, and automation. Based in London and remote-friendly.

Bioinformatics CV · Data & Analytics CV