Computational Proteomics

This module introduces computational proteomics using Python, with a focus on the analysis of mass spectrometry-based quantitative proteomics data. The module runs across three four-hour sessions, each built around a Jupyter notebook. You will work in groups throughout. By the end you should be able to load and quality-check a proteomics dataset, build a protein-level abundance matrix from raw search outputs, perform statistical differential abundance analysis, and communicate findings in the form of a short scientific report.

Background: Proteomics and Mass Spectrometry

What is proteomics?

Proteomics is the large-scale study of the proteome: the full set of proteins expressed in a biological system at a given time. Unlike the genome, which is largely static, the proteome is highly dynamic: it changes in response to environmental signals, disease states, drug treatments, and developmental stage. Measuring the proteome therefore gives a direct readout of cellular activity that cannot be fully inferred from DNA or RNA alone.

Mass spectrometry (MS) is the dominant technology for proteomics. In a typical bottom-up proteomics experiment, proteins are first digested into shorter peptides using a protease (most commonly trypsin, which cleaves after lysine and arginine residues). The resulting peptide mixture is then separated by liquid chromatography (LC) and injected into the mass spectrometer, which measures the mass-to-charge ratio (m/z) of each peptide and its fragment ions. Database searching against a reference proteome assigns peptide sequences to the observed spectra, and proteins are inferred from the identified peptides.
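The digestion step can be illustrated with a minimal in silico sketch. It assumes the commonly used rule that trypsin cleaves after K or R except when the next residue is proline; the function name `tryptic_digest` and the toy sequence are hypothetical and not taken from the notebooks.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from cleaving after K or R, but not before P."""
    # Zero-width split points: after K/R when the next residue is not proline.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` extra neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


toy_protein = "MKWVTFISLLRPGAKDEHR"  # made-up sequence for illustration
peptides = tryptic_digest(toy_protein)
print(peptides)
```

Note that the R followed by P in the toy sequence is not cleaved, so the middle peptide spans that site. Search engines apply the same kind of rule to the reference proteome to generate candidate peptides.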

The data hierarchy: PSM → Peptide → Protein

A proteomics search engine produces results at three nested levels:

PSM (Peptide-Spectrum Match): The most granular level. Each PSM is a single match between one observed MS2 spectrum and one peptide sequence. The same peptide may generate many PSMs across different scans or fractions.
Peptide (Peptide Group): PSMs are collapsed to unique peptide sequences. Quantitative values are summarised across PSMs belonging to the same peptide.
Protein (Protein Group): Peptides are mapped to proteins. Because peptides can be shared between homologous proteins, proteins are often reported as groups. Protein abundance is typically summarised from the most intense unique peptides.

Working at the PSM level gives the highest granularity and allows the most rigorous quality filtering, which is why Block 1 begins there.
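The three levels can be sketched as two successive aggregations. This is a plain-Python illustration on made-up values; the field names, the choice of summing PSMs, and the top-3 peptide rule are assumptions, and the notebooks may aggregate differently (for example with a pandas `groupby`).

```python
from collections import defaultdict

# Toy PSM table: one row per spectrum match (values are invented).
psms = [
    {"scan": 1, "peptide": "AAK", "protein": "P1", "intensity": 100},
    {"scan": 2, "peptide": "AAK", "protein": "P1", "intensity": 150},
    {"scan": 3, "peptide": "CDR", "protein": "P1", "intensity": 80},
    {"scan": 4, "peptide": "EFK", "protein": "P2", "intensity": 40},
]

# PSM -> peptide: sum the intensities of all PSMs for the same peptide.
peptide_totals = defaultdict(float)
for psm in psms:
    peptide_totals[(psm["protein"], psm["peptide"])] += psm["intensity"]


def protein_rollup(peptide_totals, top_n=3):
    """Peptide -> protein: sum the top_n most intense peptides per protein."""
    by_protein = defaultdict(list)
    for (protein, _peptide), total in peptide_totals.items():
        by_protein[protein].append(total)
    return {
        protein: sum(sorted(values, reverse=True)[:top_n])
        for protein, values in by_protein.items()
    }


protein_abundance = protein_rollup(peptide_totals)
print(protein_abundance)
```

In a TMT experiment the same rollup is applied to each reporter channel separately, giving one abundance column per sample.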

Tandem mass tags (TMT)

Tandem mass tag (TMT) reagents are isobaric chemical labels attached to peptide N-termini and lysine side chains before MS analysis. Multiple samples (up to 18) can each be labelled with a different TMT reagent; the labelled peptides are indistinguishable by mass in MS1 but release unique reporter ions at defined m/z values during fragmentation. This allows the samples to be pooled and analysed together in the same LC-MS runs, which largely removes run-to-run variability between samples.

The reporter ion intensities (one per TMT channel per PSM) are the raw quantitative readout. They must be filtered for quality, normalised to remove systematic biases, and aggregated from PSM level up to protein level.
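As a rough illustration of the normalisation step, the sketch below rescales each channel so that all channels have the same total intensity, then takes log2. Total-intensity scaling is only one possible choice and is an assumption here; the channel labels and values are invented.

```python
import math

# Toy reporter intensities: channel -> one value per PSM (invented numbers).
reporter = {
    "126": [100.0, 200.0, 300.0],
    "127N": [200.0, 400.0, 600.0],  # same pattern, but loaded at twice the amount
}

# Scale every channel to the mean channel total to remove loading differences.
totals = {channel: sum(values) for channel, values in reporter.items()}
target = sum(totals.values()) / len(totals)
normalised = {
    channel: [x * target / totals[channel] for x in values]
    for channel, values in reporter.items()
}

# log2 makes fold changes symmetric; zeros or missing values must be
# handled before this step (none are present in the toy data).
log2_values = {
    channel: [math.log2(x) for x in values]
    for channel, values in normalised.items()
}
print(normalised)
```

After scaling, the two toy channels become identical, which is what is expected when the only difference between them is the amount of sample loaded.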

False discovery rate and confidence filtering

Database searches always produce some incorrect matches (false positives). The target-decoy approach controls this by searching a decoy database (for example, reversed sequences) alongside the real database. Because decoy hits are wrong by construction, the ratio of decoy hits to target hits passing a score threshold estimates the false discovery rate (FDR) among the accepted target matches. A 1% FDR threshold means approximately 1 in 100 accepted PSMs is expected to be incorrect. Proteome Discoverer reports per-PSM confidence levels (High / Medium / Low) derived from this procedure; only High confidence PSMs are retained for quantitative analysis.
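The target-decoy estimate can be sketched in a few lines. The simple decoys/targets estimator below is one common form and is used here as an assumption; it is not necessarily the exact procedure Proteome Discoverer applies. The scores and the function name `fdr_at_threshold` are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among PSMs scoring >= threshold."""
    accepted = [(score, is_decoy) for score, is_decoy in psms if score >= threshold]
    decoys = sum(1 for _score, is_decoy in accepted if is_decoy)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


# Toy search results: (score, is_decoy). Higher score means a better match.
toy_psms = [
    (95, False), (90, False), (85, False), (80, False),
    (75, True), (70, False), (60, True), (50, False),
]

for threshold in (80, 70, 50):
    print(threshold, round(fdr_at_threshold(toy_psms, threshold), 3))
```

Lowering the threshold accepts more PSMs but lets more decoys through, so the estimated FDR rises. In practice the threshold is chosen as the most permissive score that keeps the estimate at or below the target (for example 1%).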

What can proteomics tell us biologically?

Changes in protein abundance between conditions reflect underlying biological processes such as:

Metabolic adaptation
Stress responses
Antibiotic resistance mechanisms
Regulatory pathway activation

In this module, you will analyse a dataset comparing mecillinam-resistant E. coli mutants to a parental strain, allowing you to explore how global protein abundance differs between resistant mutants and the wild type.

The Dataset: PXD007647

Biological background

Resistance to the antibiotic mecillinam in E. coli can be conferred by mutations in over 100 different genes. Mecillinam targets penicillin-binding protein 2 (PBP2), essential for maintaining bacterial rod shape. Resistance mutations bypass this requirement through diverse mechanisms — making this an excellent system for studying how different genetic changes reshape the proteome.

In this study, global protein expression levels were compared between a panel of mecillinam-resistant mutants derived from E. coli K12 MG1655 and the parental wild-type strain. Ten samples were labelled with TMT-10plex reagents and analysed by LC-MS/MS with MS2 quantification across 8 high-pH reversed-phase (bRP-LC) fractions.

Files

The data files are located in the data/ folder of this repository. See data/README.md for a full description of each file and instructions on how they are accessed in Colab vs. local Jupyter.

Module Structure

| Block | Topic | Key concepts & tools |
| --- | --- | --- |
| 1 | Data Processing | PSM loading, quality filtering, contaminant removal, protein rollup (`pandas`, `numpy`) |
| 2 | Data Analysis | Normalisation, log2 transform, missing value handling, differential abundance, statistical testing (`scipy`, `statsmodels`) |
| 3 | Data Visualization | Volcano plots, PCA, correlation heatmap, clustering heatmap (`matplotlib`, `seaborn`, `scikit-learn`) |
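Statistical testing in Block 2 involves one test per protein, so the p-values need multiple testing correction. The sketch below shows the Benjamini-Hochberg procedure in plain Python on invented p-values; it is an illustration of the idea, not the notebook's code. With `statsmodels`, the same adjustment is available through `multipletests` with `method="fdr_bh"`.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p_values = [0.01, 0.04, 0.03, 0.20]  # invented, one per protein
adjusted = benjamini_hochberg(toy_p_values)
print([round(q, 4) for q in adjusted])
```

Proteins whose adjusted p-value falls below the chosen cut-off (commonly 0.05) are reported as significantly differentially abundant, usually together with their log2 fold change.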

Each notebook is self-contained: it downloads or loads the data it needs at the top so you are not dependent on having run the previous notebook in the same session.

Running the Notebooks

Option A — Google Colab (recommended, no installation required)

Go to https://colab.research.google.com
Click File → Open notebook → GitHub, paste the repository URL, and open the desired notebook
Run the install cell at the top of the notebook (Step 1)
Go to Runtime → Restart session
Run the data download cell (Step 2) — this downloads the required data files from GitHub automatically
Continue running cells in order from top to bottom

Important: Colab sessions are temporary. If your session disconnects, re-run the install cell and the data download cell before continuing. Data files and intermediate outputs do not persist — see data/README.md for instructions on saving and re-uploading intermediate files between sessions.

Option B — Local Jupyter

Clone the repository:

git clone https://github.com/UjalaBashir/Bio513_proteomics.git
cd Bio513_proteomics

Install dependencies (once per environment):

pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn pyteomics

Launch Jupyter:
```
jupyter notebook
```
Open block1_data_processing.ipynb and run cells in order

The data/ folder is included in the repository, so no manual file downloading is needed. Intermediate files generated by each block are saved automatically to the working directory and loaded by the next block.

Repository Structure

Bio513_proteomics/
├── data/
│   ├── README.md                                          ← Data file descriptions
│   ├── PXD007647_Reproc_TMT-set-2_8fracs_PSMs.txt        ← Block 1 input
│   ├── PXD007647_Reproc_TMT-set-2_8fracs_PeptideGroups.txt
│   ├── PXD007647_Reproc_TMT-set-2_8fracs_ProteinGroups.txt ← Blocks 1–3
│   ├── PXD007647_Reproc_TMT-set-2_8fracs_ResultStatistics.txt
│   └── PXD007647_Reproc_TMT-set-2_8fracs_InputFiles.txt
├── block1_data_processing.ipynb
├── block2_data_analysis.ipynb
├── block3_data_visualization.ipynb
└── README.md

Group Work

You will work in groups and all group members should understand every step. Discuss each exercise together before writing your answer. The notebooks should be submitted with all cells executed and outputs visible. One notebook per group per block is sufficient, but every group member should be able to explain any cell.

Report Guidelines

After completing all three blocks, each group submits a single written report (approximately 2000–3000 words) alongside the three completed notebooks (all cells run, outputs visible).

Report structure:

Introduction (~400 words). Briefly introduce the biological system and the scientific question. What is mecillinam and how does it work? Why is antibiotic resistance in E. coli clinically relevant? What does proteomics add that genomics alone cannot? Cite the source publication (Thulin and Andersson) and at least one additional reference.

Methods (~400 words). Describe the dataset (number of samples, organism, TMT labelling strategy), the preprocessing steps applied at PSM level (confidence filtering, quality thresholds, normalisation; explain briefly why each step is necessary), the method used to build the protein-level abundance matrix, and the statistical approach for differential abundance testing including multiple testing correction.

Results (~800 words). Present findings from all three blocks. Include figures from notebook outputs with captions. Report specific proteins that are significantly differentially abundant between mutant and parental strains, with fold changes, corrected p-values, and effect sizes. Discuss the overall proteome-level picture revealed by PCA and the heatmap.

Discussion (~600 words). Interpret your results biologically. Which cellular pathways or processes are most affected in the resistant mutants? Are the changes consistent with known mechanisms of mecillinam resistance? What are the limitations of this analysis — in terms of the data, the statistical approach, or the biological interpretation?

References. Include the PRIDE dataset source publication, any tools or packages cited, and all literature referenced in the report. It is recommended to use Zotero as a reference manager.

Assessment Criteria

Your group will be assessed on the accuracy and completeness of the notebook outputs, the quality of statistical interpretation (not just reporting numbers but explaining what they mean), the biological reasoning in the discussion, and the clarity of scientific writing.
