RyanFabrick/Multiple-Linear-Regression-Model-Diamond-Project-

Diamond Price Prediction Project

This regression analysis project predicts diamond prices from a Kaggle dataset of 53,943 observations. It builds and evaluates linear regression models using carat, cut, color, clarity, and physical dimensions as predictors, and covers descriptive statistics, simple linear regression, a log-log transformation, model selection, and multicollinearity diagnostics.

Why Did I Build This?

I built this for an upper-division Linear Regression elective during my third year at UCSB as a Statistics and Data Science major. It was the final project and received full credit for code accuracy, code justifications, result interpretations, and overall analysis. The course went into depth on the mathematical theory and proofs behind different linear regression models. The coding portion was taught in R, so I completed the project in R.

Results

| Model | Adjusted R² |
| --- | --- |
| SLR: price ~ carat | 0.8563 |
| Log-Log SLR: log(price) ~ log(carat) | 0.9372 |
| Full MLR (all variables) | 0.9833 |
| Final Model (log_carat + cut + color + clarity) | 0.9829 |

Final Prediction — 0.7 carat, Ideal cut, G color, VS2 clarity:

  • Fitted price: $2,777.41
  • 95% CI: ($2,727.04, $2,828.71)
  • 95% PI: ($2,140.57, $3,603.72)
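
The reported intervals are symmetric around the fitted price on the log scale, which is what you get when a log-price model's intervals are back-transformed with `exp()`. The sketch below shows one way such a prediction could be produced. It is not the project's actual code: it assumes the built-in `ggplot2::diamonds` data (53,940 rows) as a stand-in for the Kaggle file, and the object names (`d`, `fit`, `new_diamond`, `conf_int`, `pred_int`) are hypothetical. Its numbers will differ from those above.

```r
library(ggplot2)  # used only for the built-in `diamonds` data (stand-in, assumption)

# Treat the quality grades as plain (unordered) factors so lm() uses dummy coding.
d <- as.data.frame(diamonds)
d$cut       <- factor(d$cut, ordered = FALSE)
d$color     <- factor(d$color, ordered = FALSE)
d$clarity   <- factor(d$clarity, ordered = FALSE)
d$log_price <- log(d$price)
d$log_carat <- log(d$carat)

# Final model form from the Results table, with log(price) assumed as the response.
fit <- lm(log_price ~ log_carat + cut + color + clarity, data = d)

# The diamond to predict: 0.7 carat, Ideal cut, G color, VS2 clarity.
new_diamond <- data.frame(
  log_carat = log(0.7),
  cut       = factor("Ideal", levels = levels(d$cut)),
  color     = factor("G",     levels = levels(d$color)),
  clarity   = factor("VS2",   levels = levels(d$clarity))
)

# Intervals are computed on the log scale, then back-transformed to dollars.
conf_int <- exp(predict(fit, newdata = new_diamond,
                        interval = "confidence", level = 0.95))
pred_int <- exp(predict(fit, newdata = new_diamond,
                        interval = "prediction", level = 0.95))

print(round(conf_int, 2))  # columns: fit, lwr, upr (mean response)
print(round(pred_int, 2))  # columns: fit, lwr, upr (single new diamond)
```

The prediction interval is wider than the confidence interval because it accounts for the variability of an individual diamond's price as well as the uncertainty in the estimated mean.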

Outline & Analysis

For in-depth explanations, code, reasoning, and justification, see the main file. A brief outline of the analysis follows. For the full rendered PDF report, scroll down or click here.

Part 1 — Data Description

  • Random sample of 2,000 diamonds from the full dataset
  • Histograms for continuous variables (carat, depth, table, price, x, y, z)
  • Bar plots for categorical variables (cut, color, clarity)
  • Correlation matrix revealing a strong carat-price correlation (r = 0.93) and multicollinearity among x, y, and z
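
As a rough illustration of the sampling and correlation steps above, here is a minimal sketch. It assumes the built-in `ggplot2::diamonds` data as a stand-in for the Kaggle file. The seed and the names `diamonds_sample`, `num_vars`, and `cor_mat` are hypothetical, so the exact correlations will not match the report.

```r
library(ggplot2)  # used only for the built-in `diamonds` data (stand-in, assumption)

set.seed(123)  # hypothetical seed; the project's seed is not stated
d <- as.data.frame(diamonds)

# Draw a simple random sample of 2,000 rows without replacement.
diamonds_sample <- d[sample(nrow(d), 2000), ]

# Pearson correlations among the continuous variables.
num_vars <- c("carat", "depth", "table", "price", "x", "y", "z")
cor_mat  <- cor(diamonds_sample[, num_vars])

print(round(cor_mat["price", ], 2))                        # price vs. each variable
print(round(cor_mat[c("x", "y", "z"), c("x", "y", "z")], 2))  # dimension block

# Optional heatmap, drawn only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "color")
}
```

Large off-diagonal values in the x, y, z block are an early warning that these predictors carry nearly the same information.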

Part 2 — Simple Linear Regression

  • Baseline SLR: price ~ carat
  • Assumption diagnostics (residuals vs. fitted, QQ plot, Scale-Location, leverage)
  • Log-log transformation to resolve non-linearity and heteroscedasticity
  • Sequential variable addition tracking adjusted R² improvement
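
The steps above could be sketched as follows. This is an illustration under assumptions, not the project's code: it uses the built-in `ggplot2::diamonds` data as a stand-in, the order in which variables are added (`cut`, `color`, `clarity`) is a guess, and all object names are hypothetical. Note that adjusted R² values for `price` and `log(price)` are measured on different response scales, so the comparison is indicative only.

```r
library(ggplot2)  # used only for the built-in `diamonds` data (stand-in, assumption)

d <- as.data.frame(diamonds)

# Baseline and log-log simple linear regressions.
slr     <- lm(price ~ carat, data = d)
log_slr <- lm(log(price) ~ log(carat), data = d)

adj_r2 <- c(
  slr     = summary(slr)$adj.r.squared,
  log_log = summary(log_slr)$adj.r.squared
)
print(round(adj_r2, 4))

# Standard diagnostic plots for an lm fit: residuals vs. fitted, normal QQ,
# scale-location, and residuals vs. leverage.
par(mfrow = c(2, 2))
plot(log_slr)
par(mfrow = c(1, 1))

# Add one categorical predictor at a time and track adjusted R^2.
model <- log_slr
for (v in c("cut", "color", "clarity")) {  # order is an assumption
  model <- update(model, as.formula(paste(". ~ . +", v)))
  cat(sprintf("+ %-8s adj R^2 = %.4f\n", v, summary(model)$adj.r.squared))
}
```

In the log-log model the slope on `log(carat)` reads as an elasticity: the approximate percent change in price for a one percent change in carat.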

Part 3 — Multiple Linear Regression

  • Backward AIC, Backward BIC, and Stepwise AIC model selection
  • VIF diagnostics (x: 673.23, y: 668.20 — severe multicollinearity confirmed)
  • Final model removes x, y, z; all VIF values below 1.15
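
A minimal sketch of the selection and VIF steps, again assuming the built-in `ggplot2::diamonds` data and a hypothetical seeded sample of 2,000 rows. The exact form of the full model (log price as the response, `log_carat` instead of `carat`) is an assumption, and the resulting VIF values will differ from those reported above.

```r
library(ggplot2)  # used only for the built-in `diamonds` data (stand-in, assumption)
library(car)      # vif()

set.seed(123)  # hypothetical seed
d <- as.data.frame(diamonds)
d <- d[sample(nrow(d), 2000), ]
d$cut       <- factor(d$cut, ordered = FALSE)
d$color     <- factor(d$color, ordered = FALSE)
d$clarity   <- factor(d$clarity, ordered = FALSE)
d$log_price <- log(d$price)
d$log_carat <- log(d$carat)

# Full model with every predictor (assumed form).
full <- lm(log_price ~ log_carat + cut + color + clarity +
             depth + table + x + y + z, data = d)

# step() uses penalty k = 2 (AIC) by default; k = log(n) gives BIC.
back_aic <- step(full, direction = "backward", trace = 0)
back_bic <- step(full, direction = "backward", k = log(nrow(d)), trace = 0)
both_aic <- step(full, direction = "both", trace = 0)
print(formula(back_aic))
print(formula(back_bic))
print(formula(both_aic))

# With multi-level factors, car::vif() returns generalized VIFs (GVIF).
vif_full <- vif(full)
print(round(vif_full, 2))

# Refit without the physical dimensions and recheck.
final     <- lm(log_price ~ log_carat + cut + color + clarity, data = d)
vif_final <- vif(final)
print(round(vif_final, 2))
```

Information criteria alone will not necessarily drop collinear predictors, which is why the VIF check is run as a separate step before settling on the final model.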

Rendered PDF

To view the PDF report rendered from the main file, click here.

Dependencies

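library(ggplot2)    # histograms, bar plots, scatter plots
library(GGally)     # ggplot2 extension (e.g. pairwise plot matrices)
library(dplyr)      # data manipulation
library(patchwork)  # combining multiple ggplots into one figure
library(corrplot)   # correlation matrix heatmap
library(car)        # vif() for multicollinearity diagnostics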

License

© 2026 Ryan Fabrick. All rights reserved. This project may not be reused, adapted, or submitted for academic coursework without explicit written permission from the author.

Author

Ryan Fabrick

Acknowledgements & References

  • Kaggle Dataset — Diamonds Prices 2022, the source dataset containing 53,943 diamond observations with price, carat, cut, color, clarity, and physical dimension attributes
  • R Project — The statistical computing language used to build, evaluate, and visualize all regression models in this project
  • R Markdown — The authoring framework used to combine R code, output, and written analysis into a single reproducible PDF report
  • ggplot2 — Tidyverse package used for all data visualizations including histograms, bar plots, and scatter plots
  • corrplot — R package used to generate the correlation matrix heatmap for visualizing relationships between numerical variables
  • car — R package used for Variance Inflation Factor (VIF) diagnostics to detect and address multicollinearity

Built with ❤️ for UCSB

This project demonstrates my interest in machine learning, predictive modeling, and statistics. It received full credit for an undergraduate course in Linear Regression Models.

About

Predicting diamond prices with multiple linear regression in R using log-log transformation, AIC/BIC model selection, and VIF diagnostics on 53,943 observations
