Model evaluation

Ways to measure how well the model is performing:
Perplexity
Human Eval (optional)
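
For anyone picking this up, here is a rough sketch of the perplexity calculation: the exponential of the average negative log-likelihood per token. It assumes we already have the natural-log probability the model assigned to each actual next token; the `perplexity` helper and the sample values are hypothetical and not taken from the feat/evaluation branch.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a sequence of tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    actual next token (hypothetical input format, not the repo's API).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


if __name__ == "__main__":
    # Made-up values: a model that always gives the correct token
    # probability 0.25 is as uncertain as a uniform 4-way guess,
    # so its perplexity comes out at about 4.
    sample = [math.log(0.25)] * 8
    print(perplexity(sample))
```

Lower is better: a perplexity of 1 would mean the model assigned probability 1 to every correct token.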

The work for this is on the feat/evaluation branch:
https://github.com/MichiganDataScienceTeam/F24-mini-copilot/tree/feat/evaluation

I will be gone this weekend; I'm just opening this issue so anyone can get started.