Finish Exercise 3

This commit is contained in:
Ceres 2026-02-23 13:51:08 +00:00
parent 4c8a1d0cd0
commit 129f37d139
Signed by: ceres-sees-all
GPG key ID: 9814758436430045
8 changed files with 250 additions and 0 deletions

Exercise 3/.chktexrc Normal file

@@ -0,0 +1,4 @@
CmdLine
{
--nowarn 3 --nowarn 36
}

Binary file not shown (image, 119 KiB).

Binary file not shown (image, 61 KiB).

Binary file not shown (image, 27 KiB).

Binary file not shown (image, 89 KiB).


@@ -8,12 +8,25 @@ from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
"""
The section below establishes the constants used throughout the program:
the names of the columns in the csv, the units for each, all the materials tested and the different radii tested
"""
columns = ["Material", "Density", "Radius", "Mass", "Temperature", "Pressure", "Height", "Time"]
columnsNoMaterial = ["Density", "Radius", "Mass", "Temperature", "Pressure", "Height", "Time"]
units = ["", "kg/m^3", "m", "kg", "K", "Pa", "m", "s"]
materials = ["magnesium", "polycarbonate", "silica", "zinc_oxide", "silicon_carbide", "titanium", "iron"]
radii = [0.005, 0.01, 0.015, 0.02, 0.025]
"""
This function reads the csv file and imports it into a pandas dataframe, with the correct name for each column.
It then applies some corrections to the data: first it ensures that all data that should be numeric is numeric,
which converts any non-numeric entries to NaN so they can be filtered out.
The function then deletes any rows whose material is not in the 'materials' list, and converts any negative values to positive.
Finally, the function removes any rows containing NaN, then returns the cleaned dataframe.
"""
def getData(file):
columns = ["Material", "Density", "Radius", "Mass", "Temperature", "Pressure", "Height", "Time"]
data = pd.read_csv(file, sep=',', names=columns, skiprows=9, on_bad_lines='skip')
@@ -35,6 +48,11 @@ def getData(file):
data.dropna(inplace=True)
return data
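The cleaning steps the docstring describes can be sketched as a standalone function; the inline dataframe below is hypothetical stand-in data, not the real exercise3data.csv:

```python
import pandas as pd

columns = ["Material", "Density", "Radius", "Mass", "Temperature",
           "Pressure", "Height", "Time"]
materials = ["magnesium", "polycarbonate", "silica", "zinc_oxide",
             "silicon_carbide", "titanium", "iron"]

# Invented sample rows exercising each failure mode the cleaner handles.
raw = pd.DataFrame([
    ["iron", 7870, 0.01, 0.033, 293, 101325, 10.0, 1.43],
    ["unobtanium", 1000, 0.01, 0.004, 293, 101325, 10.0, 1.50],  # unknown material
    ["iron", 7870, 0.01, 0.033, 293, 101325, -10.0, 1.43],       # negative value
    ["iron", "oops", 0.01, 0.033, 293, 101325, 10.0, 1.43],      # non-numeric entry
], columns=columns)

def clean(data):
    # Coerce every non-Material column to numeric; bad entries become NaN.
    for col in columns[1:]:
        data[col] = pd.to_numeric(data[col], errors="coerce")
    # Keep only rows whose material is in the allowed list.
    data = data[data["Material"].isin(materials)].copy()
    # Negative readings are treated as sign errors, so take absolute values.
    data[columns[1:]] = data[columns[1:]].abs()
    # Drop any rows still containing NaN.
    return data.dropna()

cleaned = clean(raw)
print(len(cleaned))  # two valid iron rows survive
```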
"""
This function takes a column name and its units as input, then computes statistics for the column, including the minimum, maximum, mean and standard deviation,
and prints them with the relevant units.
"""
def columnStats(column, units):
min = df[column].min()
max = df[column].max()
@@ -52,6 +70,14 @@ df = getData('exercise3data.csv')
####Part 1
"""
This function performs all the operations for part 1.
First, it iterates through the columns, skipping Material, and uses the columnStats function above to output the statistics for each column.
Then, it iterates through the list of materials, and for each one filters the dataframe to just the rows with that material.
For each material, it then iterates through the list of radii, again filtering the dataframe to the rows with that radius, and plots the remaining rows.
Once every radius has been plotted, the plot is shown, with the correct labels, title and legend.
"""
def part1():
for i in range(len(columns)):
if columns[i] == "Material":
@@ -73,6 +99,14 @@ def part1():
####Part 2
"""
This function performs all the operations for part 2.
First, it removes the Material column from the dataframe, as this interferes with the subsequent operations.
It then uses the .corr() function to calculate the correlations between the various parameters and assigns the resulting matrix to a variable.
This matrix is then plotted, with the colour bounds set to -1 to 1, and the function iterates through each tile on the plot and labels it with the relevant value.
Finally, a colour-bar legend is added to the plot, the plot is given a title, and it is then shown.
"""
def part2():
dfNoMaterial = df.drop("Material", axis=1)
corrMatrix = dfNoMaterial.corr(method='pearson')
@@ -90,10 +124,34 @@ def part2():
fig.colorbar(im)
fig.tight_layout()
fig.suptitle("Correlation between each parameter")
plt.show()
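The correlation plot and tile-labelling loop described above can be sketched as follows; the random dataframe is an invented stand-in for the real data, with the column names taken from the script:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Stand-in numeric data (Material already dropped, as in part2).
rng = np.random.default_rng(0)
cols = ["Density", "Radius", "Mass", "Temperature", "Pressure", "Height", "Time"]
dfNoMaterial = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)

# Pearson correlation matrix, as computed by .corr() in the script.
corrMatrix = dfNoMaterial.corr(method="pearson")

fig, ax = plt.subplots()
# Colour bounds fixed to the full correlation range of -1 to 1.
im = ax.imshow(corrMatrix.values, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols, rotation=90)
ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)
# Label each tile with its correlation value.
for i in range(len(cols)):
    for j in range(len(cols)):
        ax.text(j, i, f"{corrMatrix.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im)
fig.tight_layout()
fig.suptitle("Correlation between each parameter")
fig.savefig("correlation_sketch.png")
```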
####Part 3
"""
This function performs all the operations for part 3.
First, filtered dataframes are made: one with only the features affecting drop time, and one with just the drop time.
A linear regression on these values is then calculated using the sklearn LinearRegression function, and the coefficients are printed with their relevant units.
The dataframe is then filtered to contain only a single material, iron.
A function is then defined that takes values for density, radius, mass, temperature, pressure and height as input, and uses the coefficients calculated by the linear fit
to calculate and return a value for fall time.
A function to predict the fall times using the linear fit is then defined.
It first filters the dataframe by radius, then plots the experimental, or 'true', data as a scatter plot.
The .predict function is then used, taking the dataframe of features as input, to plot the predicted drop times against fall distance.
The fitByMeans function is then used, passing the mean of each column and two values of drop height, and this data is used to plot a straight line of best fit.
The axes are given labels, and the plot is then shown.
The data is then split randomly into 90% training data and 10% test data.
Again, the LinearRegression function is used to calculate a linear regression, this time from the random sample of training data.
For each radius, the dataframe is first filtered, the true values of fall time are plotted, and their R^2 value is calculated.
The .predict function is again used to calculate the predicted fall times based on the training fit; these are plotted on the same graph, and their R^2 value is also calculated.
The R^2 values are then printed, and the plot is shown.
Finally, a function to plot the residuals between the true and predicted data is defined.
It finds the difference between each true value for time and its prediction, then plots these residuals against radius.
"""
def part3():
features = df[["Density", "Radius", "Mass", "Temperature", "Pressure", "Height"]]
targets = df["Time"]
@@ -158,6 +216,15 @@ def part3():
calcResiduals()
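The train/test comparison described above can be sketched on synthetic data; the feature names and 90/10 split follow the script, but the generated data and its coefficients are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Invented near-linear data standing in for the experimental csv.
rng = np.random.default_rng(1)
n = 200
features = pd.DataFrame({
    "Density": rng.uniform(1000, 8000, n),
    "Radius": rng.uniform(0.005, 0.025, n),
    "Height": rng.uniform(1, 50, n),
})
# Fall time rises with height and falls with radius, plus a little noise.
targets = (2 + 0.025 * features["Height"] - 10 * features["Radius"]
           + rng.normal(0, 0.05, n))

# 90% training / 10% test split, as in the script.
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.1, random_state=0)

# Fit on the training data only, then score the held-out test data.
reg = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, reg.predict(X_test))
print(f"R^2 on held-out data: {r2:.3f}")
```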
"""
This function performs all the operations for part 4.
First, the dataframe is again split into the columns for the features affecting drop time and the drop time itself.
The SGDRegressor function is then used to calculate a linear fit of the unscaled data, and its R^2 value is calculated and printed.
The data is then scaled using the StandardScaler function, and the scaled features are saved to a variable.
The linear regression is then calculated again, its R^2 value is calculated and printed, and the coefficients for each feature are listed along with their relevant units.
This process is then repeated using the huber loss function rather than the least-squares function.
"""
def part4():
reg = SGDRegressor()
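The scaled-versus-unscaled comparison the docstring describes can be sketched like this; the synthetic features are invented, while SGDRegressor, StandardScaler and the huber loss option match the script's imports:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Two invented features with wildly different magnitudes, as in the real data.
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.uniform(1000, 8000, 300),    # density-like feature, large values
    rng.uniform(0.005, 0.025, 300),  # radius-like feature, tiny values
])
y = 2 + 0.0003 * X[:, 0] - 50 * X[:, 1] + rng.normal(0, 0.01, 300)

# Unscaled fit: the mismatched magnitudes make SGD diverge or fit very poorly.
try:
    unscaled = SGDRegressor(random_state=0).fit(X, y)
    print("unscaled R^2:", unscaled.score(X, y))
except ValueError as err:
    print("unscaled fit diverged:", err)

# Scaled fit: StandardScaler gives each feature zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
scaled = SGDRegressor(random_state=0).fit(X_scaled, y)
print("scaled R^2:", scaled.score(X_scaled, y))

# Huber loss variant, as tried at the end of part 4.
huber = SGDRegressor(loss="huber", random_state=0).fit(X_scaled, y)
print("huber R^2:", huber.score(X_scaled, y))
```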

Exercise 3/main.pdf Normal file

Binary file not shown.

Exercise 3/main.tex Normal file

@@ -0,0 +1,179 @@
\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{autobreak}
\usepackage{lineno,hyperref}
\usepackage[table,x11names,dvipsnames]{xcolor}
\usepackage{authblk}
\usepackage{subcaption,booktabs}
\usepackage{graphicx}
\usepackage{multirow}
\usepackage{listings}
\usepackage{color}
\definecolor{blue}{rgb}{0.02,0.65,0.90}
\definecolor{dkgreen}{rgb}{0.25,0.63,0.17}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.53,0.22,0.94}
\lstset{frame=tb,
language=Python,
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=none,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
breaklines=true,
breakatwhitespace=true,
tabsize=3
}
\usepackage[nolist,nohyperlinks]{acronym}
\usepackage[superscript]{cite}
\usepackage{tabularx}
\usepackage{float}
\usepackage[group-separator={,}]{siunitx}
\usepackage{geometry}
\geometry{
a4paper,
papersize={210mm,279mm},
left=12.73mm,
top=20.3mm,
marginpar=3.53mm,
textheight=238.4mm,
right=12.73mm,
}
\setlength{\columnsep}{6.54mm}
%\linenumbers %%% Turn on line numbers here
\renewcommand{\familydefault}{\sfdefault}
\captionsetup[figure]{labelfont=bf,textfont=normalfont}
%\captionsetup[subfigure]{labelfont=bf,textfont=normalfont}
%%%% comment out the below for the other title option
\makeatletter
\def\@maketitle{
\raggedright\newpage
\noindent
\vspace{0cm}
\let\footnote\thanks{\hskip -0.4em \huge \textbf{{\@title}} \par}
\vskip 1.5em
{\large
\lineskip.5em
\begin{tabular}[t]{l}
\raggedright\@author\end{tabular}\par}
\vskip 1em
\@date\par
\vskip 1.5em
}
\makeatother
\begin{document}
\title{Exercise 3 Report}
\author[1]{Paddy Milner}
\affil[1]{Department of Physics, University of Bristol}
\renewcommand\Affilfont{\itshape\small}% chktex 6
\date{17.1.2026}
\maketitle
Word count: 1125
\begin{abstract}
In this exercise, a large amount of experimental data was manipulated and analysed in order to determine the correlations between various experimental properties, as well as to determine the strength of the linear relationship between fall height and fall time for spheres of different materials and sizes.
\end{abstract}
\section{Introduction}
In this exercise, a large set of experimental data was supplied for manipulation and analysis. We will first clean the data set, removing any incorrect or anomalous data points, followed by some initial analysis of the data. We will then determine the strength of the correlation between the different features of the experimental data, with a particular focus on how the various parameters affect the fall time. We will then calculate a linear regression using multiple methods, and test each method's effectiveness when compared both to the original data and to each other.
\section{Theory and Methods}
The sample data was first trimmed to remove any incorrect entries: lines with incorrect material names were removed, then lines containing non-numeric data, and finally any negative data was corrected, as all values should be positive given the units the data is given in.
Next, the data was filtered by material, and for each material a plot was made of fall time against height, with the data points being coloured based on radius.
In order to determine the correlation between each feature of the dataset, the pandas corr() function was used to generate a correlation matrix, which was then plotted, using a colour map to indicate the strength of the correlation between two features.
A linear regression was then calculated from the data set, consisting of a coefficient for each feature to give the falling time. This linear fit was first analysed visually by plotting both the true values for falling time and the values the linear regression predicted on the same axes, along with a line of best fit based on the predicted data. The fit was then analysed quantitatively by splitting the data set into a training set and a testing set. The linear fit was generated from the training data, both the test data and the model's predictions of the test data were plotted, and the $R^2$ values were compared to measure the accuracy of the linear regression. The residuals between the true data and the predicted data were then plotted against the radius.
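The goodness of each fit was quantified with the coefficient of determination, defined in the standard way (this is also the definition used by sklearn's score functions) as
\begin{equation*}
R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},
\end{equation*}
where $y_i$ are the measured fall times, $\hat{y}_i$ the model's predictions and $\bar{y}$ the mean of the measured values.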
Another linear regression model was then calculated, this time using the stochastic gradient descent method rather than the least-squares regression method, as this avoids matrix inversion, which can reduce compute time for data sets with a large number of parameters. The linear fit was first calculated from unscaled data, to confirm the necessity of scaling. The data was then scaled correctly and the linear regression was recalculated. This model was compared to the model obtained from the least-squares method, and finally the model was generated again using a different loss function to compare the differences.
\section{Results and discussion}
The cleaning functioned correctly, with all data being numeric and in a reasonable range, as confirmed by the calculated statistics. The plots of fall time against height appear as expected, with the data for each radius being mostly linear; an example is shown in figure~\ref{fig:timeVheightMag}.
\begin{figure}
\includegraphics[width=1\linewidth]{Images/TimeVHeightMag.png}
\caption{Plot of fall time against fall height for all magnesium data.}\label{fig:timeVheightMag}
\end{figure}
The correlation matrix, shown in figure~\ref{fig:correlation}, shows that pressure and temperature have a negligible effect on fall time; height has a positive correlation with time, which is to be expected; and density, radius and mass all have significant negative correlations with fall time, which is also intuitive, as a larger and therefore heavier sample will fall faster. The only other significant correlations are those of mass with density and radius, both of which are positive and are to be expected.
\begin{figure}
\includegraphics[width=1\linewidth]{Images/correlation.png}
\caption{A correlation matrix showing the strength and direction of correlation between the variables in the dataset.}\label{fig:correlation}
\end{figure}
After calculating the linear regression, the following coefficients were found for each of the dataset's features:
\begin{center}
\begin{tabular}{||c|c||}
\toprule
Feature & Coefficient \\ [0.5ex]
\midrule\midrule
Density & $-2.84\times10^{-3}$ \\
\midrule
Radius & $-1060$ \\
\midrule
Mass & $34.7$ \\
\midrule
Temperature & $-0.0347$ \\
\midrule
Height & $0.0251$ \\ [1ex]
\bottomrule
\end{tabular}
\end{center}
All of these values are to be expected, with temperature and pressure having small coefficients due to their weak correlation with fall time, radius and density having negative values due to their negative correlation, and height having a fairly large coefficient after taking into account the somewhat large mean value for height.
The data was then filtered to only contain one material, iron, and using our linear regression model the true values of time were plotted against the values the model predicted based off of the other features, shown in figure~\ref{fig:trueVpred}.
\begin{figure}
\includegraphics[width=1\linewidth]{Images/trueVpred.png}
\caption{A plot of the true fall time data and the values predicted by the linear regression model.}\label{fig:trueVpred}
\end{figure}
The predicted data is somewhat close to the true values, though it tends to predict a consistently lower fall time and follows a much stronger linear relationship. This could likely be improved by a larger sample set, or a more focused one, perhaps measuring only one material or radius, as there is some variance between them. The dataset was then split into training and test data, and the test data was plotted together with the predicted values from a linear regression calculated on the training data. The data was split by radius, and the $R^2$ value was calculated for each radius for both the true and predicted data. The average $R^2$ value for the true data was $0.812$, whereas the average value for the predicted data was $0.991$, again showing that the linear regression model has a tendency to predict a stronger linear relationship than the experimental data gives. The differences between the true and predicted values were recorded and plotted against radius, shown in figure~\ref{fig:residuals}.
\begin{figure}
\includegraphics[width=1\linewidth]{Images/residuals.png}
\caption{A plot of the difference between the true and predicted values against radius.}\label{fig:residuals}
\end{figure}
This shows that there is no strong correlation between model accuracy and radius. Though there is variance between the radii, it does not seem to follow any strong relationship.
A linear regression was then calculated using a stochastic gradient descent method. Initially, the model did not return good results, consistently giving an $R^2$ value on the order of $-1\times10^{35}$. However, after the dataset features were scaled, the model improved greatly, giving an $R^2$ value of $0.847$. From this, the coefficients of the scaled model were de-scaled, and compared to the least-squares method, which can be seen in the table below.
\begin{center}
\begin{tabular}{||c|c|c|c||}
\toprule
Feature & LS Coefficient & SGD Coefficient & Huber Coefficient\\ [0.5ex]
\midrule\midrule
Density & $-2.84\times10^{-3}$ & $-2.83\times10^{-3}$ & $-1.37\times10^{-1}$ \\
\midrule
Radius & $-1060$ & $-1056$ & $-361$ \\
\midrule
Mass & $34.7$ & $34.92$ & $-6.56$ \\
\midrule
Temperature & $-0.0347$ & $-0.0327$ & $-0.0288$ \\
\midrule
Height & $0.0251$ & $0.0251$ & $0.0205$ \\ [1ex]
\bottomrule
\end{tabular}
\end{center}
After scaling, the SGD model matches the least-squares model very well, suggesting SGD is a good alternative.
This method was retried using the Huber loss function, and the coefficients for this can also be seen in the table. The coefficients for the Huber loss function vary significantly when compared to the other two methods, suggesting it is not a good fit for this dataset.
\section{Conclusion}
Overall, the data shows clear correlation between several measured features of the dataset, with the largest factors influencing the fall time being height, mass, radius and density. The data follows a clear linear relationship, which is quantifiable using a number of approaches, all of which give results matching fairly well with the experimental data, though there is still room for this to be improved.
\end{document}