GLM Basics
From NITP Summer Course Wiki
Contents |
GLM Basics in MATLAB
The goal of this exercise is to help make the fundamentals of the general linear model more concrete, along with introducing everyone to the basics of programming a data simulation.
We will be doing this in MATLAB. Go ahead and start MATLAB now - there should be an orange icon on the dock at the bottom of your screens.
MATLAB is a commercial software package that is very popular among engineers and scientists (and expensive). If you have had some programming exposure to languages other than MATLAB, picking up the basics should be straightforward.
Amazingly Brief Introduction to Programming in MATLAB
The most novel aspect of the MATLAB language when compared to other scripting languages is that variables are all treated as arrays, or matrices (this is where it gets the name: MATrix LAB).
For example, if you typed:
x = 1
into the MATLAB command window, you would create a 1x1 array called x. Similarly, if you typed:
myString = 'Hello World!' (notice that strings use single quotes in MATLAB)
you would create a new variable called myString, that was an array (or vector in this case) 1x12 in size.
If you would like to confirm this, you can call the function whos, which will return a list of all the variables in memory, along with their dimensions and datatypes (e.g. integer, float, char). Functions like whos that take no arguments as input do not use parentheses when being called.
To make an array, you can use the square-bracket syntax:
y = [1,2,3,4]
contrast that to a new array created by separating the elements with semicolons rather than commas:
y2 = [5;6;7;8]
or as a mixture of the two:
y2 = [1,2;3,4]
You can also easily convert a row vector to a column vector by transposing it, using the single quote:
y2'
MATLAB can tell the difference between the beginning of a string and a transposition based on whether or not the quote character is right behind a variable.
To get to specific elements of a matrix, you can index them using parentheses:
myString(7)
You can also pull out more than one element all at once by using an array as an argument:
myString([11,5,1,12])
MATLAB has a shortcut for creating ranges of evenly spaced numbers, using colons:
every_five = 0:5:25
If you want to count one at a time, you can skip the middle number:
one_to_ten = 1:10
These automatic ranges can also be used to get subsets of a vector:
myString(4:8)
Because MATLAB is built around matrices, basic mathematical operations default to their matrix-algebra versions. For example, see the difference between:
y * y2
and:
y2 * y
If you instead want to calculate a pairwise multiplication rather than a matrix-multiplication, you would type a period before the multiplication sign, like so:
y .* y'
Useful MATLAB Functions
There are several built in functions that might be useful for this exercise.
First and foremost, you should know about the help and lookfor functions.
In simulating data for this exercise, we will need to add random noise to the synthetic dataset. We know that noise should be normally distributed about zero, with a standard deviation sigma, so having a function that would generate that noise for us would be useful. To find one, we can type:
lookfor('normal')
which will return a list of every function that MATLAB can currently call that has the string 'normal' somewhere in its description. Looking at the list, we see a function named RANDN that sounds promising, so to find out more about what it does and how to use it, we can type:
help randn
and the documentation explaining how to call the function and what it returns are displayed in the command window. The combination of 'help' and 'lookfor' will greatly facilitate your finding of functions on your own.
For this exercise, you may find some of the following functions useful (see help for details, and you may not need all of them depending on how your write your code):
- rand (generates arrays, vectors, and scalars of uniformly distributed random numbers)
- randperm (generates a permutation from 1 to N organized in a random order, which is useful for scrambling vectors)
- ones (makes an array or vector containing 1 in each cell)
- zeros (the same as ones, but with zeros)
- pinv (a pseudo-inverse function for inverting matrices - your design matrix will likely find this useful)
- corr and/or corrcoef (calculates the correlation of variables)
- plot (plots data so that you can see it graphically)
- imagesc (an alternative to the plot function, that shows values of a vector or array on a colormapped grid)
Also, you should know that if you want to suppress MATLAB from printing the results of a line of code to the command window, end the line with a semicolon. Entering this:
lots_of_junk = rand(1000)
versus this:
lots_of_junk = rand(1000);
should show you why this can be useful.
If you go to the File->Open menu and choose M-File, you will get a blank editor window where you can enter commands and save it as a script. This will be very useful for testing out code as you build your simulations. You'll need to save your file somewhere in the home directory, where your user account has permission to write files.
MATLAB also has conditional and iterative logic control - if/else statements, for and while loops are all supported.
if 2 < 1
disp('impossible')
else
disp('necessary')
end
for n = 1:10
n
end
while n > 1
n
n = n - 1;
end
Building a Simulation and Fitting a Model
Now that you've been introduced to the very basics of MATLAB, we're going to simulate some data and run the GLM on said data. By changing the characteristics of how the regressors in the model interact with one another, we can see how that changes the estimated parameters.
- To simulate a dataset, we first create some regressors that describe events. Start with 2, to make things easy.
- For example make variable X_1 that is equal to [1 0 1 0 1 0 1 0]' and a variable X_2 that is equal to [1 1 0 0 1 1 0 0]'
These are convenient, because they are completely orthogonal, and therefore uncorrelated.
- Now we need to create some simulated data that includes these variables, like:
Y = 2 * X_1 + 1 * X_2
which creates a beta of 2 for the first variable, and a beta of 1 for the second.
- Next we need to introduce noise into the model, along with a mean:
Y = Y + randn(size(Y)) + 1
- After the simulated data are created, we need to specify the design matrix. This will include both regressors, and a column of ones to model the grand mean:
X = [ones(size(Y)), X_1, X_2]
- Now that we have a design and simulated date, we can estimate our betas:
beta_hat = pinv(X)*Y
The estimates are likely to be poor, because the number of observations is very small.
From Here
Now is a choose your own adventure. Because there is a large range of programming and statistical skills, you should take the remaining time to expand on the technique outlined above to run more simulations at your own level of complexity.
Suggestions include:
- Calculate the residuals from your model.
- See how changing the number of events changes your estimated betas.
- See how changing the correlation between regressors changes your estimated values.
- Generate graphical views of the data and model using plot or imagesc
- Calculate sigma squared.
- Try introducing other signals in the data that aren't modeled in the design matrix, and see how that changes your estimates and statistics.
You can certainly alter your script and rerun it for each variation, but you may also want to run multiple simulations in a loop and store the outcomes for later.
If you get stuck, or are looking for inspiration, see some sample code that Jeanette has written here. But try on your own before going straight here! =)
