Randomise v2.1

Permutation-based nonparametric inference


INTRODUCTION

Permutation methods (also known as randomisation methods) are used for inference (thresholding) on statistic maps when the null distribution is not known. The null distribution is unknown either because the noise in the data does not follow a simple distribution, or because non-standard statistics are used to summarize the data. randomise is a simple permutation program enabling modelling and inference using the standard GLM design setup used, for example, in FEAT. It can output voxelwise and cluster-based tests, and also offers variance smoothing as an option. For more detail on permutation testing in neuroimaging see Nichols and Holmes (2002).

Test Statistics in Randomise

randomise produces a test statistic image (e.g., ADvsNC_tstat1, if your chosen output rootname is ADvsNC) and sets of P-value images (stored as 1-P for more convenient visualization, as bigger is then "better"). The table below shows the filename suffixes for each of the different test statistics available.

Voxel-wise uncorrected P-values are generally only useful when a single voxel is selected a priori (i.e., you don't need to worry about multiple comparisons across voxels). The significance of suprathreshold clusters (defined by the cluster-forming threshold) can be assessed either by cluster size or cluster mass. Size is just cluster extent measured in voxels. Mass is the sum of all statistic values within the cluster. Cluster mass has been reported to be more sensitive than cluster size (Bullmore et al, 1999; Hayasaka & Nichols, 2003).
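
For example, cluster-size or cluster-mass inference can be requested by passing the cluster-forming threshold as the argument to the -c or -C option respectively (the threshold value and filenames here are illustrative):
randomise -i data4D -o ClusterRun -d design.mat -t design.con -m mask -c 2.3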

Accounting for Repeated Measures

Permutation tests do not easily accommodate correlated datasets (e.g., temporally smooth timeseries), as null-hypothesis exchangeability is essential. However, the case of "repeated measurements", or more than one measurement per subject in a multisubject analysis, can sometimes be accommodated.

randomise allows the definition of exchangeability blocks, as specified by the group_labels option. If specified, the program will only permute observations within a block, i.e., only observations with the same group label will be exchanged. See the repeated measures example below for more detail.

Confound Regressors

Unlike with the previous version of randomise, you no longer need to treat confound regressors in a special way (e.g. putting them in a separate design matrix). You can now include them in the main design matrix, and randomise will work out from your contrasts how to deal with them. For each contrast, an "effective regressor" is formed using the original full design matrix and the contrast, as well as a new set of "effective confound regressors", which are then pre-removed from the data before the permutation testing begins. One side-effect of the new, more powerful, approach is that the full set of permutations is run for each contrast separately, increasing the time that randomise takes to run.

More information on the theory behind randomise can be found in the Background Theory section below.


USING randomise

A typical simple call to randomise uses the following syntax:

randomise -i <4D_input_data> -o <output_rootname> -d design.mat -t design.con -m <mask_image> -n 500 -D -T

design.mat and design.con are text files containing the design matrix and list of contrasts required; they follow the same format as generated by FEAT (see below for examples). The -n 500 option tells randomise to generate 500 permutations of the data when building up the null distribution to test against. The -D option tells randomise to demean the data before continuing - this is necessary if you are not modelling the mean in the design matrix. The -T option tells randomise that the test statistic that you wish to use is TFCE (threshold-free cluster enhancement - see below for more on this).

There are two programs that make it easy to create the design matrix, contrast and exchangeability-block files (design.mat / design.con / design.grp). The first is the Glm GUI, which allows the specification of designs in the same way as in FEAT, and the second is design_ttest2, a simple script for generating design files for the two-group unpaired t-test case.
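
For example, a call along the following lines (the rootname and group sizes are illustrative) creates design.mat and design.con for an unpaired comparison of a group of 8 followed by a group of 12 in the 4D file:
design_ttest2 design 8 12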

randomise offers voxel-wise, TFCE and cluster-based (extent and mass) thresholding/output options. The filename suffixes produced for each test statistic are summarized in the table below.

                      Voxel-wise          TFCE                 Cluster-wise (extent)    Cluster-wise (mass)
Raw test statistic    _tstat              _tfce_tstat          n/a                      n/a
                      _fstat              _tfce_fstat
1 - Uncorrected P     _vox_p_tstat        _tfce_p_tstat        n/a                      n/a
                      _vox_p_fstat        _tfce_p_fstat
1 - FWE-corrected P   _vox_corrp_tstat    _tfce_corrp_tstat    _clustere_corrp_tstat    _clusterm_corrp_tstat
                      _vox_corrp_fstat    _tfce_corrp_fstat    _clustere_corrp_fstat    _clusterm_corrp_fstat

"FWE-corrected" means that the family-wise error rate is controlled. If only FWE-corrected P-values less than 0.05 are accepted, the chance of one more false positives occurring over space space is no more than 5%. Equivalently, one has 95% confidence of no false positives in the image.

Note that these output images are 1-P images, where a value of 1 is therefore most significant (arranged this way to make display and thresholding convenient). Thus to "threshold at p<0.01", threshold the output images at 0.99 etc.
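
For example, to keep only voxels significant at corrected p<0.05, you could threshold a corrected 1-P image with fslmaths (filenames illustrative):
fslmaths ADvsNC_tfce_corrp_tstat1 -thr 0.95 ADvsNC_tfce_corrp_tstat1_thr95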

If your design is simply all 1s (for example, a single group of subjects) then randomise needs to work in a different way. Normally it generates random samples by randomly permuting the rows of the design; however in this case it does so by randomly inverting the sign of the 1s. In this case, then, instead of specifying design and contrast matrices on the command line, use the -1 option.

You can potentially improve the estimation of the variance that feeds into the final "t" statistic image by using the variance smoothing option -v <std> where you need to specify the spatial extent of the smoothing in mm.


EXAMPLES

One-Sample T-test

To perform a nonparametric 1-sample t-test (e.g., on COPEs created by FEAT FMRI analysis), create a 4D image of all of the images. There should be no repeated measures, i.e., there should only be one image per subject. Because this is a single group simple design you don't need a design matrix or contrasts. Just use:
randomise -i OneSamp4D -o OneSampT -1 -T
Note you do not need the -D option (as the mean is in the model), and if you omit the -n option, the default of 5000 permutations will be performed.

If you have fewer than 20 subjects (approx. 20 DF), then you will usually see an increase in power by using variance smoothing, as in
randomise -i OneSamp4D -o OneSampT -1 -v 5 -T
which applies 5mm HWHM variance smoothing.

Note also that randomise will automatically select one-sample mode for appropriate design/contrast combinations.

Two-Sample Unpaired T-test

To perform a nonparametric 2-sample t-test, create a 4D image of all of the images, with the subjects in the right order! Create appropriate design.mat and design.con files.
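
For illustration, with three subjects per group (first group first in the 4D file), design.mat would contain the two group-membership columns (the header lines written by the Glm GUI are omitted here)
1 0
1 0
1 0
0 1
0 1
0 1
and design.con would contain the contrasts for group1 > group2 and group2 > group1:
1 -1
-1 1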

Once you have your design files run:
randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T

Two-Sample Unpaired T-test with Nuisance Variables

To perform a nonparametric 2-sample t-test in the presence of nuisance variables, create a 4D image of all of the images. Create appropriate design.mat and design.con files, where your design matrix has additional nuisance variables that are (appropriately) ignored by your contrast.
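
For illustration, extending the two-group design above with demeaned age as a third column gives a design.mat like (the covariate values are made up)
1 0 -2.5
1 0 1.5
1 0 0.5
0 1 4.0
0 1 -3.0
0 1 -0.5
with a group-difference contrast that ignores the nuisance column:
1 -1 0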

Once you have your design files the call is as before:

randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T

Repeated measures ANOVA

Following the ANOVA: 1-factor 4-levels (Repeated Measures) example from the FEAT manual, assume we have 2 subjects with 1 factor at 4 levels. We therefore have eight input images and we want to test if there is any difference over the 4 levels of the factor. The design matrix looks like

1 0 1 0 0
1 0 0 1 0
1 0 0 0 1
1 0 0 0 0
0 1 1 0 0
0 1 0 1 0
0 1 0 0 1
0 1 0 0 0
where the first two columns model subject means and the 3rd through 5th columns model the categorical effect (note the different arrangement of rows relative to the FEAT example). Three t-contrasts for the categorical effect
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
are selected together into a single F-contrast
1 1 1

Modify the exchangeability-block information in design.grp to match

1
1
1
1
2
2
2
2
This will ensure that permutations will only occur within subject, respecting the repeated measures structure of the data.

The number of permutations can be computed for each exchangeability block, and then multiplied together to find the total number of permutations. Using the one-way ANOVA computation for 4 levels with one observation each gives (1+1+1+1)!/(1! × 1! × 1! × 1!) = 24 possible permutations per subject, and hence 24 × 24 = 576 total permutations. The call is then similar to the above examples:
randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -f design.fts -m mask -e design.grp -T


BACKGROUND THEORY

A standard nonparametric test is exact, in that the false positive rate is exactly equal to the specified α level. Using randomise with a GLM that corresponds to one of the following simple statistical models will result in exact inference:

  1. One sample t-test on difference measures
  2. Two sample t-test
  3. One-way ANOVA
  4. Simple correlation
Use of almost any other GLM will result in approximately exact inference. In particular, when the model includes both the effect tested (e.g., difference in FA between two groups) and nuisance variables (e.g., age), exact tests are not generally available. Permutation tests rely on an assumption of exchangeability; with the models above, the null hypothesis implies complete exchangeability of the observations. When there are nuisance effects, however, the null hypothesis no longer assures the exchangeability of the data (e.g. even when the null hypothesis of no FA difference is true, age effects imply that you can't permute the data without altering the structure of the data).

Permutation tests for the General Linear Model

For an arbitrary GLM, randomise uses the method of Freedman & Lane (1983). Based on the contrast (or set of contrasts defining an F test), the design matrix is automatically partitioned into tested effects and nuisance (confound) effects. The data are first fit to the nuisance effects alone and nuisance-only residuals are formed. These residuals are permuted, and then the estimated nuisance signal is added back on, creating an (approximate) realization of data under the null hypothesis. This realization is fit to the full model and the desired test statistic is computed as usual. This process is repeated to build up the distribution of the test statistic under the null hypothesis specified by the contrast(s). For the simple models above, this method is equivalent to the standard exact tests; otherwise, it accounts for nuisance variation present under the null. Note that randomise v2.0 and earlier used a method due to Kennedy (1995). While both the Freedman-Lane and Kennedy methods are accurate for large n, for small n the Kennedy method can tend to falsely inflate significance. For a review of these issues and further possible methods, see Anderson & Robinson (2001).

This approximate permutation test is asymptotically exact, meaning that the results become more accurate with an ever-growing sample size (for a fixed number of regressors). For large sample sizes, with 50-100 or more degrees of freedom, the P-values should be highly accurate. When the sample size is low and there are many nuisance regressors, accuracy could be a problem. (The accuracy is easily assessed by generating random noise data and fitting it to your design; the uncorrected P-values should be uniformly spread between zero and one; the test will be invalid if there is an excess of small P-values and conservative if there is a deficit of small P-values.)
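
As a minimal sketch of such a check (assuming your copy of fslmaths supports the -randn Gaussian-noise option; filenames are illustrative), you could synthesize a pure-noise dataset with the same dimensions as your data and run your design against it:
fslmaths data4D -mul 0 -randn noise4D
randomise -i noise4D -o NoiseCheck -d design.mat -t design.con -m mask -n 1000
The histogram of the resulting uncorrected P-values should then look approximately uniform.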

Monte Carlo Permutation Tests

A proper "exact" test arises from evaluating every possible permutation. Often this is not feasible; e.g., a simple correlation with 12 scans has nearly half a billion possible permutations. Instead, a random sample of possible permutations can be used, creating a Monte Carlo permutation test. On average the Monte Carlo test is exact and will give similar results to carrying out all possible permutations.

If the number of possible permutations is large, one can show that a true, exhaustive P-value of p will produce Monte Carlo P-values between p ± 2√(p(1-p)/n) about 95% of the time, where n is the number of Monte Carlo permutations.
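
For example, with n = 5000 permutations and p = 0.05, the limits are 0.05 ± 2√(0.05 × 0.95/5000) ≈ 0.05 ± 0.0062, as in the table below.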

n         Confidence limits for p = 0.05
100       0.0500 ± 0.0436
1,000     0.0500 ± 0.0138
5,000     0.0500 ± 0.0062
10,000    0.0500 ± 0.0044
50,000    0.0500 ± 0.0019
The table above shows confidence limits for p = 0.05 for various n. At least 5,000 permutations are required to reduce the uncertainty appreciably, though 10,000 permutations are required to reduce the margin-of-error to below 10% of the nominal alpha. Hence the default is 5000, though if time permits, 10000 is recommended. Use the -n option to set the number of permutations; if this number is greater than or equal to the number of possible permutations, an exhaustive test is run.

'Draft' Analyses to Check for Any Significance

To minimize computational expense, a very short 'draft' analysis can be run to screen for any significance. Since the draft analysis won't be very accurate, use a generous threshold to ensure you'll catch a significant result. Specifically, if you run with 200 permutations and use a FWE significance threshold of p ≤ 0.1, you will almost always (greater than 99.9% chance) detect a true p-value of 0.05 (where "truth" corresponds to running every permutation). If you find anything significant, re-run with at least 5,000 (ideally 10,000) permutations to get the final result.
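
A draft run of this kind might look like (filenames illustrative):
randomise -i data4D -o Draft -d design.mat -t design.con -m mask -n 200 -T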

Counting Permutations

Exchangeability under the null hypothesis justifies the permutation of the data. For n scans, there are n! (n factorial, n×(n-1)×(n-2)×...×2×1) possible ways of shuffling the data. For some designs, though, many of these shuffles are redundant. For example, in a two-sample t-test, permuting two scans within a group will not change the value of the test statistic. The number of possible permutations for different designs is given below.

Model                                       Sample size(s)    Number of permutations
One-sample t-test on difference measures    n                 2^n
Two-sample t-test                           n1, n2            (n1+n2)! / (n1! × n2!)
One-way ANOVA                               n1, ..., nk       (n1+...+nk)! / (n1! × ... × nk!)
Simple correlation                          n                 n!
Note that the one-sample t-test is an exception: the data are not permuted, but rather their signs are randomly flipped. For all designs except the one-sample t-test, randomise uses a generic algorithm which counts the number of unique possible permutations for each contrast. If X is the design matrix and c is the contrast of interest, then Xc is the sub-design matrix of the effect of interest; the number of unique rows in Xc is counted and the one-way ANOVA calculation is used.
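
For example, a two-sample t-test with 10 subjects per group has 20!/(10! × 10!) = 184,756 possible permutations, while a one-sample t-test on 12 subjects has 2^12 = 4,096 possible sign flips.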

Parallelising Randomise

If you have an SGE-capable system then a randomise job can be split in parallel with randomise_parallel, which takes the same input options as the standard randomise binary and then calculates and batches an optimal number of randomise sub-tasks. The parallelisation has two stages: first the randomise sub-jobs are run, and then the sub-job results are combined in the final output.
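
For example (filenames illustrative):
randomise_parallel -i data4D -o ParRun -d design.mat -t design.con -m mask -n 10000 -T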

Exchangeability Blocks

The pre-FSL4.1.8 version of randomise had a bug in the -e option that could generate incorrect permutations of scans across subject blocks. Incorrectly permuting scans across subject blocks does two things: under permutation, it randomly induces large positive or negative effects (inflating the variability of the numerator of the t statistic over permutations), but it ALSO increases the residual standard deviation of each fit (inflating the denominator of the t statistic on each permutation). Hence it isn't clear which will dominate: whether the permutation distribution of t values will be artifactually expanded (wrongly decreasing significance) or artifactually contracted (wrongly inflating significance). Simulations and re-analysis of real data suggest that the effect of the bug is always to wrongly deflate significance; thus, it is anticipated that results with the corrected code will have improved significance.


REFERENCES

MJ Anderson & J Robinson. Permutation Tests for Linear Models. Aust. N.Z. J. Stat. 43(1):75-88, 2001.

ET Bullmore, J Suckling, S Overmeyer, S Rabe-Hesketh, E Taylor & MJ Brammer. Global, voxel, and cluster tests, by theory and permutation, for a difference between two groups of structural MR images of the brain. IEEE TMI, 18(1):32-42, 1999.

D Freedman & D Lane. A nonstochastic interpretation of reported significance levels. J. Bus. Econom. Statist. 1:292-298, 1983.

S Hayasaka & TE Nichols. Validating cluster size inference: random field and permutation methods. NeuroImage, 20:2343-2356, 2003.

PE Kennedy. Randomization tests in econometrics. J. Bus. Econom. Statist. 13:84-95, 1995.

TE Nichols and AP Holmes. Nonparametric Permutation Tests for Functional Neuroimaging: A Primer with Examples. Human Brain Mapping, 15:1-25, 2002.


Copyright © 2004-2007, University of Oxford. Written by T. Behrens, S. Smith, M. Webster and T. Nichols.