Update to trimr: A response time trimming package in R

In 2015, I published my first ever R package: “trimr: An implementation of common response time trimming methods”. To date, this package has had over 5,600 downloads from CRAN, which is unbelievable!

3 years is a long time for a package, so I thought it was time to update the package, fix some bugs, and implement some additional functionality. To this end, I have spent some time over the past few days working on a new release. This new release is version 1.1.0. I will spend some time testing this version before updating it on CRAN; at present, it is only available at GitHub.

In this post I will provide an overview of the main changes since the previous version. I will also provide an updated tutorial on how to use trimr from the previous tutorial released in 2015.

If you use trimr in your work, I would really appreciate a citation to the package. Please see the references section for how to do this.


Contents

Here’s a table of contents for the blog post. If you are here just to learn about how to use trimr, then just skip over the list of updates and get stuck right in:

Overview of Updates

Installing trimr

Example Data

Overview of Trimming Methods

Absolute Value Criterion:

Standard Deviation Criterion

Recursive / Moving Criterion

Data from Factorial Designs

References


Overview of Updates

  1. Bug fixes. Since its release in 2015, I have been contacted by many users kindly pointing out some bugs in trimr. Some of these were very minor and won’t be mentioned here, but below I outline some of the key bugs that are fixed in the newest release. Note that some issues remain, but are less urgent; you can find a list of issues and add your own if you find any here.

    • There was an error in the sdTrim function when the user was asking for trimming at the participant and condition level. It would just return a blank data frame (ooops, sorry!).
    • The recursive trimming functions would break when there were more than 97 trials.
  2. Flexible naming of columns in file containing the data. In the previous version of trimr, users were forced to name their columns in a fixed way so that the package knew (for example) which column contained the response time data, which column contained the experimental conditions, etc. With massive thanks to contributions made by Ed Berry, users can now specify their own names for the columns, and tell trimr where to find the relevant data. This is a huge improvement, so a massive thanks to Ed!!

  3. All trimming functions now return trial-level data. The two trimming methods I expect most users are utilising in trimr are the absoluteTrim method and the sdTrim method. Both of these methods allowed users to dictate the format for how the trimmed data should be returned:

    • mean returns calculated mean RT per participant per condition after trimming
    • median returns calculated median RT per participant per condition after trimming
    • raw returns trial-level data after the trials with trimmed RTs are removed

    However, this “returnType” argument was not available for the recursive / moving criterion trimming methods. In particular, I received several email requests for an update to trimr to allow return-type of “raw” for these methods to allow the user to model their data using linear mixed effect modelling (which often requires trial-level, rather than aggregated, participant data). These different return-types are now implemented for all methods.


Installing trimr

This version of trimr will soon be submitted to the official R package repository, CRAN. When it is available on CRAN (I will provide an update here when this is the case), you can install trimr directly by opening R and typing:

install.packages("trimr")

To install the latest version of trimr (which should always be considered developmental), you will first need to install the devtools package and install trimr directly from GitHub by using the following commands:

# install devtools
install.packages("devtools")

# install trimr from GitHub
devtools::install_github("JimGrange/trimr")

Example Data

Of course, you will be using trimr on your own data. This data will be a data frame that contains your experimental data. At present, trimr only accepts data in wide format, but it is on my to-do list to allow long format entry. This data frame should contain (at a minimum) columns that specify your participant numbers, your experimental conditions, your response time data, and your accuracy data.

In previous versions of trimr your data had to contain columns with the same names used below. However, you can now specify your own column names. The default values are the old required values meaning no code written for previous versions of trimr will be broken by this change. You can have other columns in your data file (which will just be ignored by trimr).

Whilst you are getting use to trimr’s functionality, you can use some example data that ships with the package.

# load the trimr package
library(trimr)

# activate the data
data(exampleData)

# look at the top of the data
head(exampleData)
##   participant condition   rt accuracy
## 1           1    Switch 1660        1
## 2           1    Switch  913        1
## 3           1    Repeat 2312        1
## 4           1    Repeat  754        1
## 5           1    Switch 3394        1
## 6           1    Repeat  930        1

This data is simulated, and has data from 32 subjects. This data is from a task switching experiment, where RT and accuracy was recorded for two experimental conditions: Switch, when the task switched from the previous trial, and Repeat, when the task repeated from the previous trial. (If you have data from a factorial design (i.e., the condition codes are spread over more than one column), then please see the final section of this vignette for how to deal with this in trimr).

The example data consists of 4 columns:

  1. participant: Codes the number of each participant in the experiment
  2. condition: In this example, there are two experimental conditions: “Switch”, and “Repeat”.
  3. rt: Logs the response time of the participant in milliseconds.
  4. accuracy: Logs the accuracy of the response. 1 codes a correct response, 0 an error response.

If the columns of your data frame (from your real experiment) have different names from the defaults these can be specified in the calls to the trimming functions (described later). For now, here is an example:

# perform the trimming (note this is just an example and won't run)
trimmedData <- absoluteRT(data = exampleData, 
                          pptVar = "id", condVar = "cond", rtVar = "RT", 
                          accVar = "correct", minRT = 150, maxRT = 2000, 
                          digits = 0)

This is the functionality added by Ed Berry. In this call, the arguments are as follows:

  • pptVar: The name of the column holding participant identification numbers
  • condVar: The name of the column holding the different conditions
  • rtVar: The name of the column holding response time data
  • accVar: The name of the column holding accuracy data

Overview of Trimming

The trimming functions provided by trimr—at present—fall broadly into three families (seen here with their trimr function names). Click on each family member to be taken to the relevant overview & tutorial:

  1. Absolute Value Criterion: involves identifying an absolute upper- and lower-limit on RTs to include in the final analysis (e.g., “RTs faster than 150ms and slower than 2,000ms were excluded from data analysis”).
    • absoluteRT
  2. Standard Deviation Criterion: whereby the upper limit for RT removal is based on a standard deviation criterion (e.g., “RTs slower than 2.5 standard deviations about the mean for each participant for each experimental condition were excluded from data analysis”).
    • sdTrim
  3. Recursive / Moving Criterion: are various function introduced by Van Selst & Jolicoeur (1994) which try to minimise bias when the number of trials entering the trimming function can vary.
    • nonRecursive
    • modifiedRecursive
    • hybridRecursive

Absolute Value Criterion

The absolute value criterion is the simplest of all of the trimming methods available (except of course for having no trimming). An upper- and lower-criterion is set by the user, and any response time that falls outside of these limits are removed. The function that performs this trimming method in trimr is called absoluteRT.

absoluteRT

In this function, the user declares lower- and upper-criterion for RT trimming (minRT and maxRT arguments, respectively); RTs outside of these criteria are removed. Note that these criteria must be in the same unit as the RTs are logged in within the data frame being used. The function also has some other important arguments:

  • omitErrors: If the user wishes error trials to be removed from the trimming, this needs to be set to TRUE (it is set to this by default). Alternatively, some users may wish to keep error trials included. Therefore, set this argument to FALSE.
  • returnType: Here, the user can control how the data are returned.
    • “raw” returns trial-level data after the trials with trimmed RTs are removed;
    • “mean” returns calculated mean RT per participant per condition after trimming;
    • “median” returns calculated median RT per participant per condition after trimming. This is set to “mean” by default.
  • digits: Controls how many digits to round the data to after trimming. If the user has a data frame where the RTs are recorded in seconds (e.g., 0.657), this argument can be left at its default value of 3. However, if the data are logged in milliseconds, it might be best to change this argument to zero, so there are no decimal places in the rounding of RTs (e.g., 657).

In this first example, let’s trim the data using criteria of RTs less than 150 milliseconds and greater than 2,000 milliseconds, with error trials removed before trimming commences. Let’s also return the mean RTs for each condition, and round the data to zero decimal places.

# perform the trimming
trimmedData <- absoluteRT(data = exampleData, minRT = 150, maxRT = 2000, 
                          omitErrors = TRUE, returnType = "mean", 
                          digits = 0)

# look at the top of the data
head(trimmedData)
##   participant Switch Repeat
## 1           1    901    742
## 2          10    793    669
## 3          11   1054    943
## 4          12    880    662
## 5          13    914    773
## 6          14   1000    817

Just to show how the other arguments are used, let’s now do the same trimming but return the raw (i.e., trial-level) data rather than just participant averages (by changing the returnType argument input):

# perform the trimming
trimmedData <- absoluteRT(data = exampleData, minRT = 150, maxRT = 2000, 
                          omitErrors = TRUE, returnType = "raw", 
                          digits = 0)

# look at the top of the data
head(trimmedData)
##    participant condition   rt accuracy
## 1            1    Switch 1660        1
## 2            1    Switch  913        1
## 4            1    Repeat  754        1
## 6            1    Repeat  930        1
## 7            1    Switch 1092        1
## 11           1    Repeat  708        1

You will see in this case that the returned data frame is the same format as the input, but now the trimmed trials are removed.


Standard Deviation Criterion

This trimming method uses a standard deviation multiplier as the upper criterion for RT removal (users still need to enter a lower-bound manually via minRT). For example, this method can be used to trim all RTs 2.5 standard deviations above the mean RT. This trimming can be done per condition (e.g., 2.5 SDs above the mean of each condition), per participant (e.g., 2.5 SDs above the mean of each participant), or per condition per participant (e.g., 2.5 SDs above the mean of each participant for each condition).

sdTrim

In this function, the user declares a lower-bound on RT trimming (e.g., 150 milliseconds) and an upper-bound in standard deviations. The value of standard deviation used is set by the SD argument. How this is used varies depending on the values the user passes to two important additional function arguments:

  • perCondition: If set to TRUE, the trimming will occur above the mean of each experimental condition in the data file
  • perParticipant: If set to TRUE, the trimming will occur above the mean of each participant in the data file

Note that if both are set to TRUE, the trimming will occur per participant per condition (e.g., if SD is set to 2.5, the function will trim RTs 2.5 SDs above the mean RT of each participant for each condition).

In this example, let’s trim RTs faster than 150 milliseconds, and greater than 3 SDs above the mean of each participant, and return the mean RTs:

trimmedData <- sdTrim(data = exampleData, minRT = 150, sd = 3, 
                      perCondition = FALSE, perParticipant = TRUE, 
                      returnType = "mean", digits = 0)

# look at the top of the data
head(trimmedData)
##   participant Switch Repeat
## 1           1   1042    775
## 2          10    779    666
## 3          11   1082    964
## 4          12    871    652
## 5          13    914    773
## 6          14   1034    827

Now, let’s trim per condition per participant:

# trim the data
trimmedData <- sdTrim(data = exampleData, minRT = 150, sd = 3, 
                      perCondition = TRUE, perParticipant = TRUE, 
                      returnType = "mean", digits = 0)

# look at the top of the data
head(trimmedData)
##   participant Switch Repeat
## 1           1   1099    742
## 2          10    784    660
## 3          11   1079    968
## 4          12    874    644
## 5          13    911    776
## 6          14   1038    827

Recursive / Moving Criterion

Three functions in this family implement the trimming methods proposed & discussed by van Selst & Jolicoeur (1994): nonRecursive, modifiedRecursive, and hybridRecursive. van Selst & Jolicoeur noted that the outcome of many trimming methods is influenced by the sample size (i.e., the number of trials) being considered, thus potentially producing bias. For example, even if RTs are drawn from identical positively-skewed distributions, a “per condition per participant” SD procedure (see sdTrim above) would result in a higher mean estimate for small sample sizes than larger sample sizes.

This bias was shown to be removed when a “moving criterion” (MC) was used; this is where the SD used for trimming is adapted to the sample size being considered.

nonRecursive

The non-recursive method proposed by van Selst & Jolicoeur (1994) is very similar to the standard deviation method outlined above with the exception that the user does not specify the SD to use as the upper bound. The SD used for the upper bound is instead decided by the sample size of the RTs being passed to the trimming function, with larger SDs being used for larger sample sizes.

Also, the function only trims per participant per condition.

The nonRecursive function checks the sample size of the data being passed to it, and looks up the SD criterion required for the data’s sample size. The function looks in a data file contained in trimr called linearInterpolation. Should the user wish to see this data file (although the user will never need to access it if they are not interested), type:

# load the linear interpolation data file
data(linearInterpolation)

# show the first 20 rows (there are 100 in total)
linearInterpolation[1:20, ]
##    sampleSize modifiedRecursive nonRecursive
## 1           4             8.000       1.4580
## 2           5             6.200       1.6800
## 3           6             5.300       1.8410
## 4           7             4.800       1.9610
## 5           8             4.475       2.0500
## 6           9             4.250       2.1200
## 7          10             4.110       2.1700
## 8          11             4.000       2.2200
## 9          12             3.920       2.2460
## 10         13             3.850       2.2740
## 11         14             3.800       2.3100
## 12         15             3.750       2.3260
## 13         16             3.728       2.3390
## 14         17             3.706       2.3520
## 15         18             3.684       2.3650
## 16         19             3.662       2.3780
## 17         20             3.640       2.3910
## 18         21             3.631       2.3948
## 19         22             3.622       2.3986
## 20         23             3.613       2.4024

Notice there are two columns. This current function will only look in the nonRecursive column; the other column is used by the modifiedRecursive function, discussed below. If the sample size of the current set of data is 16 RTs (for example), the function will use an upper SD criterion of 2.359, and will proceed much like the sdTrim function’s operations.

Here is how to use the function:

# trim the data
trimmedData <- nonRecursive(data = exampleData, minRT = 150, digits = 0, 
                            returnType = "mean")

# see the top of the data
head(trimmedData)
##   participant Switch Repeat
## 1           1   1053    732
## 2          10    779    652
## 3          11   1066    960
## 4          12    871    638
## 5          13    900    766
## 6          14   1018    817

Important update!

Note that in previous versions of trimr, the user could only return the mean trimmed RTs (i.e., there was no “returnType” argument for this function). This has now been changed for the newest release of trimr.

To show this new functionality, let’s repeat the above, but return the raw (i.e., trial-level) data:

# trim the data
trimmedData <- nonRecursive(data = exampleData, minRT = 150, digits = 0, 
                            returnType = "raw")

# see the top of the data
head(trimmedData)
##   participant condition   rt accuracy
## 1           1    Switch 1660        1
## 2           1    Switch  913        1
## 4           1    Switch 1092        1
## 6           1    Switch 2473        1
## 7           1    Switch 2211        1
## 8           1    Switch 1842        1

modifiedRecursive

The modifiedRecursive function is more involved than the nonRecursive function. This function performs trimming in cycles. It first temporarily removes the slowest RT from the distribution; then, the mean of the sample is calculated, and the cut-off value is calculated using a certain number of SDs around the mean, with the value for SD being determined by the current sample size. In this procedure, required SD decreases with increased sample size (cf., the nonRecursive method, with increasing SDs with increasing sample size; see the linearInterpolation data file above); see Van Selst and Jolicoeur (1994) for justification.

The temporarily removed RT is then returned to the sample, and the fastest and slowest RTs are then compared to the cut-off, and removed if they fall outside. This process is then repeated until no outliers remain, or until the sample size drops below four. The SD used for the cut-off is thus dynamically altered based on the sample size of each cycle of the procedure, rather than static like the nonRecursive method.

# trim the data
trimmedData <- modifiedRecursive(data = exampleData, minRT = 150, digits = 0, 
                                 returnType = "mean")

# see the top of the data
head(trimmedData)
##   participant Switch Repeat
## 1           1   1047    717
## 2          10    779    647
## 3          11   1075    931
## 4          12    871    638
## 5          13    911    763
## 6          14   1014    799

As with the nonRecursive function, in the new release of trimr you can now obtain trial-level trimmed data:

# trim the data
trimmedData <- modifiedRecursive(data = exampleData, minRT = 150, digits = 0, 
                                 returnType = "raw")
## Warning: package 'bindrcpp' was built under R version 3.4.4
# see the top of the data
head(trimmedData)
##   participant condition   rt accuracy
## 1           1    Switch 1660        1
## 2           1    Switch  913        1
## 3           1    Switch 1092        1
## 4           1    Switch 2473        1
## 5           1    Switch 2211        1
## 6           1    Switch 1842        1

hybridRecursive

van Selst and Jolicoeur (1994) reported slight opposing trends of the non-recursive and modified-recursive trimming methods (see page 648, footnote 2). They therefore, in passing, suggested a “hybrid-recursive” method might balance the opposing trends. The hybrid-recursive method simply takes the average of the non-recursive and the modified-recursive methods.

Please note that this function does not allow for different types of “returnType”; it only returns mean data per participant, per condition.

# trim the data
trimmedData <- hybridRecursive(data = exampleData, minRT = 150, digits = 0)

# see the top of the data
head(trimmedData)
##   participant Switch Repeat
## 1           1   1050    724
## 2          10    779    649
## 3          11   1071    946
## 4          12    871    638
## 5          13    906    764
## 6          14   1016    808

Data from Factorial Designs

In the example data that ships with trimr, the RT data comes from just two conditions (Switch vs. Repeat), which are coded in the column “condition”. However, in experimental psychology, factorial designs are prevalent, where RT data comes from more than one independent variable, with each IV having multiple levels. How can trimr deal with this format?

First, let’s re-shape the exampleData set to how data might be stored from a factorial design. Let there be two IVs, each with two levels:

  1. taskSequence: Switch vs. Repeat
  2. reward: Reward vs. NoReward

The taskSequence factor is coding whether the task has Switched or Repeated from the task on the previous trial (as before). The reward factor is coding whether the participant was presented with a reward or not on the current trial (presented randomly). Let’s reshape our data frame to match this fictitious experimental scenario:

# get the example data that ships with trimr
data(exampleData)

# pass it to a new variable
newData <- exampleData

# add a column called "taskSequence" that logs whether the current task was a 
# repetition or a switch trial (which is currently coded in the "condition")
# column
newData$taskSequence <- newData$condition

# add a column called "reward" that logs whether the participant received a 
# reward or not. Fill it with random entries, just for example. This uses Rs
# "sample" function
newData$reward <- sample(c("Reward", "NoReward"), nrow(newData), 
                         replace = TRUE)

# delete the "condition" column
newData <- subset(newData, select = -condition)

# now let's look at our new data
head(newData)
##   participant   rt accuracy taskSequence   reward
## 1           1 1660        1       Switch NoReward
## 2           1  913        1       Switch   Reward
## 3           1 2312        1       Repeat NoReward
## 4           1  754        1       Repeat   Reward
## 5           1 3394        1       Switch NoReward
## 6           1  930        1       Repeat NoReward

This now looks how data typically comes in from a factorial design. Now, to get trimr to work on this, we need to create a new column called “condition”, and to place in this column the levels of all factors in the design. For example, if the first trial in our newData has taskSequence = Switch and reward = NoReward, we would like our condition entry for this trial to read “Switch_NoReward”. This is simple to do using R’s “paste” function. (Note that this code can be adapted to deal with any number of factors.)

# add a new column called "condition", and fill it with information from both 
# columns that code for our factors
newData$condition <- paste(newData$taskSequence, "_", newData$reward, sep = "")

# let's again look at the data
head(newData)
##   participant   rt accuracy taskSequence   reward       condition
## 1           1 1660        1       Switch NoReward Switch_NoReward
## 2           1  913        1       Switch   Reward   Switch_Reward
## 3           1 2312        1       Repeat NoReward Repeat_NoReward
## 4           1  754        1       Repeat   Reward   Repeat_Reward
## 5           1 3394        1       Switch NoReward Switch_NoReward
## 6           1  930        1       Repeat NoReward Repeat_NoReward

Now we can pass this data frame to any function in trimr, and it will work perfectly:

# trim the data
trimmedData <- sdTrim(newData, minRT = 150, sd = 2.5, digits = 0, 
                      returnType = "mean")

# check it worked
head(trimmedData)
##   participant Switch_NoReward Switch_Reward Repeat_NoReward Repeat_Reward
## 1           1            1062          1034             737           726
## 2          10             773           785             652           648
## 3          11            1085          1046             970           944
## 4          12             865           878             655           620
## 5          13             903           898             780           753
## 6          14            1048           996             785           844

References

  • Grange, J.A. (2015). trimr: An implementation of common response time trimming methods. R package version 1.0.1. https://cran.r-project.org/web/packages/trimr/index.html

  • Van Selst, M. & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. Quarterly Journal of Experimental Psychology, 47 (A), 631—650.

Previous