Author: Krzysztof Tołpa

EEG Preprocessing for an Oddball Paradigm#

Preprocessing is a critical step in EEG analysis. It involves cleaning the data to remove noise and artifacts — signals that are not of neural origin — which can otherwise contaminate our results. We use the MNE-Python library together with two functions adapted from EEGLAB, rejchanspec and trimoutlier, and the mne-icalabel plugin for automatic classification of ICA components.

Our pipeline covers commonly applied steps:

  1. Loading Data: Importing the raw EEG recording.

  2. Channel Management: Handling channel locations, names, and types.

  3. Filtering: Applying high-pass and low-pass filters to isolate the frequency bands of interest.

  4. Artifact Detection: Identifying and marking bad channels and noisy time segments.

  5. Artifact Removal with ICA: Using Independent Component Analysis (ICA) to remove stereotyped artifacts like eye blinks and muscle activity.

  6. Interpolation & Resampling: Reconstructing bad channels and downsampling the data to make it more manageable.

  7. Epoching: Segmenting the continuous data into trials (epochs) centered around our experimental events.

Let’s get started!

1. Setup: Importing Necessary Libraries#

First things first, we need to import all the Python packages we’ll use throughout the analysis.

The file with the preprocessing_utils functions can be found on GitHub.

[1]:
import mne
import numpy as np
from os.path import join
import preprocessing_utils as utils
from mne.preprocessing import ICA
from mne_icalabel import label_components
import matplotlib
matplotlib.use('TkAgg') # so that the figures will pop up outside the cells

import warnings
warnings.filterwarnings('ignore', category=RuntimeWarning)

2. Loading and Inspecting Raw Data#

Now, we’ll define the file paths and load our raw data. We are working with data in the BrainVision format, which consists of three files (.vhdr, .vmrk, .eeg). MNE conveniently handles this by pointing to the header file (.vhdr).

We set preload=True to load the entire dataset into memory. This is necessary for many preprocessing steps, especially filtering and ICA.

[2]:
subjects_dir = 'data'
subject = 'sample_subject'

filepath = join(subjects_dir,subject,'_eeg','_raw')

filename = subject+'_oddball.vhdr'

raw = mne.io.read_raw_brainvision(join(filepath, filename), preload=True)

Extracting parameters from /Volumes/UMK/oddball/subjects/sample_subject/_eeg/_raw/sample_subject_oddball.vhdr...
Setting channel info structure...
Reading 0 ... 1329199  =      0.000 ...  1329.199 secs...

Visual Inspection#

Before diving into any processing, let’s look at the data. This first-pass visual check can help you spot major problems like dead channels (completely flat lines), extreme electrical noise, or large movement artifacts. The raw.plot() function is an interactive tool that lets you scroll through your data. It can be run at any step of the preprocessing to monitor the results.

[3]:
# raw.plot()

3. Channel Management#

Properly defining our channels is a foundational step. This involves removing non-essential channels and ensuring the remaining ones are correctly typed before we assign their 3D locations.

  • Dropping Unwanted Channels: We first remove any channels that are not EEG sensors, such as EMG (muscle) or EOG (eye movement) channels.

  • Setting Channel Types: We then ensure that all channels intended as EEG sensors are correctly labeled. The mastoid channels TP9 and TP10 were assigned the 'misc' type because they were attached to the earlobes. We explicitly set them to 'eeg'; their locations will be adjusted when we load the individual montage.

[4]:
raw = raw.copy().drop_channels(["EMG", "EOG"])

raw = raw.copy().set_channel_types({"TP9": "eeg", "TP10":"eeg"})

Setting the Montage (Electrode Locations)#

To know where our electrodes were on the participant’s head, we need to apply a montage. A montage file contains the 3D coordinates of each sensor. This is essential for:

  • Topographical plots (visualizing scalp activity).

  • Interpolating bad channels based on their neighbors.

  • Source localization analyses.

Here, we’re loading a subject’s individual .bvct file, which is a format from the Captrak digitizer system, and applying it to our raw object.
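If no individual digitization file is available, a standard template montage could be used as a fallback. The snippet below is only an illustration and is kept commented out so it does not interfere with the individual montage applied in the next cell; 'standard_1005' is one of MNE's built-in montages, and the right choice depends on your cap layout.

# Fallback sketch (not used in this pipeline): apply a built-in template montage
# when no individual CapTrak digitization exists.
# template = mne.channels.make_standard_montage('standard_1005')
# raw.set_montage(template)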

[5]:
# Get channel positions

montage_path = join(subjects_dir,subject,'_eeg','montage','captrak')
montage_name =f'{subject}_captrak.bvct'
montage = mne.channels.read_dig_captrak(join(montage_path, montage_name))
raw.set_montage(montage)
[5]:
General
Filename(s) sample_subject_oddball.eeg
MNE object type RawBrainVision
Measurement date 2018-10-31 at 11:19:57 UTC
Participant Unknown
Experimenter Unknown
Acquisition
Duration 00:22:10 (HH:MM:SS)
Sampling frequency 1000.00 Hz
Time points 1,329,200
Channels
EEG
Head & sensor digitization 130 points
Filters
Highpass 0.00 Hz
Lowpass 280.00 Hz

4. Filtering and Artifact Detection#

Now we begin the core cleaning process. The order of operations here is important.

High-Pass Filtering#

We start with a high-pass filter to remove slow drifts in the data. These drifts can be caused by sweat, skin potentials, or slow electrode movement. A cutoff of 1 Hz is common for ERP analysis, as it effectively removes these drifts while preserving the faster neural signals of interest.

[6]:
# High-Pass Filtering (>1 Hz)
raw = raw.filter(l_freq = 1, h_freq=None, method='fir', fir_window='hamming', n_jobs=-1)
Filtering raw data in 1 contiguous segment
Setting up high-pass filter at 1 Hz

FIR filter parameters
---------------------
Designing a one-pass, zero-phase, non-causal highpass filter:
- Windowed time-domain design (firwin) method
- Hamming window with 0.0194 passband ripple and 53 dB stopband attenuation
- Lower passband edge: 1.00
- Lower transition bandwidth: 1.00 Hz (-6 dB cutoff frequency: 0.50 Hz)
- Filter length: 3301 samples (3.301 s)

Automated Bad Channel and Segment Detection#

Next, we identify and mark bad data. This is a two-step process using custom utility functions, which are often developed in a lab to standardize analysis.

  1. ``rejchanspec`` (Reject Channel by Spectrum): This function identifies channels with abnormal spectral power. The freqlims parameter defines the frequency bands to examine, while stdlims sets the corresponding standard deviation (Z-score) thresholds. For example, a channel is marked as bad if its power within a given band exceeds the specified threshold. The names of these bad channels are then added to the raw.info['bads'] list.

  2. ``trimoutlier`` (Trim Outliers): This function detects and marks time segments containing extreme voltage fluctuations, such as those caused by subject movement. The amplitude_threshold parameter sets the absolute voltage value (e.g., 500 µV) above which a data point is considered an artifact. The point_spread_width parameter then marks a time window (in milliseconds) around any detected artifact as “bad.” This padding ensures that the artifact’s onset and offset are also fully excluded from subsequent analyses. The function creates annotations that label these identified periods as “bad.” MNE functions can then be instructed to ignore these segments during later steps like ICA. A simplified sketch of this idea is shown right after this list.
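To make this more concrete, here is a rough sketch of how an amplitude-based rejection like trimoutlier can be built on MNE annotations. This is illustrative only, not the code in preprocessing_utils; the function name is hypothetical, and a real implementation would also merge overlapping windows.

import mne
import numpy as np

def mark_amplitude_outliers(raw, amplitude_threshold, point_spread_width_ms):
    # Flag samples whose absolute amplitude exceeds the threshold on any EEG channel
    data = raw.get_data(picks='eeg')                  # shape (n_channels, n_times)
    half_pad = point_spread_width_ms / 1000.0 / 2.0   # pad half before, half after each bad sample
    bad_idx = np.where(np.any(np.abs(data) > amplitude_threshold, axis=0))[0]
    if len(bad_idx) == 0:
        return raw
    # Turn the flagged samples into "BAD_" annotations so that later steps
    # called with reject_by_annotation=True will skip these windows
    onsets = np.clip(raw.times[bad_idx] - half_pad, 0, None)
    annotations = raw.annotations.copy()
    annotations.append(onset=onsets,
                       duration=np.full(len(onsets), 2 * half_pad),
                       description=['BAD_outlier'] * len(onsets))
    raw.set_annotations(annotations)
    return raw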

[7]:
# Remove Bad Channels Using REJCHANSPEC
freqlims = [[0.00, 5.00],
            [5.00, 40.00]]
stdlims = [[-5.00, 5.00],
        [-2.50,  2.50]]
raw, specdata = utils.rejchanspec(raw.copy(), freqlims, stdlims)
Removed EEG channels: ['FCC1h', 'Fz', 'FC1', 'FCC2h', 'Cz', 'FFC1h', 'FFC2h']
[8]:
# Remove Bad Samples Using TRIMOUTLIER

amplitude_threshold = 500*1e-6
point_spread_width = 2000 # ms

raw = utils.trimoutlier(raw, amplitude_threshold, point_spread_width)
Selected 120 EEG channels (excluding bads)
No channel removed.

0.0005uV threshold with 2000ms spreading rejected 20.4 sec data, added 9 boundaries.
trimOutlier done. The masks for clean channels and data points are stored in the raw object.

Low-Pass Filtering (for ICA)#

Before running ICA, it’s good practice to low-pass filter the data. High-frequency noise (like muscle activity) can sometimes dominate the signal and prevent ICA from finding a good solution. We apply a 100 Hz low-pass filter here to clean up the signal specifically for the ICA decomposition. We will apply a stricter low-pass filter later for our final analysis.

[9]:
# Low-Pass Filtering (<100 Hz)
raw = raw.filter(l_freq = None, h_freq=100, method='fir', fir_window='hamming', n_jobs=-1)
Filtering raw data in 1 contiguous segment
Setting up low-pass filter at 1e+02 Hz

FIR filter parameters
---------------------
Designing a one-pass, zero-phase, non-causal lowpass filter:
- Windowed time-domain design (firwin) method
- Hamming window with 0.0194 passband ripple and 53 dB stopband attenuation
- Upper passband edge: 100.00 Hz
- Upper transition bandwidth: 25.00 Hz (-6 dB cutoff frequency: 112.50 Hz)
- Filter length: 133 samples (0.133 s)
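As an optional check (not part of the original pipeline), the combined effect of the filters can be inspected by plotting the power spectral density, assuming a recent MNE version with the compute_psd API; power should now fall off below 1 Hz and above 100 Hz.

# Optional QC sketch: inspect the spectrum after filtering.
spectrum = raw.compute_psd(fmax=120)
# spectrum.plot()  # commented out, like the other interactive plots in this notebook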

5. Rereferencing#

The voltage at any one electrode is measured relative to a reference. Here, the data was recorded with FCz as the reference electrode. We can add it back to the data as a regular channel and re-reference to the average of all EEG channels.

[10]:
# Bring back FCz channel
raw = mne.add_reference_channels(raw, 'FCz')
raw.set_montage(montage)

# Re-reference to average reference
raw = raw.set_eeg_reference(ref_channels="average")
EEG channel type selected for re-referencing
Applying average reference.
Applying a custom ('EEG',) reference.

6. Independent Component Analysis (ICA) for Artifact Removal#

ICA is a technique for separating statistically independent signals that have been mixed together. In EEG, it’s effective at isolating stereotyped artifacts like:

  • Eye blinks and eye movements

  • Heartbeat signals (ECG)

  • Persistent muscle tension

The process is:

  1. Fit ICA: We decompose the EEG signal into a set of independent components (ICs). We do this only on the “good” data segments, using reject_by_annotation=True.

  2. Label Components: We use the mne-icalabel library to automatically classify each IC into categories like brain, eye blink, muscle, etc. For the best performance, ``ICLabel`` requires data that is average-referenced, filtered between 1-100 Hz, and decomposed using an ``extended infomax`` ICA. We have explicitly followed these steps in our pipeline, which justifies our earlier preprocessing choices and the ICA settings we are about to use.

  3. Remove Artifactual Components: We identify the ICs that were confidently labeled as artifacts (e.g., “eye blink”) and instruct MNE to remove them.

[11]:
# ICA
chans_to_pick = [chan for chan in raw.info['ch_names'] if chan not in raw.info['bads']]
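# Estimate the rank of the good-channel data: the average reference (and any
# dropped channels) reduces the number of linearly independent signals, so we
# fit only that many ICA components.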
rank = np.linalg.matrix_rank(raw.get_data(reject_by_annotation='omit', picks=chans_to_pick).T)

ica = ICA(
    n_components=rank,
    max_iter="auto",
    method="infomax",
    random_state=692,
    fit_params=dict(extended=True),
)

ica.fit(raw, reject=None, reject_by_annotation=True)

# Classify components using mne-iclabel
ic_labels = label_components(raw, ica, method="iclabel")
Omitting 20454 of 1329200 (1.54%) samples, retaining 1308746 (98.46%) samples.
Fitting ICA to data using 121 channels (please be patient, this may take a while)
Omitting 20454 of 1329200 (1.54%) samples, retaining 1308746 (98.46%) samples.
Selecting by number: 120 components
Computing Extended Infomax ICA
Fitting ICA took 1790.4s.

We can visually inspect the identified components. The code below first selects the components that were labeled as “eye blink” with high confidence (over 75%). Then, it generates two types of plots for our inspection:

  • plot_properties(): A detailed summary, including the component’s scalp topography (which should be frontal for blinks) and frequency spectrum.

  • plot_sources(): The component’s activity over time, which should show the characteristic sharp peaks of eye blinks.

[12]:
# ## OPTIONAL: inspect components
# idxs = [idx for idx, (label, proba) in enumerate(zip(ic_labels["labels"],ic_labels['y_pred_proba'])) if label in ["eye blink"] and proba > 0.75]
# ica.plot_properties(raw, picks=idxs, verbose=False)
# ica.plot_sources(raw, picks=idxs)

The following code builds a list of component indices, exclude_idx, containing every component whose label is not "brain" or "other" and whose prediction probability is greater than 75%. Finally, ica.apply() removes the activity of these components from the underlying EEG data.

[13]:

labels = ic_labels["labels"]          # predicted labels
probas = ic_labels['y_pred_proba']    # probability for each component's predicted class

# Exclude components that are not "brain" or "other" and whose
# artifact label has a probability greater than 0.75
exclude_idx = [
    idx for idx, (label, proba) in enumerate(zip(labels, probas))
    if label not in ["brain", "other"] and proba > 0.75
]
ica.apply(raw, exclude=exclude_idx)
Applying ICA to Raw instance
    Transforming to ICA space (120 components)
    Zeroing out 34 ICA components
    Projecting back using 121 PCA components
[13]:
General
Filename(s) sample_subject_oddball.eeg
MNE object type RawBrainVision
Measurement date 2018-10-31 at 11:19:57 UTC
Participant Unknown
Experimenter Unknown
Acquisition
Duration 00:22:10 (HH:MM:SS)
Sampling frequency 1000.00 Hz
Time points 1,329,200
Channels
EEG
Head & sensor digitization 131 points
Filters
Highpass 1.00 Hz
Lowpass 100.00 Hz

7. Final Processing Steps#

With the major artifacts removed, we can perform the final steps to get the data ready for epoching.

Final Low-Pass Filter#

We now apply a stricter low-pass filter; a cutoff around 40 Hz is typically sufficient for ERP analysis.

[14]:
# Low-Pass Filtering (<40 Hz)
raw = raw.filter(l_freq = None, h_freq=40, method='fir', fir_window='hamming', n_jobs=-1)
Filtering raw data in 1 contiguous segment
Setting up low-pass filter at 40 Hz

FIR filter parameters
---------------------
Designing a one-pass, zero-phase, non-causal lowpass filter:
- Windowed time-domain design (firwin) method
- Hamming window with 0.0194 passband ripple and 53 dB stopband attenuation
- Upper passband edge: 40.00 Hz
- Upper transition bandwidth: 10.00 Hz (-6 dB cutoff frequency: 45.00 Hz)
- Filter length: 331 samples (0.331 s)

Interpolate Bad Channels and Resample#

Two final steps:

  1. Interpolation: Now that the data is clean, we can fill in the channels we marked as “bad” earlier. The interpolate_bads() function uses data from neighboring channels to estimate the signal at the bad locations. We then set reset_bads=True, since the channels have now been repaired.

  2. Resampling: Our original sampling rate was 1000 Hz, which is higher than needed for ERP analysis. We downsample the data to 256 Hz. This significantly reduces the file size and computational load for subsequent steps without losing important information, thanks to the Nyquist theorem and our 40 Hz low-pass filter (a quick check follows this list).
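As a quick sanity check of the arithmetic behind this choice (a sketch, not part of the pipeline):

new_sfreq = 256          # target sampling rate in Hz
lowpass_cutoff = 40      # final low-pass cutoff in Hz
assert new_sfreq / 2 > lowpass_cutoff  # Nyquist frequency (128 Hz) comfortably above the cutoff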

[15]:
# Interpolate Bad Channels
raw = raw.interpolate_bads(reset_bads=True, mode='spline')

# Downsample to 256 Hz
raw = raw.resample(sfreq=256)

Setting channel interpolation method to {'eeg': 'spline'}.
Interpolating bad channels.
    Automatic origin fit: head of radius 99.6 mm
Computing interpolation matrix from 121 sensor positions
Interpolating 7 sensors

Final Check for Artifacts and Saving#

As a final quality control check, we can run trimoutlier with a stricter threshold to catch any remaining artifacts. Finally, it’s excellent practice to save the preprocessed continuous data. We save it in MNE’s native .fif format.

[16]:
# Final checks for Bad Samples Using TRIMOUTLIER
channel_sd_lower_bound = -np.inf
channel_sd_upper_bound = np.inf
amplitude_threshold = 100*1e-6
point_spread_width = 1000 # 2000

raw = utils.trimoutlier(raw, amplitude_threshold, point_spread_width)
Selected 128 EEG channels (excluding bads)
No channel removed.

9.999999999999999e-05uV threshold with 1000ms spreading rejected 4.3 sec data, added 4 boundaries.
trimOutlier done. The masks for clean channels and data points are stored in the raw object.

Let’s inspect the results of the preprocessing by plotting the raw data again.

[17]:
# raw.plot()

Upon closer inspection, we can still see some smaller, transient artifacts that were not caught by the previous threshold (for example, muscle artifacts around 545 s and 648 s, or increased slow-wave activity related to tiredness around 457 s). To ensure our data is as clean as possible, we will run the trimoutlier function one more time with a stricter rejection threshold.

[18]:
amplitude_threshold = 50*1e-6
point_spread_width = 1000 # 2000

raw = utils.trimoutlier(raw, amplitude_threshold, point_spread_width)
Selected 128 EEG channels (excluding bads)
No channel removed.

4.9999999999999996e-05uV threshold with 1000ms spreading rejected 45.1 sec data, added 36 boundaries.
trimOutlier done. The masks for clean channels and data points are stored in the raw object.
[19]:
# Save the raw object data
filepath_save = join(subjects_dir, subject,'_eeg','_pre')
#raw.save(join(filepath_save, filename.replace('.vhdr', '_preprocessed_raw.fif')), overwrite=True)

8. Epoching and Baseline Correction#

The final step is to segment the continuous data into epochs (or trials) time-locked to the events of interest in our oddball paradigm.

  • Extract Events: We first parse the event markers (annotations) from the raw data, mapping the trigger codes (e.g., ‘S 5’) to meaningful condition names (standard, deviant, target).

  • Create Epochs: We then slice the data from -200 ms before each event to 800 ms after.

  • Baseline Correction: For each epoch, we calculate the average signal in the pre-stimulus period (-200 ms to 0 ms) and subtract this average from the entire epoch. This ensures that any differences we see after the stimulus are not due to random fluctuations from before the stimulus appeared (a minimal sketch of this operation follows the list).

  • Drop Bad Epochs: Any epochs that overlap with time periods we previously marked as “bad” are automatically discarded using reject_by_annotation=True.
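
To make the baseline correction concrete, here is a minimal NumPy sketch of what baseline=(None, 0) does to a single epoch; epoch and times are hypothetical arrays, and mne.Epochs handles this internally.

import numpy as np

def baseline_correct(epoch, times, baseline=(-0.2, 0.0)):
    # epoch: array of shape (n_channels, n_times); times: matching time vector in seconds
    mask = (times >= baseline[0]) & (times <= baseline[1])
    baseline_mean = epoch[:, mask].mean(axis=1, keepdims=True)  # per-channel pre-stimulus mean
    return epoch - baseline_mean                                # subtract it from the whole epoch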

The result is an Epochs object, which is the final, analysis-ready data structure. We save this object to a file for the next stage of analysis.

[20]:
# Divide data into epochs with DC detrend and -0.2 - 0s baseline correction
all_events, all_event_id = mne.events_from_annotations(raw, event_id={'Stimulus/S  5': 3, 'Stimulus/S  6': 4,
                                                                        'Stimulus/S  7': 5})

epoched = mne.Epochs(raw, all_events, event_id=dict(standard=3, deviant=4, target=5), tmin=-0.2, tmax=0.8,
                        baseline=(None, 0), reject_by_annotation=True, detrend=0, proj=False,
                        reject=None, preload=True)
epoched.drop_bad()
Used Annotations descriptions: [np.str_('Stimulus/S  5'), np.str_('Stimulus/S  6'), np.str_('Stimulus/S  7')]
Not setting metadata
660 matching events found
Setting baseline interval to [-0.19921875, 0.0] s
Applying baseline correction (mode: mean)
Using data from preloaded Raw for 660 events and 257 original time points ...
39 bad epochs dropped
[ ]:
epoched.save(join(filepath_save, filename.replace('.vhdr', '_epo.fif')), overwrite=True)