July 7, 2025
Day 30 - Building and Debugging a Multimodal 1D CNN + Transformer for ECG Diagnosis
What I Learned
Today at my internship, I worked on one of the most complex and technically demanding parts of our deep learning project so far: developing a hybrid 1D CNN and Transformer model for ECG classification using the PTB-XL dataset. The goal of this model is to analyze electrocardiogram (ECG) signals from multiple perspectives: the raw time-domain waveform, the frequency-domain representation (obtained through a Fast Fourier Transform), and patient metadata such as age and gender. Each of these data modalities offers a unique view of the heart’s electrical activity, and our aim is to combine them to improve classification accuracy.

To achieve this, I designed a multimodal model using TensorFlow’s Functional API. The model takes three separate inputs: one for the raw signal, one for the FFT-transformed frequency features, and one for the metadata. Each input is passed through its own sub-network. The time- and frequency-domain signals are processed by separate stacks of 1D Convolutional Neural Network (CNN) layers followed by Transformer encoder layers, allowing the model to learn both local features and long-range dependencies in the sequences, while the metadata is passed through dense layers to integrate non-signal information. The outputs of the three branches are then concatenated and fed through fully connected layers to produce the final classification.
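To make that structure concrete, here is a minimal sketch of the three-branch layout in Keras. The input shapes (1000 × 12 time samples, 500 FFT bins, 2 metadata fields), the number of output classes, and all layer sizes are illustrative placeholders, not the exact values from our project.

```python
# Minimal sketch of the three-branch multimodal model. Shapes and layer
# sizes are assumptions for illustration, not the project's real config.
import tensorflow as tf
from tensorflow.keras import layers, Model

def transformer_encoder(x, num_heads=4, key_dim=32, ff_dim=128, dropout=0.1):
    # Self-attention block with residual connections and layer normalization.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + layers.Dropout(dropout)(attn))
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(x + layers.Dropout(dropout)(ff))

def cnn_transformer_branch(inp):
    # 1D convolutions capture local waveform morphology; pooling shortens the
    # sequence before the Transformer models long-range dependencies.
    x = layers.Conv1D(64, kernel_size=7, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = transformer_encoder(x)
    return layers.GlobalAveragePooling1D()(x)

# Three separate inputs, one per modality (assumed shapes).
time_in = layers.Input(shape=(1000, 12), name="time_signal")
freq_in = layers.Input(shape=(500, 12), name="fft_features")
meta_in = layers.Input(shape=(2,), name="metadata")

time_feat = cnn_transformer_branch(time_in)
freq_feat = cnn_transformer_branch(freq_in)
meta_feat = layers.Dense(32, activation="relu")(meta_in)

# Fuse the pooled branch features and classify.
fused = layers.Concatenate()([time_feat, freq_feat, meta_feat])
fused = layers.Dense(128, activation="relu")(fused)
fused = layers.Dropout(0.3)(fused)
out = layers.Dense(5, activation="sigmoid", name="diagnosis")(fused)

model = Model(inputs=[time_in, freq_in, meta_in], outputs=out)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
```

The key idea is that each modality gets its own encoder, and only the pooled feature vectors are concatenated before the final dense classifier.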
Blockers
No Blockers
Reflection
Today was one of those days where things didn’t go as planned, but I ended up learning more than I expected. The task seemed clear when I began: continue building and training a deep learning model that combines 1D CNNs and Transformers to classify ECG signals from the PTB-XL dataset. I was excited because the model design was ambitious: a multimodal approach that merges raw time-domain signals, frequency-domain FFT features, and patient metadata into one unified classifier. But as I quickly learned, ambition often brings complexity.

I started by finalizing the architecture. Using TensorFlow’s Functional API, I structured the model to handle three separate inputs. Each branch processed a different data modality, and the outputs were meant to converge in a dense fusion layer. In theory, it was elegant. In practice, I hit a wall. Training failed immediately with a shape mismatch error, something like “expected shape (None, 250, …), got something else.” At first, I assumed it would be a quick fix: maybe a reshape here or a padding layer there. But as I traced through the pipeline, I realized this was deeper than a surface-level bug. The frequency-domain features I had extracted didn’t match the model’s expectations; somewhere between preprocessing and model compilation, the data had drifted from the format I assumed it was in.

That discovery was frustrating, but also humbling. I’d forgotten how easy it is to introduce silent inconsistencies when handling multiple data sources. I had focused so much on designing a powerful model that I didn’t double-check whether the data I was feeding it still aligned with those assumptions. It reminded me that deep learning isn’t just about flashy architectures; it’s about plumbing, making sure everything fits together underneath.
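In hindsight, a simple pre-flight check would have caught the mismatch before training even started. Here is a rough sketch of that kind of check, reusing the `model` object from the sketch above; the array names and the off-by-one FFT length are hypothetical, just to show how a preprocessing drift surfaces.

```python
# Hypothetical sanity check: compare each preprocessed array against the
# shape the model actually declares before calling fit(). Variable names
# (X_time, X_freq, X_meta) are illustrative, not the project's real ones.
import numpy as np

def check_inputs(model, arrays):
    """Compare batch arrays to the model's declared input shapes."""
    for inp, arr in zip(model.inputs, arrays):
        expected = tuple(inp.shape[1:])   # drop the batch dimension
        actual = tuple(arr.shape[1:])
        status = "OK" if expected == actual else "MISMATCH"
        print(f"{inp.name}: expected {expected}, got {actual} -> {status}")

# Dummy data shaped like the sketch above. A mis-sized FFT array
# (e.g. 501 bins from an rfft of 1000 samples instead of the 500 the
# model was built for) shows up here, not as a crash mid-training.
X_time = np.zeros((8, 1000, 12), dtype=np.float32)
X_freq = np.zeros((8, 501, 12), dtype=np.float32)   # deliberately off by one
X_meta = np.zeros((8, 2), dtype=np.float32)
check_inputs(model, [X_time, X_freq, X_meta])
```

It is exactly the kind of boring plumbing step I skipped today, and exactly the kind I plan to build into the pipeline from now on.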