Understanding CNNs for text classification

Shashwat Koliwad
Mar 31, 2019 · 4 min read
  • A summary of the 2018 arXiv paper “Understanding Convolutional Neural Networks for Text Classification”.

The Task:

The paper presents an analysis of the inner workings of Convolutional Neural Networks (CNNs) for processing text.

A bit of introduction to the model:

CNNs were originally developed for computer vision, but they are known to achieve strong performance on NLP (Natural Language Processing) tasks, even with relatively simple one-layer models. The ability to interpret a CNN can be used to increase trust in its predictions, analyze errors, and improve the model. Model interpretability here means a structured explanation that captures what the model has learned.

In this paper, the author attempts to understand how CNNs process text, and then uses that understanding for the more practical goals of improving model-level and prediction-level interpretability.

CNNs classify text by working through the following steps (a minimal code sketch follows the list):

  1. 1-dimensional convolving filters are used as n-gram detectors, each filter specializing in a closely-related family of n-grams.
  2. Max-pooling over time extracts the relevant n-grams for making a decision.
  3. The rest of the network classifies the text based on this information.
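As a rough sketch, this is what such a one-layer model can look like in PyTorch. The vocabulary size, embedding dimension, number of filters, and n-gram length below are illustrative choices, not the paper’s configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerTextCNN(nn.Module):
    """Minimal one-layer CNN text classifier (hyperparameters are illustrative)."""
    def __init__(self, vocab_size=10000, emb_dim=300, num_filters=100,
                 ngram_len=3, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Each filter spans `ngram_len` consecutive word embeddings -> an n-gram detector.
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=ngram_len)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                     # (batch, emb_dim, seq_len)
        feats = torch.relu(self.conv(x))          # (batch, num_filters, num_ngrams)
        pooled, _ = feats.max(dim=2)              # max-over-time: strongest n-gram per filter
        return self.fc(pooled)                    # class scores; softmax is applied afterwards

model = OneLayerTextCNN()
logits = model(torch.randint(0, 10000, (4, 50)))  # 4 documents of 50 tokens each
probs = F.softmax(logits, dim=1)                  # per-class probabilities
```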

To understand this paper, one must be familiar with how CNNs process text, summarized in the equations below.

(Figure from the paper: steps of training.)
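The following is a sketch of those equations in LaTeX, reconstructed from the description that follows; the symbols u_i, f_j, and o are shorthand introduced here, while F_ij, p_j, and W follow the text:

```latex
% n-gram of length \ell starting at position i: concatenation of the word embeddings
u_i = [w_i ; w_{i+1} ; \dots ; w_{i+\ell-1}]

% convolution: the n-gram's score under filter f_j
F_{ij} = \langle u_i, f_j \rangle

% max-pooling over time: strongest n-gram per filter
p_j = \max_i F_{ij}

% classification: p collects all p_j; W maps it to c class scores
o = \mathrm{softmax}(W p)
```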

The first equation denotes an n-gram of length ℓ: each word in the n-gram is represented by its embedding vector, and these vectors are concatenated into a single vector.

Convolution is then performed on the n-gram with each filter to obtain the activation F_ij.

Max-pooling over time is then applied: for each filter, the activations F_ij across all n-grams are collapsed into a single value p_j = max_i F_ij (i is the index of the n-gram, j is the index of the filter).

Let’s consider the collection of p_j values across all filters as the vector p, and assume we are solving a text classification task with c classes. A transformation W is applied to p such that Wp has dimension c × 1. Applying a softmax layer then outputs the probability of each class for the document.

The author denotes the set of n-grams contributing to p as S_p. Conceptually, these n-grams are separated into two classes: deliberate and accidental. Deliberate n-grams are scored highly by their filter, while accidental n-grams receive a low score. The author defines a threshold for each filter and observes the performance of the model.
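As a toy illustration of this idea (not the paper’s exact procedure for choosing the threshold), the n-grams selected by max-pooling could be split into the two classes as follows, assuming per-filter thresholds are already given; the function and variable names are hypothetical:

```python
import numpy as np

def split_pooled_ngrams(acts, ngrams, thresholds):
    """acts: (num_ngrams, num_filters) activation matrix F for one document.
    ngrams: the document's n-grams as strings (length num_ngrams).
    thresholds: per-filter threshold values, shape (num_filters,).
    Returns, for each filter, the n-gram chosen by max-pooling, labelled
    'deliberate' if its activation clears the filter's threshold, else 'accidental'."""
    winners = acts.argmax(axis=0)            # n-gram index selected by max-pooling, per filter
    results = []
    for j, i in enumerate(winners):
        label = "deliberate" if acts[i, j] >= thresholds[j] else "accidental"
        results.append((j, ngrams[i], acts[i, j], label))
    return results

# Toy example: 5 n-grams, 3 filters, illustrative thresholds.
acts = np.random.randn(5, 3)
ngrams = ["had no issues", "the battery life", "was very bad",
          "works as expected", "broke after a"]
print(split_pooled_ngrams(acts, ngrams, thresholds=np.array([0.5, 0.5, 0.5])))
```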

With this threshold, model performance improved slightly over the previous results. This is demonstrated on the MR dataset, which contains snippets of positive and negative movie reviews; the detailed results are reported in the paper.

The author comments on the observed results as follows.

We look at the set of deliberate n-grams: Common intuition suggests that each filter is homogeneous and specializes in detecting a specific class of n-grams. For example, a filter may be specializing in detecting n-grams such as “had no issues”, “had zero issues”, and “had no problems”. We challenge this view and show that filters often specialize in multiple distinctly different semantic classes by utilizing activation patterns which are not necessarily maximized.

The author also distinguishes between naturally occurring and possible n-grams. Natural n-grams are those observed in a large corpus, while possible n-grams are any combination of ℓ words from the vocabulary; natural n-grams are therefore a subset of possible n-grams. The paper includes a table of top-scoring n-grams for the Elec model (a dataset of electronics product reviews).

Conclusion:

The author’s conclusions can be summarized as follows:

The ability of max-pooling to pick out the relevant features can be used to identify which n-grams are important to the classification. An n-gram’s score can be decomposed into word-level scores by treating the convolution of a filter as a sum of word-level convolutions, which allows examining the word-level composition of the activation. By clustering high-scoring n-grams, the different semantic classes a filter has learned to detect can be identified. All of these findings can be used to suggest improvements to model-level and prediction-level interpretability of CNNs for text.
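For example, the word-level decomposition mentioned above follows from the fact that the filter is applied to the concatenation of the word embeddings, so its dot product splits into one term per word slot. A minimal sketch (shapes and names are illustrative):

```python
import numpy as np

emb_dim, ngram_len = 300, 3
rng = np.random.default_rng(0)

filter_weights = rng.standard_normal(ngram_len * emb_dim)  # one convolving filter f_j
ngram_words = rng.standard_normal((ngram_len, emb_dim))    # embeddings of the n-gram's words

# Full n-gram score: the filter applied to the concatenated embeddings.
full_score = filter_weights @ ngram_words.reshape(-1)

# Word-level scores: the same dot product, split into one slice of the filter per slot.
slot_filters = filter_weights.reshape(ngram_len, emb_dim)
word_scores = (slot_filters * ngram_words).sum(axis=1)     # one score per word/slot

assert np.isclose(full_score, word_scores.sum())           # the slot scores sum to the n-gram score
print(word_scores)
```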
