Regression with Text Input Using BERT and Transformers

La Javaness R&D
12 min read · Mar 11, 2022

1. Introduction

Regression, the prediction of numerical values, is one of the most fundamental tasks in machine learning. Linear regression is often the first model introduced in beginner data science courses, and an MLP (multilayer perceptron) for regression is often the first model one uses to discover the world of deep learning.

Structured and Unstructured Input

As a data scientist, one should be familiar with regression problems with structured (or tabular) data. In other words, each input is a tuple (row) of numbers or categories (cells) that can be placed into separate fields (columns). For example, to predict the price of some apartments, we could imagine a table where each row represents an apartment and each column an attribute associated with it: year of construction, area, distance to the city centre, energy consumption, availability of parking space etc.

In some cases, we need to predict numerical values from unstructured data: text, images, speech etc. Here are three real-life use case examples:

  • (1) Predict the emotion level of text or speech. For example, we want to assign each text a score between 0 (extremely negative) and 10 (extremely positive), and a regression model should be able to predict scores for new text. This is also an example of ordinal regression (or ordinal classification), where there exists an order between the classes.
  • (2) Predict house prices from description text. For example, as described in this article [4], we want to use the description section to evaluate house prices. In principle, we should solve the problem by using a sequence-to-sequence model to extract the numeric/categorical variables and then feed the structured data to a regression model. However, end-to-end regression models can also be used directly to “translate” text into prices, as shown in the article.
  • (3) Predict people’s age or the price of valuable assets (clothes, bags) from images.

Regression with Text

Thanks to the revolutionary attention mechanisms introduced in 2017, the BERT architecture using this mechanism, and its implementation in the transformers library, we have a powerful solution to deal with text regression. This article discusses regression using BERT and transformers to score emotion levels in a text (the problem described in example 1 above).

If you are familiar with the Hugging Face model hub, you have seen the various NLP tasks listed in its interface (Models, Hugging Face [3]).

Figure 1 — Tasks supported by Huggingface model hub

Surprisingly, regression is not one of them.

Text regression is not far from text classification. Therefore, we can slightly modify some parts of the text classification scheme to make regression work, which is the primary goal of this tutorial article.

The rest of this article is organised as follows:

  • Section 2 describes the dataset used for illustration.
  • Section 3 recalls how to fine-tune a text-classification model on our dataset.
  • Section 4 describes the modification required to convert the problem into regression.
  • Section 5 discusses the main differences and uses of text classification and text regression models.

2. Dataset

In our internal R&D project, we constructed a French dataset based on public service reviews from Google Maps and Trustpilot, as described in an article by my colleague AL Houceine. The project includes a NER model to detect various kinds of emotion, a classification model to detect the causes linked with those emotions, and a regression model to score the global emotion level. For the (ordinal) regression problem, we annotate each item with one of the following integer scores:

  • 0 (very negative)
  • 1 (negative)
  • 2 (neutral)
  • 3 (positive)
  • 4 (very positive)

We also mask people’s names for privacy reasons. In this article, we only publish the preprocessed datasets for regression, split into three .jsonlines files for train, validation and test, each containing 660, 142 and 142 items, respectively (70%, 15% and 15% of the original dataset). The datasets are available at:

A row of an arbitrary dataset looks like this:

{"id": 457, "text": "Trop d\u00e9sagr\u00e9able au t\u00e9l\u00e9phone \ud83d\ude21! ! !", "uuid": "91c4efaaada14a1b9b050268185b6ae5", "score": 1}

The models only focus on the fields text (raw text) and score (annotated score).

All code blocks in this article should be executed in a Jupyter notebook with a Python 3 kernel. First, let’s load the datasets using Huggingface’s datasets library.
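As a minimal sketch, assuming each file stores one JSON object per line as in the sample row above, a split can be read with the standard library alone. With the datasets library, the usual route is load_dataset("json", data_files=...). The helper name read_jsonlines and the file name demo.jsonlines are our own assumptions, used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_jsonlines(path):
    """Read a .jsonlines file into a list of dicts, one per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# The sample row shown above, written to a temporary file for demonstration.
sample_line = (
    r'{"id": 457, "text": "Trop d\u00e9sagr\u00e9able au t\u00e9l\u00e9phone '
    r'\ud83d\ude21! ! !", "uuid": "91c4efaaada14a1b9b050268185b6ae5", "score": 1}'
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonlines"  # hypothetical file name
    demo_path.write_text(sample_line + "\n", encoding="utf-8")
    rows = read_jsonlines(demo_path)

# Only the "text" and "score" fields are used by the models.
print(sorted(rows[0]))
print(rows[0]["score"])
```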


Let’s have a look at a row in any dataset.



Let’s quickly analyse the class (score) distribution in each dataset.
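One possible way to compute such a distribution, sketched here on made-up rows rather than on the real splits, is to count the score field of each row. The function name score_distribution and the toy data are assumptions for illustration.

```python
from collections import Counter


def score_distribution(rows):
    """Return {score: share of rows} for a list of annotated rows."""
    counts = Counter(row["score"] for row in rows)
    total = sum(counts.values())
    return {score: counts[score] / total for score in sorted(counts)}


# Hypothetical toy rows, not the real dataset.
toy_rows = [{"score": s} for s in [1, 1, 1, 0, 0, 2, 3, 1, 4, 2]]
distribution = score_distribution(toy_rows)
for score, share in distribution.items():
    print(f"score {score}: {share:.0%}")
```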

Figure 2 — Distribution of classes on each set

The distribution on the three splits seems to be similar: lots of “negative” ratings, then “very negative”, “neutral”, “positive” and finally very few “very positive” ratings.

Now we can move on to modelling. Formally, this is an ordinal regression problem. To use the BERT implementation in transformers, we can think of two modelling approaches:

  • As a classification problem: A text will belong to one of the five classes 0 to 4.
  • As an ordinal regression problem: A text will get a score, typically around the interval [0, 4].

Sections 3 and 4 present these two methods, respectively.

3. Fine-tuning with a Text Classification Model


Fine-tuning a pretrained model on a downstream task with transformers is a common operation; you can review it in the Hugging Face tutorial. As the main goal of this article is to perform a regression task (section 4), we only briefly recall the classification task in this section as a reference.

To set up, we define some constants and objects that reflect our needs:

  • A French language model: camembert-base, wrapped in an AutoModelForSequenceClassification object
  • A French tokeniser: camembert-base, wrapped in an AutoTokenizer object
  • A DataCollatorWithPadding to add padding, so that all texts in a batch have the same length
  • A DataLoader to feed the data batch by batch during training (so we do not run into memory issues)
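To make the role of padding concrete, here is a small pure-Python sketch of what padding a batch amounts to. It is not the DataCollatorWithPadding implementation; the function name pad_batch, the token ids and the pad id are assumptions for illustration.

```python
def pad_batch(token_id_lists, pad_token_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = [
        ids + [pad_token_id] * (max_len - len(ids)) for ids in token_id_lists
    ]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_id_lists
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; the pad id depends on the tokeniser in use.
batch = pad_batch([[5, 8, 13, 6], [5, 21, 6]], pad_token_id=1)
print(batch["input_ids"])       # [[5, 8, 13, 6], [5, 21, 6, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps short batches short, which is one reason a collator is convenient here.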

Now, we load the model and the tokeniser. (We will see some warnings “Some weights of the model checkpoint at camembert-base were not used when initialising CamembertForSequenceClassification", which is OK since the model has not been trained for the classification task.)


Prepare Datasets

We tokenise the dataset by calling tokenizer. Then, we associate the label attribute to each dataset item.


We can compute metrics to track the model’s improvement during training. Here we retrieve the class with the highest logit (corresponding to the highest probability) for each prediction and compare it with the actual label to calculate the global accuracy score.
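A minimal sketch of such a metric function, written with NumPy only, could look as follows. It assumes the function receives a (logits, labels) pair, which is how Trainer passes evaluation predictions; the name compute_metrics_for_classification and the toy values are our own.

```python
import numpy as np


def compute_metrics_for_classification(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # The class with the highest logit also has the highest softmax probability.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == np.asarray(labels)))}


# Hypothetical logits for three texts and five classes.
toy_logits = np.array([
    [0.1, 2.0, 0.3, 0.0, -1.0],
    [1.5, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.3, 3.0],
])
toy_labels = np.array([1, 0, 3])
metrics = compute_metrics_for_classification((toy_logits, toy_labels))
print(metrics)
```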

We put the output directory for the trained model and the learning parameters into TrainingArguments. With load_best_model_at_end and metric_for_best_model, we will keep several best models (i.e. those with the highest accuracy on the validation set) during training and load the best model at the end.
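The configuration described above can be sketched as a dictionary of keyword arguments for TrainingArguments. The hyper-parameter values and the output path are illustrative assumptions, not the values used in our experiments. Note also that the name of the evaluation-schedule argument depends on the installed transformers release.

```python
# Illustrative values only; adapt them to your own experiments.
training_kwargs = {
    "output_dir": "models/camembert-classifier",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 20,
    "save_strategy": "epoch",         # save a checkpoint after every epoch
    "save_total_limit": 2,            # keep only a few checkpoints on disk
    "load_best_model_at_end": True,   # reload the best checkpoint after training
    "metric_for_best_model": "accuracy",
}

# Evaluation must follow the same schedule as saving. The argument is called
# `evaluation_strategy` in older transformers releases and `eval_strategy` in
# newer ones, so the key is set separately here.
eval_strategy_key = "evaluation_strategy"
training_kwargs[eval_strategy_key] = "epoch"

# With transformers installed: args = TrainingArguments(**training_kwargs)
print(training_kwargs)
```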


Combining everything in a Trainer, we start the training:


Note that we rely on the validation set’s accuracy to retrieve the best model. Calling Trainer.evaluate(), we can retrieve the best accuracy attained during training, which is 0.683 (at epoch 16).



In real projects, we need an independent test set to re-evaluate the model. That’s what we do here.



That is it; we have a fine-tuned classifier ready for our use cases. We can call the tokeniser, then the model, to predict a single case.


tensor([3, 2, 1, 0, 4], device='cuda:0')

The predictions seem reasonable. Our classifier is ready; let’s move on to the regression model.

4. Fine-tuning with a Regression Model

To build a regression model, we can reuse the whole architecture of the classification one. Indeed, just like the difference between linear regression and logistic/softmax regression, or between a 2-layer MLP for regression and a 2-layer MLP for classification (explained, for example, in Chapter 3 and Chapter 4 of the well-known book Dive into Deep Learning [8]), BERT-based regressors differ from classifiers in only a few points:

  • The number of output logits: 1 unit for the regressor vs 5 units (the number of classes in our problem) for the classifier.
  • The loss function: for example, the softmax cross-entropy loss for a multiclass classifier vs the mean-squared error loss for the regressor.
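To illustrate the two differences on a single example, the following NumPy sketch computes a cross-entropy loss from five logits and a squared error from one output. The numerical values and function names are hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example: several logits in, an integer class as target."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])


def squared_error(prediction, target):
    """Squared error of one example: a single output, a float score as target."""
    return float((prediction - target) ** 2)


classifier_logits = np.array([0.2, 1.5, 0.3, -0.4, -1.0])  # 5 units, hypothetical
regressor_output = 1.4                                      # 1 unit, hypothetical

print(softmax_cross_entropy(classifier_logits, label=1))
print(squared_error(regressor_output, target=1.0))
```

Averaging the squared error over a batch gives the mean-squared error used to train the regressor.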

Next, we can add additional metrics for the regressor. Accuracy, for example, does not make sense when discussing house price prediction. Instead, we talk about how close our prediction is to the true value, so the metrics should be the mean-squared error (MSE), the mean absolute error (MAE) or the R2 score.

It suffices to find the right code lines to accommodate these changes. Firstly, let’s copy the setup code for classifiers and change the number of output logits to 1.

Set up

Prepare Datasets

There is one thing to change in this part: the label is no longer a category (represented by an integer); it is a real number that one can use to add, subtract, multiply etc. with the predicted logits. That is why we need to convert label into float(label) as below.
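A minimal sketch of this conversion, assuming each item stores its annotation in the score field as in our dataset, is shown below. The function name add_float_label is hypothetical; a function of this shape can be applied to every item with the map method of a datasets Dataset, alongside tokenisation.

```python
def add_float_label(example):
    """Copy the integer `score` into a float `label` suitable for regression."""
    return {**example, "label": float(example["score"])}


row = {"text": "Trop désagréable au téléphone", "score": 1}
converted = add_float_label(row)
print(converted["label"])  # 1.0
```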


We define several metrics: MSE, MAE and R2 score (though we do not need to use them all) in a function compute_metrics_for_regression and use it later in training args.

To compare with the classification model, let’s also define a notion of “accuracy”: For any score predicted by the regressor, let’s round it (assign it to the closest integer) and assume that is its predicted class. We compare the predicted class and the actual class to build the overall accuracy score.
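The metrics above can be sketched with NumPy alone, as below. This is one possible implementation of compute_metrics_for_regression, assuming it receives a (logits, labels) pair with a single logit per item; the toy values are hypothetical.

```python
import numpy as np


def compute_metrics_for_regression(eval_pred):
    """MSE, MAE, R2 and a rounded 'accuracy' from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.asarray(logits, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)

    errors = predictions - labels
    mse = float(np.mean(errors ** 2))
    mae = float(np.mean(np.abs(errors)))
    total_sum_squares = float(np.sum((labels - labels.mean()) ** 2))
    if total_sum_squares > 0:
        r2 = 1.0 - float(np.sum(errors ** 2)) / total_sum_squares
    else:
        r2 = float("nan")  # R2 is undefined when all labels are identical
    # Round each score to the nearest integer and treat it as the predicted class.
    accuracy = float(np.mean(np.rint(predictions) == labels))
    return {"mse": mse, "mae": mae, "r2": r2, "accuracy": accuracy}


toy_logits = np.array([[0.2], [1.4], [2.6], [3.9]])  # one output unit per text
toy_labels = np.array([0.0, 1.0, 2.0, 4.0])
metrics = compute_metrics_for_regression((toy_logits, toy_labels))
print(metrics)
```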

The training arguments remain the same as for the classifier.

Loss Function

In the case of the AutoModelForSequenceClassification used in the last section for classification, if our output layer has only 1 logit, the Mean Squared Error (MSE) will be applied. So we don’t have to change anything in the default Trainer and can use Trainer to train our regressor.

However, to keep the idea general in case you want to do regression on more than 1 output logit or use another loss function, there are two ways to implement the loss function:

  • Use a Callback
  • Write a custom class that extends Trainer (let's call it RegressionTrainer), in which we override compute_loss to compute the mean-squared loss with torch.nn.functional.mse_loss.

We will illustrate with approach 2, which is more straightforward. It reimplements the MSE loss. You can replace the loss with any custom loss function you employ:

Do not forget the line return (loss, outputs) if return_outputs else loss: Trainer calls compute_loss in both modes, so both output formats are required.
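A possible shape for such a class is sketched below. It assumes a model with a single output unit whose first returned element is the logits tensor, and it accepts extra keyword arguments because the exact signature of compute_loss varies between transformers releases.

```python
import torch
from transformers import Trainer


class RegressionTrainer(Trainer):
    """Trainer whose loss is the mean-squared error on a single output unit."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments passed by some transformers versions.
        inputs = dict(inputs)          # shallow copy: do not mutate the batch
        labels = inputs.pop("labels")  # keep labels out of the forward pass
        outputs = model(**inputs)
        logits = outputs[0][:, 0]      # shape (batch,): the single output unit
        loss = torch.nn.functional.mse_loss(logits, labels.to(logits.dtype))
        return (loss, outputs) if return_outputs else loss
```

To use another loss, only the mse_loss line needs to change; the class is then instantiated with the same arguments as a regular Trainer.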


Everything is ready. We start the training:

Note that the validation loss equals the MSE metric: although they are implemented in different functions, they refer to the same notion.

Evaluation on Test Set

On the test set, the accuracy is 0.739, also close to the classifier in section 3.


Analysis on Mistakes

Let’s take a look at where the regressor makes mistakes. We split the test set into small batches to perform the prediction. Then, we display the (rounded) predicted score and the correct score in a pandas DataFrame for better comparison.


We see that when the model makes mistakes, in most cases it confuses neighbouring classes (0 and 1, 1 and 2, 3 and 4, but rarely 1 and 4 or 0 and 3). We can verify this using the confusion matrix: most non-zero entries are on the main diagonal and the two neighbouring diagonals.
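As a sketch of this check, the snippet below builds a confusion matrix with NumPy and measures the share of items lying on the main diagonal or one of its two neighbours. The scores are made up, and the helper names are our own; clipping the rounded scores to the range 0 to 4 is an added assumption so that every prediction maps to a valid class.

```python
import numpy as np


def confusion_matrix(true_classes, predicted_classes, n_classes=5):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (np.asarray(true_classes), np.asarray(predicted_classes)), 1)
    return matrix


def share_within_one_class(matrix):
    """Fraction of items on the main diagonal or its two neighbouring diagonals."""
    rows, cols = np.indices(matrix.shape)
    return float(matrix[np.abs(rows - cols) <= 1].sum() / matrix.sum())


# Hypothetical regressor scores, rounded and clipped to the classes 0..4.
true_classes = np.array([0, 1, 1, 2, 3, 4, 1, 0])
scores = np.array([0.4, 1.2, 0.3, 2.1, 3.6, 3.2, 2.9, -0.2])
predicted_classes = np.clip(np.rint(scores), 0, 4).astype(int)

matrix = confusion_matrix(true_classes, predicted_classes)
print(matrix)
print(share_within_one_class(matrix))
```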


What can we conclude? Although modelled as a regressor, the model also performs well on the classification task with rather good accuracy. In the last section, we present some general observations of these two problems.

5. Classifier vs Regressor

Other experiments

In a client’s project, our team was asked to implement a sentiment scoring task using both classification and ordinal regression models. We also needed to try various configurations:

  • Backbone model: CamemBERT and FlauBERT
  • Extensive hyper-parameter tuning: Try as many combinations of learning parameters (learning rate, gradient clip value etc.) as possible.

We performed this task on a 1,700-item dataset in which the annotations were validated by at least two annotators. (The labelling of the entire dataset is consistent.)

We concluded that the best regressor’s performance is the same as the best classifier (~72% accuracy, 66% for the macro F1). Modelling as a classifier or regressor doesn’t really matter here. In fact, the CamemBERT architecture seems to be the key factor behind this performance.

Inter-convertible Outputs between two Models

In our client’s project, we compared the performance of classification and regression models. In the previous section, we explained how to use “accuracy” as a measure of comparison. However, “accuracy” is a notion tied to classification problems, so this comparison is somewhat biased toward classification models. We can also go in the opposite direction and convert the classifier’s output to the regressor’s format. This leads to the problem of inter-convertibility between the two models’ outputs.

For our problem, we can think of some intuitive methods:

  • From the regressor’s output to the classifier’s: Map the predicted score to the closest integer (as we have done so far).
  • From the classifier’s output to the regressor’s: Assume each class is associated with a probability computed by applying softmax to the output layer.

In the second approach, we can use either of two strategies:

  • Use the argmax strategy: Use the class with the highest probability as the regression score.

Regression score = 3.00

  • Use the weighted-sum strategy: deduce the score as the weighted sum of these values:

Regression score = 0.02 * 0 + 0.42 * 1 + 0.08 * 2 + 0.44 * 3 + 0.04 * 4 = 2.06

Note that the weighted-sum strategy is only applicable when there is an order notion between the classes.
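Both strategies can be reproduced in a few lines of NumPy, using the class probabilities from the formula above. The function names are our own.

```python
import numpy as np


def argmax_score(probabilities):
    """Regression score = index of the most probable class."""
    return float(np.argmax(probabilities))


def weighted_sum_score(probabilities):
    """Regression score = expected class index under the predicted distribution."""
    classes = np.arange(len(probabilities))
    return float(np.dot(probabilities, classes))


probabilities = np.array([0.02, 0.42, 0.08, 0.44, 0.04])  # classes 0 to 4
print(f"{argmax_score(probabilities):.2f}")        # 3.00
print(f"{weighted_sum_score(probabilities):.2f}")  # 2.06
```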

More strategies for converting classifiers’ output into regressors’ are presented in [6].

Example of Behaviour of a Regressor and a Classifier on Single Case

A regressor and a classifier may behave differently in case of confusion. Let’s reuse the previous example, where the classifier outputs the probabilities 0.02, 0.42, 0.08, 0.44 and 0.04 for classes 0 to 4 (classes 1 and 3 have the highest probabilities).

This phenomenon happens, for example, when the model has seen two examples like these during training:

{"text": "J'étais admis. Vous êtes content ?", "score": 3},

{"text": "J'étais viré. Vous êtes content ?", "score": 1},

and then tries to predict a new case:

{"text": "Je suis là. Vous êtes content ?"}

In this case, the regressor typically tries to keep a reasonable distance from the two known examples by moving the final output to something close to 2 (the neutral score). In contrast, the classifier tries to distribute balanced probabilities between classes 1 (negative) and 3 (positive) but does not really pay attention to class 2. If we use the argmax strategy, as usual, there is a risk of misclassifying the example (unless we define a score threshold such as 0.5 and exclude both classes 1 and 3 because their probabilities are below this threshold).

The regressor’s behaviour seems “safer” for avoiding positive-negative misclassification, but it may make the model less informative, as it avoids committing to a sign (positive or negative) when it is sufficiently confused.
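The threshold idea mentioned in parentheses can be sketched as follows. The function name classify_with_threshold and the second probability vector are assumptions for illustration.

```python
def classify_with_threshold(probabilities, threshold=0.5):
    """Return the most probable class, or None if it is below the threshold."""
    best_class = max(range(len(probabilities)), key=lambda c: probabilities[c])
    return best_class if probabilities[best_class] >= threshold else None


confused = [0.02, 0.42, 0.08, 0.44, 0.04]   # the classifier output discussed here
confident = [0.01, 0.03, 0.06, 0.80, 0.10]  # hypothetical values

print(classify_with_threshold(confused))    # None
print(classify_with_threshold(confident))   # 3
```

Returning None lets the caller decide what to do with an undecided case, for instance sending it for manual review.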

When should we try a Regressor or a Classifier?

In summary,

  • Don’t use a regressor if we cannot define an order between the classes.
  • Don’t use a classifier if we want to predict a continuous variable (and we cannot discretise it as in the case of house prices).

We can use both models when we want to predict a discrete numerical variable, or categories that can be sorted in an order.

We may also prefer the ordinal regression approach if the classes are not clearly distinct. For instance, sometimes we may face an example where it’s difficult to decide whether it should be scored 2 or 3. For the regression approach, it is OK to annotate it 2.5 or 2.8, while for classification approaches it is more arguable how to handle this problem.

In the experiments in our client’s projects so far, the two modelling approaches based on the same backbone language models gave us very similar results, although we are not sure whether this will still hold for future problems. With this tutorial, we therefore presented a way to perform ordinal regression tasks with BERT and transformers, to help our colleagues and our readers when they need to solve the same kind of task.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[3] Models, Hugging Face.

[4] Fine-tuning BERT for a regression task: is a description enough to predict a property’s list price?

[6] Raied Salman, Vojislav Kecman (2012). Regression as classification. Conference Proceedings, IEEE SOUTHEASTCON, 1–6. doi:10.1109/SECon.2012.6196887.

[7] Models: FlauBERT, CamemBERT.

[8] Dive into Deep Learning, 0.17.4 documentation (Chapter 3, Chapter 4, Chapter 10).


Thanks to our colleagues Caroline DUPRE and Achille MURANGIRA for the article review.


Nhut DOAN NGUYEN has been a data scientist at La Javaness since March 2021.


