Annex B. Performance of the SDG identification algorithm

Several parameters govern the estimation of the boosted tree algorithm. The most important is beta, which parametrises the relative importance, in the estimation process, of false negatives compared to false positives.

Almost 80% of the training set is built such that the descriptions of firms’ actions are considered as addressing only one Sustainable Development Goal (SDG), whereas in practice some of the SDGs are highly collinear (Pradhan et al., 2017[1]). As a result, the algorithm tends to be too conservative and is likely to generate too many zeros. To offset this bias, a higher weight is assigned to false negatives (i.e. the loss incurred with a false negative is considered costlier than the loss incurred with a false positive).
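
The weighting described above can be illustrated with a weighted log loss. This is only a minimal sketch: the report does not give the exact form of the loss, so the function name `weighted_log_loss`, the choice of log loss and the toy predictions are assumptions made for illustration.

```python
import math

def weighted_log_loss(y_true, p_pred, beta=1.0):
    """Mean binary log loss in which errors on positive examples weigh beta times more.

    Assumed form for illustration; the report does not specify its loss function.
    """
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -beta * math.log(p)   # missing a true SDG (towards a false negative)
        else:
            total += -math.log(1.0 - p)    # flagging an absent SDG (towards a false positive)
    return total / len(y_true)

# Toy predictions (illustrative only): one missed positive, one wrongly flagged negative.
y_true = [1, 0]
p_pred = [0.2, 0.8]
loss_b1 = weighted_log_loss(y_true, p_pred, beta=1.0)
loss_b12 = weighted_log_loss(y_true, p_pred, beta=1.2)
```

With beta above 1, only the term for the missed positive grows, so the estimation is pushed towards predicting fewer zeros.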

Choosing a value for beta boils down to picking a point on the receiver operating characteristic (ROC) curve. With a beta equal to 1, total accuracy is maximised, but moving slightly to the right along the ROC curve is more appropriate for the problem at hand: it reduces false negatives while having only a minimal impact on false positives. The final beta value is 1.2 (see Note 1). With a higher beta, the result would shift to the right on the curve, and with a lower beta to the left (Figure B.1).
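
The link between beta and a point on the ROC curve can be sketched as follows. The report applies beta during estimation; the sketch instead uses the equivalent decision rule for calibrated probabilities, where a false negative costs beta and a false positive costs 1, so a positive is predicted when p >= 1 / (1 + beta). The function names and the toy scores are assumptions, not the report's data.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at a given threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return fp / neg, tp / pos

def cost_threshold(beta):
    """Cut-off minimising expected cost when a false negative costs beta
    and a false positive costs 1 (assumes calibrated probabilities)."""
    return 1.0 / (1.0 + beta)

# Toy labels and scores (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.48, 0.3, 0.6, 0.47, 0.2, 0.1]

point_b1 = roc_point(y_true, scores, cost_threshold(1.0))    # (0.25, 0.5)
point_b12 = roc_point(y_true, scores, cost_threshold(1.2))   # (0.5, 0.75)
```

Raising beta lowers the cut-off, which moves the operating point up and to the right: more true positives are recovered at the price of some additional false positives.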

Other important parameters in the algorithm are chosen through parameter tuning using the validation set. The learning rate, which is the step size shrinkage applied at each update, is set to 0.3. The maximum depth of a tree (the higher this value, the more complex the model) is set to 12. Finally, L1 regularisation is used on the parameters to make the model less prone to overfitting.
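
For readers who want to reproduce a comparable set-up, the settings above could be collected as in the sketch below. The report does not name the software used, so the XGBoost-style parameter names (`eta`, `max_depth`, `alpha`) are an assumption. The strength of the L1 penalty is not reported, so it is left as an argument rather than guessed.

```python
def boosted_tree_params(l1_penalty):
    """Collect the settings quoted in the text under XGBoost-style names.

    The names are an assumption; the L1 strength is not reported in the text,
    so the caller must supply it.
    """
    if l1_penalty < 0:
        raise ValueError("l1_penalty must be non-negative")
    return {
        "eta": 0.3,           # learning rate (step size shrinkage)
        "max_depth": 12,      # maximum depth of each tree
        "alpha": l1_penalty,  # L1 regularisation strength (hypothetical value)
    }

def shrunken_update(current_score, tree_output, eta=0.3):
    """One boosting step: add the new tree's output, scaled by the learning rate."""
    return current_score + eta * tree_output

params = boosted_tree_params(l1_penalty=1.0)  # 1.0 is a placeholder, not the report's value
```

The helper `shrunken_update` shows what the learning rate does: each new tree moves the running score by only 30% of its raw output, which slows fitting and helps limit overfitting.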

To increase the size of the training set despite the limited availability of descriptions of corporate sustainability actions, data augmentation is implemented. New observations are created by inserting new words inside the original text, taking into consideration the surrounding terms. This technique is used to reduce the generalisation error (i.e. the error computed on new and unseen observations).
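
A heavily simplified sketch of insertion-based augmentation is given below. The report does not describe how candidate words are chosen, so the lookup table `suggestions`, which maps a word to terms that may be inserted just before it, is a hypothetical stand-in for a context-aware model, and the example sentence is invented.

```python
import random

def insert_context_words(text, suggestions, n_insert=1, seed=0):
    """Return a copy of `text` with up to `n_insert` words inserted.

    A word is inserted immediately before a term found in `suggestions`,
    a hypothetical table standing in for a context-aware language model.
    """
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_insert):
        # Positions whose word has at least one suggested companion term.
        candidates = [i for i, w in enumerate(words) if w.lower() in suggestions]
        if not candidates:
            break
        i = rng.choice(candidates)
        new_word = rng.choice(suggestions[words[i].lower()])
        words.insert(i, new_word)
    return " ".join(words)

# Invented example; neither the sentence nor the table comes from the report.
original = "The firm reduces water use in factories"
suggestions = {"water": ["fresh", "industrial"], "factories": ["its", "local"]}
augmented = insert_context_words(original, suggestions, n_insert=1, seed=0)
```

The augmented sentence keeps every original word in order and the same label, so each original description can yield several slightly different training observations.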

Table A B.1 summarises performance, measured as precision and recall on the positive class. It shows that the algorithm reaches a high level of performance for most of the classes. Performance is below average for classes that represent more general objectives (such as SDG8-Decent Work and Economic Growth, SDG9-Industry, Innovation and Infrastructure and SDG10-Reduced Inequality). Precision and recall are measured on a validation set (10% of the observations are used for validation, but not for the estimation of the model parameters).
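
As a reminder of how the two metrics are computed for one SDG class, the sketch below uses the standard definitions. The labels are toy values, not results from the report.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall on the positive class for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of true positives that are found
    return precision, recall

# Toy validation labels for a single SDG (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)  # both equal 2/3 here
```

Because each SDG is treated as its own positive class, the computation is repeated once per goal.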

To verify the relevance of the information derived from the training set, an experiment was conducted using the training set of Pincet, Okabe and Pawelczyk (2019[2]), composed of international aid projects. Adding these 22 000 examples to our original training set does not improve the performance of the algorithm. Interestingly, however, performance improves marginally when only part of the alternative set, around 40% to 60% of the examples, is added rather than the whole set. This can be explained by the fact that when the whole alternative set is added, the more meaningful information from business action examples gets “diluted”. This experiment allows two conclusions to be drawn:

  • The training set used in this report conveys a significant amount of information, despite a limited size, and adding further examples will not provide first-order improvements in the performance of the algorithm.

  • The training set composed of business actions conveys different information from the training set of Pincet, Okabe and Pawelczyk (2019[2]). Gathering examples from business actions is necessary to build an algorithm able to identify the SDGs in texts describing private-sector initiatives.
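
The design of the experiment above, adding only a share of the alternative examples, can be sketched as follows. The function name, the random sampling scheme and the placeholder data are assumptions; the report does not say how the subset was drawn.

```python
import random

def add_fraction(core, extra, fraction, seed=0):
    """Return the core training set plus a random share of an auxiliary set."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    rng = random.Random(seed)
    k = round(fraction * len(extra))
    return list(core) + rng.sample(list(extra), k)

# Placeholder examples (illustrative only).
core = [("business action", i) for i in range(4)]
extra = [("aid project", i) for i in range(10)]

mixed_half = add_fraction(core, extra, 0.5)  # 4 core + 5 auxiliary examples
mixed_all = add_fraction(core, extra, 1.0)   # 4 core + 10 auxiliary examples
```

Retraining on sets built with different values of `fraction` and comparing validation scores would reproduce the shape of the experiment, though not its results.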

SDG-BAI outperforms two alternative approaches that have been tested:

  • Fixed vocabulary. The appearance of a predetermined set of words is used to identify the SDGs. Performance is expected to be satisfactory when the classification problem can be solved with simple lexical rules. In the case at hand, SDG-BAI displays a superior performance compared to a fixed vocabulary approach, using the vocabulary developed by Siris Academic.

  • Pre-trained neural network tailored to the dataset of corporate SDG actions. Universal Language Model Fine-tuning (ULMFiT) is a transfer learning method. The weights of the model are not computed from the training set alone: the properties of the language have already been captured during pre-training (on external information). Transfer learning is especially effective when the training set is small and information from outside sources has to be brought in. However, deep learning methods add parameters and complexity to the task, and this is likely to be the reason for the inferior performance with respect to SDG-BAI.
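
The fixed vocabulary baseline listed above can be sketched in a few lines. The keyword lists below are invented for illustration and are not the Siris Academic vocabulary. The function name is likewise hypothetical.

```python
import re

def fixed_vocabulary_labels(text, vocabulary):
    """Return the SDGs whose keyword list shares at least one word with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(sdg for sdg, keywords in vocabulary.items() if tokens & set(keywords))

# Invented keyword lists (illustrative only).
vocabulary = {
    "SDG6": {"water", "sanitation"},
    "SDG7": {"energy", "solar"},
    "SDG13": {"climate", "emissions"},
}

labels = fixed_vocabulary_labels(
    "The company installed solar panels to cut emissions.", vocabulary
)
no_match = fixed_vocabulary_labels(
    "The firm trains young people for skilled jobs.", vocabulary
)
```

The second call returns an empty list: a relevant action described in words outside the list is missed, which is one plausible reason why a learned model can do better than lexical rules.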

The model is used to link the SDGs and the International Standard Industrial Classification of All Economic Activities, Revision 4 (ISIC Rev.4). Figure A B.2 displays the mapping between the SDGs and the four-digit sectors of ISIC Rev.4, grouped by mega sector, and after incorporating external information (see section “Incorporating external information to supplement the output of the algorithm” and Note 2).

References

[2] Pincet, A., S. Okabe and M. Pawelczyk (2019), “Linking Aid to the Sustainable Development Goals – a machine learning approach”, OECD Development Co-operation Working Papers, No. 52, OECD Publishing, Paris, https://dx.doi.org/10.1787/4bdaeb8c-en.

[1] Pradhan, P. et al. (2017), “A systematic study of Sustainable Development Goal (SDG) interactions”, Earth’s Future, Vol. 5/11, pp. 1169-1179, https://doi.org/10.1002/2017ef000632.

Notes

1. Beta is set to 1 for SDG 3 because, based on out-of-sample predictions, the algorithm produced too many false positives with beta = 1.2.

2. In contrast, Figure 3.1 to Figure 3.4 present the outcome of SDG-BAI, before the incorporation of external information.

© OECD 2021
