3. Looking ahead: A roadmap of datasets to enhance the fraud risk model of Spain’s Comptroller General

This chapter offers a roadmap for complementing existing grants data of the General Comptroller of the State Administration (Intervención General de la Administración del Estado, IGAE) in order to improve risk assessment models. By implication, it outlines priority datasets which can be linked to existing IGAE grants data, enhancing analytical sophistication and improving the precision of risk assessment. As discussed in Chapter 2, machine learning models are limited by the scope and type of data included in the training sample. The model cannot precisely estimate risk probabilities based on incomplete information, because key drivers and mechanisms determining risks remain unaccounted for. Hence, the more comprehensive the initial dataset is, the more precise and accurate risk calculations become.

As the universe of potentially relevant datasets is vast, it is imperative to narrow down the list of datasets to the most relevant ones before investing considerable resources into data mapping, processing, linking and eventually incorporating into the predictive models. Three factors should be considered when selecting suitable datasets: accessibility, relevance, and quality. Accessibility in this context encompasses the ease with which the dataset can be gathered from its original source, which can include questions such as whether the dataset is publicly downloadable or it has to be requested. The format in which the data are available is also crucial, such as a single downloadable dataset or a series of HTML pages. Relevance refers to the potential of the data fields to improve analytical sophistication and precision. This has to be assessed before actually collecting the data. The ultimate test of this initial assessment is whether the data would improve the predictive accuracy of the model. When too many redundant variables are included, the final model may suffer from overfitting. Data quality in this context captures the rate of non-missing values and the reliability of information. Low quality data with many missing values or inaccurately collected data are likely to bias the results. This chapter will only cover the datasets that are considered to be readily available to the IGAE, relevant for the said risk model and of sufficiently high quality.

The two previous chapters outlined the process by which machine learning can be deployed to enhance the IGAE’s approach to identifying risks in grants and subsidies provision. The process of drawing on external datasets in addition to the existing internal data follows the same logic. First, background and risk indicators should be defined for each dataset to identify factors that potentially influence fraud risks. The next step is to link datasets to the existing internal dataset. In order to do so, a few things should be taken into consideration: the unit of analysis in each dataset, variable relevance, the missing rate and the variance. As discussed in Chapter 2, the missing rate should be lower than 50%, with variance of at least 35%. Moreover, before new data are merged, they should be aligned to the same unit of analysis and carry unique IDs, so that matching does not create duplicate rows. Variables that do not contain useful information (i.e. cannot be used as indicators) should be dropped.
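The screening step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IGAE's pipeline: the field names (`nif`, `founded`, `sector`) and the helper functions are hypothetical, and only the 50% missing-rate threshold is taken from the text. The variance criterion is left out because its exact definition depends on Chapter 2.

```python
from collections import Counter

def missing_rate(values):
    """Share of entries that are None or an empty string."""
    if not values:
        return 1.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)

def duplicated_ids(rows, id_field):
    """Return the IDs that occur on more than one row."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)

def screen_columns(rows, id_field, max_missing=0.5):
    """Keep candidate columns whose missing rate is below max_missing."""
    columns = [c for c in rows[0] if c != id_field]
    kept = []
    for col in columns:
        if missing_rate([row.get(col) for row in rows]) < max_missing:
            kept.append(col)
    return kept

# Hypothetical external dataset: intended to hold one row per organisation.
external = [
    {"nif": "A0001", "founded": "2015", "sector": None},
    {"nif": "A0002", "founded": "2019", "sector": None},
    {"nif": "A0002", "founded": None, "sector": "health"},
]

usable = screen_columns(external, "nif")    # columns passing the threshold
repeated = duplicated_ids(external, "nif")  # IDs to resolve before merging
```

In this toy input, `sector` is missing on two of three rows and would be dropped, while the repeated NIF would have to be resolved before the data are linked.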

For example, to add external datasets to the existing National Subsidies Database (Base de Datos Nacional de Subvenciones, BDNS), the external datasets need identifiers that match those in the BDNS data. Such IDs include identifiers of grants, the Tax Identification Number (NIF) of beneficiaries and grantor names, such as municipality names. This implies some limitations. For example, it is currently impossible to match third parties by their names; they can be matched only by NIFs. Additionally, matching by municipality will lead to a significant data loss, because aligning data to the same unit of analysis with unique IDs means that risk scores should be aggregated by municipality. Similar logic applies to matching by grantors’ names and beneficiaries’ NIF, as there are many identical values throughout the BDNS data (i.e. the same beneficiary might receive multiple grants or subsidies).
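The unit-of-analysis issue can be illustrated with a short Python sketch. It assumes hypothetical grant rows with `grant_id`, `nif` and `amount` fields and an external table that already has one row per NIF. The first function collapses grants to the beneficiary level, which loses grant-level detail. The second keeps every grant row and attaches the beneficiary-level attributes to each of them.

```python
from collections import defaultdict

def aggregate_by_nif(grants):
    """Collapse grant-level rows to one row per beneficiary NIF."""
    totals = defaultdict(lambda: {"n_grants": 0, "total_amount": 0.0})
    for g in grants:
        entry = totals[g["nif"]]
        entry["n_grants"] += 1
        entry["total_amount"] += g["amount"]
    return dict(totals)

def attach_external(grants, external_by_nif):
    """Left join: keep every grant row, add external fields where the NIF matches."""
    merged = []
    for g in grants:
        extra = external_by_nif.get(g["nif"], {})
        merged.append({**g, **extra})
    return merged

# Hypothetical inputs for illustration only.
grants = [
    {"grant_id": 1, "nif": "B1", "amount": 1000.0},
    {"grant_id": 2, "nif": "B1", "amount": 500.0},
    {"grant_id": 3, "nif": "C2", "amount": 200.0},
]
external_by_nif = {"B1": {"legal_form": "SL"}}

per_beneficiary = aggregate_by_nif(grants)
per_grant = attach_external(grants, external_by_nif)
```

Because the external table is keyed by a unique NIF, the left join cannot multiply grant rows. Beneficiaries without a match simply keep their original fields.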

There are a few sources—some more reliable than others—that can be potentially used for adding data to the existing BDNS dataset. First, there are official sources such as the National Register of Associations (el Registro Nacional de Asociaciones) of the Ministry of the Interior, which lists accredited non-governmental organisations (NGOs), the tax database of the State Tax Administration Agency (Agencia Estatal de Administración Tributaria, AEAT) and the Spanish Association of Foundations (La Asociación Española de Fundaciones), which lists accredited foundations. Some of the data are publicly accessible, whereas others are restricted only to authorised agencies.

Beneficial ownership (BO) registries and public procurement data can also be considered as trusted official sources. The advantage of working with official data directly obtained from data holders is that there is no need to verify the information provided, beyond the standard data quality checks used as part of the outlined data pipeline. Official aid data from the European Union is another example of trustworthy data.

The next group of sources are independent NGOs and associations. This information is less reliable, since the process of data collection and verification is unclear. While official sources most likely include primary data and information, secondary sources are either parsed from different sources or collected manually, often without transparency concerning how the dataset is constructed. Therefore, these datasets should be used with greater care and their validity checked more thoroughly. In Spain, examples of such sources include independent NGO evaluators as well as FICESA, a database of Spanish senior positions and secretariats.

There are four major groups of data that are relevant for matching with the main BDNS database in order to enhance the IGAE’s fraud risk assessments. Each group can provide insights on distinctly different dimensions and determinants of fraud risks. Some data creates opportunities for alternative methods of analysis, such as network analysis, revealing connections between private companies and politically exposed persons, as well as beneficial owners and associated companies. Bringing all of these datasets together offers the possibility of the most comprehensive risk assessment; however, matching only some, or even just one additional dataset, can be very useful for enhancing the IGAE’s risk model, including the following groups of data:

i. Organisational data on the parties of the granting process. This group covers data on grantors and grantees, as well as third parties (i.e. project implementers). Potential sources of information for this group are:

  • Company registry and financial information: provides information on the organisational structure and history of the company (e.g. when it was founded) and also uncovers the financial situation such as profitability of the organisation.

  • Organisational data on accredited NGOs, foundations, associations: provides information on the registry features, reliability of the organisation, and financial records.

ii. Data on personal connections and conflicts of interest. This group can be helpful in identifying connections between officials in private organisations applying for grants and political officeholders overseeing grant giving. Connecting public and private office holders can be useful for further investigating possible conflicts of interest. Potential sources of information for this group are:

  • The BO registry: can help with identifying beneficial owners, associated companies and their records.

  • Politically exposed persons: helps in revealing people who are entrusted with power and are more susceptible to being involved in bribery or other corrupt practices.

  • Data on senior positions and secretariats: provides names of people potentially connected to private companies through legal or beneficial ownership.

iii. Data on organisational reliability and violation of rules. This group can aid in predicting fraud risks by offering insights on relevant, but only indirectly related violations, such as tax payment irregularities. This group can also provide information on softer measures of reliability, such as civil society accreditation. Potential sources of information are:

  • Data on bankruptcy or tax payments: shows the reliability of an organisation based on past financial records.

  • Accreditations of NGOs: identifies accredited NGOs or other associations as more reliable ones.

iv. Data on other funds and contracts. Information on other funding sources and public contracts can reveal additional factors that influence the likelihood of fraud, such as double funding for the same activity. Moreover, corruption risks in public procurement or other funding processes can point to systematic, organisation-level weaknesses and the propensity to commit fraud. The relevant datasets in this group include:

  • EU Funds: list of beneficiaries of EU aid can show if the organisation received double funding from different sources for the same project.

  • Public procurement: corruption risks in public contracts received by organisations or provided by the same grantor can influence the possibility of wrongdoing in grants and subsidies.

Table 3.1 presents the most promising datasets in Spain which are either publicly accessible or their content and specifications are in the public domain. For each dataset belonging to one of the 4 dataset groups, the table contains information on the unit of measurement (i.e. what does a single row refer to), number of observations where available, ID for matching to the BDNS,1 and the priority for the IGAE’s follow-up work. The table highlights the top priority datasets on the top, considering the three main dimensions of data assessment discussed above: accessibility, relevance, and quality. Only datasets that scored high on all 3 dimensions—readily accessible bulk data download, highly relevant data scope and content, and adequate quality—were considered as high priorities for the IGAE.

Conversely, some datasets that scored high on only one or two dimensions were rated as medium or low priority. For instance, when data accessibility was limited, the priority was set to medium even for data that were otherwise seen to be highly relevant or of adequate quality. Ranking datasets in terms of overall priority sets the detailed roadmap for extending and enriching the current IGAE dataset and the risk model described in Chapter 2. The next sections discuss each of these datasets in detail, along with some fraud risk indicators, which can be calculated when data are matched.

Organisational data for the parties involved in grant making include the grantors, grantees and third parties (i.e. project implementers). Matching data on organisations allows for gaining a more complete and detailed picture of organisational controls of fraud risks. It helps to identify additional organisational characteristics that might influence the probability of sanctions. For example, accounting information, size of the company and associated companies can all be useful characteristics for identifying fraud risks and improving the IGAE’s risk model in the future. This group includes the following databases: the National Company Registry (Registradores de España), data from the Spanish Association of Foundations (la Asociación Española de Fundaciones, AEF), and the National Register of Associations (el Registro Nacional de Asociaciones) of the Ministry of Interior.

One of the most relevant datasets for the IGAE’s purpose and for enhancing the risk model is the National Company Register. It contains data on companies' details, capital, representatives (e.g. directors and attorneys), registered acts and filing of annual accounts (i.e. financial performance). The list of variables is presented in Table 3.2.2

The National Company Register can be matched to the main BDNS dataset by the company’s NIF number, or if that is erroneous, by the name of the organisation. Almost all data fields contained in the company dataset are relevant for the IGAE in terms of enhancing its risk model. These fields range from essential registry information, such as date of incorporation or location of headquarters, to balance sheets and income statements. Similarly, recent changes in equity and the full list of members of the company can provide additional insights on potential conflicts of interest when matched with other datasets.

With regard to essential registry information, some red flags have proven to be useful for predicting corruption and fraud risks. For example, companies which have been set up, or whose registration data has been modified, shortly before applying for a grant pose higher risks. Similarly, companies registered at so-called “company graveyard” addresses, where a very large number of companies are registered with high degrees of fluctuation (e.g. thousands of companies created and closed at the same address each month), can be high risk. In addition, as discussed in Chapter 2, the type of organisation (i.e. the company legal status), as well as its overall income and size, can influence the level of fraud risks. For example, due to legislation, certain types of organisations can be less transparent or more loosely regulated (e.g. trusts or company ownership represented by bearer shares).

Regarding company financial data, the IGAE could consider a number of relevant indicators for risk prediction. First, the ratio between a company's expenditures and income can provide information as to whether the company is profitable. Companies that are not profitable are riskier beneficiaries of grants and subsidies, since they may use funds to repay their debts as opposed to financing their projects. Similarly, an unfavourable ratio between a company's liabilities and assets (i.e. liabilities exceeding assets) suggests greater risk in terms of the appropriate use of grants. Frequent changes in equity might be a signal of internal conflicts and instability within the company, increasing the level of risks associated with grants and subsidies for such organisations. A systematic decrease in cash flows reflects stagnation or reduction in the company's activities, which also brings its reliability into question. Combining the grants data with company financial data can also reveal the relative size of the grant compared to the company, with small companies receiving large grants potentially being risky.
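Some of these financial indicators reduce to simple comparisons once the accounts are matched to a grant. The Python sketch below is illustrative only: the function name, the input fields and the `max_grant_share` cut-off are assumptions introduced here, not values from the report or from the IGAE's model.

```python
def financial_flags(income, expenditure, assets, liabilities, grant_amount,
                    max_grant_share=1.0):
    """Return simple red flags from one year of company accounts.

    max_grant_share is an illustrative cut-off, not a value from the report.
    """
    flags = {
        # Expenditure above income means the company ran a loss.
        "unprofitable": expenditure > income,
        # Liabilities above assets means negative net worth.
        "liabilities_exceed_assets": liabilities > assets,
    }
    # Grant size relative to annual income; undefined when income is zero.
    grant_share = grant_amount / income if income > 0 else None
    flags["grant_to_income"] = grant_share
    # A company with no income receiving a grant is treated as "large for size".
    flags["large_grant_for_size"] = (
        grant_share is None or grant_share > max_grant_share
    )
    return flags

# Hypothetical company: loss-making, over-indebted, grant 2.5 times its income.
example = financial_flags(
    income=100_000, expenditure=120_000,
    assets=50_000, liabilities=80_000,
    grant_amount=250_000,
)
```

Indicators that need several years of data, such as frequent changes in equity or a sustained fall in cash flows, would require a time series per company and are not covered by this sketch.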

Another organisational dataset that the IGAE could consider for its risk model, although a low priority, is the National Register of Associations (el Registro Nacional de Asociaciones), held by the Ministry of Interior. This is a list of organisations that have passed a review made by the Spanish Agency for International Development Cooperation (Agencia Española de Cooperación Internacional para el Desarrollo, AECID) in which more than 70 qualitative and quantitative criteria were used, mostly related to experience, financial solvency, transparency and human resources. The main limitation of this dataset is the small number of accredited NGOs it covers, as it has only 44 observations. The data are stored in HTML format and can be easily transformed to Excel or any other data format. The list of variables is described in Table 3.3.

The dataset provides two potential IDs for matching: the name of the organisation and its Tax Identification Code (Código de Identificación Fiscal, CIF). Both can be used to link the data to the IGAE’s grant data. The data consists of three variables, two of which are IDs and one of which specifies the exact sectors in which the NGO is qualified to operate. Based on this information, two binary variables can be created: 1) whether the NGO has been reviewed, and 2) whether the NGO is acting in the same area as it was qualified for (e.g. the NGO was qualified for the health sector, but receives grants for the education sector). Due to the low number of observations, significant changes in predicted risk scores are unlikely. However, if the main BDNS dataset is filtered for NGOs only, this information might influence the outcomes for this sector.
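The two binary variables can be derived with a simple lookup. The sketch below assumes the accreditation list has been loaded into a dictionary keyed by NIF/CIF and that each grant carries a sector label comparable to the accreditation sectors. Both the data layout and the field names are hypothetical.

```python
def accreditation_features(grant, accredited):
    """Build two 0/1 features for one grant.

    grant: dict with the beneficiary "nif" and the grant "sector".
    accredited: dict mapping NIF/CIF -> set of sectors the NGO is qualified for.
    """
    sectors = accredited.get(grant["nif"])
    reviewed = int(sectors is not None)
    same_sector = int(reviewed == 1 and grant["sector"] in sectors)
    return {"ngo_reviewed": reviewed, "grant_in_qualified_sector": same_sector}

# Hypothetical accreditation list with a single NGO qualified for health.
accredited = {"G11111111": {"health"}}

features = accreditation_features(
    {"nif": "G11111111", "sector": "education"}, accredited
)
```

In practice the sector labels in the two sources would probably need to be harmonised before the second variable is meaningful.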

The third dataset worth considering is that of the Loyalty Foundation (Fundación Lealtad). This is an independent NGO evaluator, which analyses the management, governance, use of funds, economic situation, volunteering and transparency of NGOs. On the foundation’s website, there is a downloadable PDF file with the list of all positively evaluated NGOs. However, this list has limited information beyond the names of organisations. Therefore, a more effective approach would be to access the HTML pages of each organisation and parse the data manually. It is also possible to parse information from standardised PDFs called “full reports” for each NGO. The list of variables is described in Table 3.4.

The main IDs by which organisations can be linked to the IGAE’s datasets are the name of the organisation and the NIF. While the name is available in both HTML and PDF files, the NIF is stored only in the full report PDF. Data on income, expenses, sector of activities, year of origin, as well as number of beneficiaries, partners and employees can add to the background information for the analysis. As before, a binary variable can be created reflecting whether the given organisation is verified by the Fundación Lealtad. Besides the general background information, some additional indicators can be extracted from this dataset. For instance, the ratio of expenses should be taken into consideration to assess how much is spent on administration of the NGO in comparison to its mission. High spending on administration might be a signal for higher risk scores, although on its own it would not be an indicator of fraud or wrongdoing. Data on administrative bodies, when linked to other datasets (e.g. politically exposed persons), can provide information on potential conflicts of interest.

The second group of datasets that could enhance the IGAE’s risk model, described in Chapter 2, is data on personal connections and conflicts of interest. Matching data on personal connections between the public and private sectors opens up the possibility for tracking conflicts of interest. Such data can be analysed with the use of network analysis to identify if there are connections between politically exposed persons and owners of the companies receiving grants and subsidies. Some potential sources were already discussed in the previous group. The next sections will focus on the Beneficial Ownership Registry and FICESA, the database of Spanish senior positions and secretariats.

The BO registry provides information for over 5 000 000 organisations registered since 2009. The short list of variables is provided in Table 3.2. There is no complete dataset in the public domain, but the source, an online platform called LibreBOR for consulting and analysing the Official Gazette of the Mercantile Registry (Boletín Oficial del Registro Mercantil), provides an API and a Python script to parse the data.3 It is possible to select only those organisations that appear in the IGAE datasets, without parsing the whole dataset, which would shorten processing time.

There are two ways for the IGAE to match the BDNS datasets to the BO registry: 1) by name of the organisation, or 2) by NIF of the beneficiary. Alternatively, it is possible to aggregate data per province and match aggregate numbers (e.g. average company size) by particular location. The BO dataset contains a lot of background information for organisations, but the most relevant fields are management positions, associated organisations, and the final beneficial owners. The ownership data is best used when matched against other datasets, in particular, lists of political office holders (see next section).

In addition, the IGAE can use some of the background information as risk predictors on their own. When the names of beneficial owners of grant recipients are matched against public office holders, it is possible to identify either direct conflicts of interest (i.e. when the official works for the granting body itself) or indirect forms of potential conflict (i.e. when the related political office holder works in a higher level or supervisory body to the granting organisation). When looking at the ownership data on its own, the information on companies associated with the grantee can reveal risks if further matched to other datasets (e.g. complex forms of conflicts of interest and related risk factors).4

The next source is a database of Spanish senior positions and secretariats called FICESA. This source contains data related to senior public officials in a wide range of public organisations: state secretariats, undersecretaries, general directorates and sub-directorates, budget offices, official offices, as well as different judicial bodies for state, regional and local levels. There is no data in the public domain, and data must be requested from the data holder by filling out a form. Therefore, the format of the data and the variables the dataset contains are unclear. There was no response to attempts to contact the source. It is assumed that the IGAE would be able to gain access to the full database as a bulk download.

The only identifiers by which this dataset can be linked are names and, if available, additional personal features, such as date of birth. If the BDNS dataset contains data on beneficial owners, as described above, the data on official positions can be linked by persons’ names. Linking the IGAE’s datasets to the information on senior office holders creates the possibility to conduct network analysis and see if there are conflicts of interest between private organisations receiving grants and public bodies giving grants. It is particularly useful to use the BO registry in order to find all the associated organisations, and analyse if they are connected to politically exposed persons. For instance, the organisation receiving the grant may not be connected to anyone from official bodies, but one of its related organisations could be.
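The indirect-connection check described above can be sketched as a one-step traversal from the grantee to its associated organisations, combined with name normalisation. All structures and names below are hypothetical, and the actual format of the BO registry and FICESA data is not known from this chapter. Matching on names alone is error-prone (homonyms, spelling variants), so any hit should be read as a lead for review, ideally confirmed with further attributes such as date of birth.

```python
import unicodedata

def normalise_name(name):
    """Lower-case, strip accents and collapse whitespace for name matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(ascii_only.lower().split())

def conflict_links(grantee, owners_by_org, associated_orgs, officials):
    """List (organisation, owner) pairs whose owner name matches an official.

    Checks the grantee itself and its directly associated organisations.
    """
    official_names = {normalise_name(n) for n in officials}
    hits = []
    for org in [grantee] + sorted(associated_orgs.get(grantee, [])):
        for owner in owners_by_org.get(org, []):
            if normalise_name(owner) in official_names:
                hits.append((org, owner))
    return hits

# Hypothetical data: the grantee has no direct link, but a related company does.
owners_by_org = {"Org A": ["Ana Ruiz"], "Org B": ["José García López"]}
associated_orgs = {"Org A": ["Org B"]}
officials = ["JOSE GARCIA LOPEZ"]

links = conflict_links("Org A", owners_by_org, associated_orgs, officials)
```

A fuller network analysis would follow associations over more than one step and keep track of the path length, since more distant links are weaker signals.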

Datasets with information about organisational reliability and violations of rules or laws form the third group of data that could support the IGAE in strengthening its risk model for assessing grant fraud risks. This group was covered partially above in the section about data on accredited NGOs. In addition, in this group, there are datasets on bankruptcy and taxation. Matching data on organisational reliability and violation of rules illuminates new dimensions of fraud risks relating to other domains. These datasets can help predict fraud risks in grants by exploiting correlations between accredited organisations’ trustworthiness, rule following behaviours (tax debts, bankruptcy, etc.) and fraud in grants. Building on previous discussions, the next section focuses on the Public Bankruptcy Registry, AEAT’s tax data and accounting data from CINCO.net.

The first dataset in this group, identified previously as a medium priority for the IGAE, is the Public Bankruptcy Registry (El Registro Público Concursal). The source includes information on procedural resolutions, bankruptcy and out-of-court settlements. The data can be parsed from HTML after filtering by province or court. Unfortunately, for unknown reasons, filtering does not work on the site properly, leading to page errors. Yet, the approximate list of variables is presented in Table 3.6.

This dataset can be matched to the IGAE’s grants data by either name of the organisation, or NIF/CIF number. The source does not provide an opportunity to look through all the cases, requiring filtering beforehand, so the easiest way to set a filter is to use province. The most relevant information for fraud risk assessments is the detail on bankruptcy. The source provides location, name of organisation, court, judge and NIF/CIF or other identifiers of organisations. Unfortunately, there is no information on the date of bankruptcy proceedings, which would be especially important to analyse past grants and subsidies. After matching, the most relevant risk indicator for the IGAE would be the binary variable (‘flag’) reflecting if the grantee was or is currently in a state of bankruptcy. Such bankruptcy information on an organisation might signal that the awarded grant or subsidy will be misused by the beneficiary, or at the very least, inadequately administered due to other organisational pressures.

The second dataset on rule violations is data from the State Tax Administration Agency (Agencia Estatal de Administración Tributaria, AEAT). This is a dataset with restricted access and only aggregated statistics are available in the public domain. Once again, for the discussion below, an assumption was made that the IGAE can obtain full access to the database in order to incorporate such data into its risk model. According to the notes the AEAT published, it has data in a disaggregated format which can be provided upon request. Aggregated data covers filing of tax returns, payment of taxes, debts and fees, tax certificates, consultation of tax returns, etc.

Due to restricted access to the datasets it is uncertain whether the IDs are the same as in the BDNS dataset, but most likely organisations can be matched either by name or by NIF of the beneficiary. Information on timely payment of taxes, debts and fees is the most relevant for enriching predictive models on fraud risks. Late payment of taxes, as well as presence of debt in a given organisation (or associated ones), could be a signal of higher risks.

The third dataset belonging to this group is accounting and budgeting data from CINCO.net, deemed a high priority for the IGAE and for improvements to its risk model. The data includes expense operations and total expenditure amount in the current year, revenue amount in the current year, cash flows, non-budgetary operations, third-party expenses, general data of third parties, etc. Like the AEAT’s data, this data is not in the public domain; however, the Ministry of Finance and Civil Service (Ministerio de Hacienda y Función Pública) manages CINCO.net and the IGAE has direct access to it.

The organisations in this database can be matched by names or NIF of the beneficiary to the BDNS. Yet, due to restricted access of the data, it is difficult to assess the quality and content of matching variables. Besides general background information on revenues and expenditures, CINCO.net provides data on reimbursement of other grants provided by different organisations in Spain. This can be particularly useful in assessments of potential risks in future subsidies and grants provision, such as double-funding of operations or the large value of grants received compared to revenue.

The final group of datasets encompasses a diverse group of data on public contracts and other grants and funding. Matching data on other funds and contracts would allow the IGAE to cross-reference spending as well as develop additional risk dimensions. For example, it can help identify cross-subsidisation for the same activities, which should be considered a risk factor. Public procurement contracts received by a company can be scored using corruption risk indicators and then related to grants risks. For example, a company or agency (third party, grantor, grantee) participating in high-risk tenders might also be risky when it comes to grants. This group includes datasets from the Spanish Association of Foundations (la Asociación Española de Fundaciones, AEF), European Union Funds, and public procurement data.

AEF’s data provides information on foundations giving grants, including their types of activity, geographical areas, type of beneficiaries, date of constitution and origin of their administrative bodies. The list of the variables is presented in Table 3.7. The data is open access and can be easily downloaded in Excel or PDF format. In total there are 15 840 foundations covered by the directory.

Matching this dataset to the BDNS requires several steps. First, all the observations should be filtered by type of beneficiary, using the online filtering, since the type of beneficiary is not a data field in the downloadable file. Second, the particular location should be matched to the locations of grantors or grantees. This will not provide the exact information as to whether the beneficiary received another grant from a certain foundation, but it indicates the presence of the foundation in the same location with the same types of beneficiaries.

The most relevant information for the IGAE to assess risks would be whether any of the beneficiaries were double granted for the same activities. To precisely track such risks requires checking the exact beneficiaries by their IDs, yet this source does not provide such detailed information. Hence, only aggregate information, which is much more imprecise, can be used from this source. The presence of a foundation supporting similar activities in the same locality (province) as grantor or grantee increases the probability of being double funded.

The next relevant dataset for the IGAE to consider matching to the BDNS data, as a medium priority, is data for EU Funds. The Spanish government and the European Commission provide the data, and they cover records from 2007 to 2020. The data are easily accessible and can be downloaded in Excel format. The list of relevant variables is presented in Table 3.8.

The data provides a VAT number as an ID for organisations, which can be transformed into a NIF number by removing the first two letters. Alternatively, names of organisations can be used for matching. Number of budgetary commitments, subject of grants or contracts, as well as project start and end dates are particularly relevant to identify whether the grantee received funding from the EU for the same project as its Spanish grant. Double funding is a fraudulent practice when the same project is funded more than one time by different donors, without providing information on contributions made. Therefore the project might be implemented, yet the extra public money disbursed is not used as intended.
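The VAT-to-NIF conversion and a first screen for possible double funding can be sketched as follows. Field names (`vat`, `nif`, `start`, `end`) are hypothetical. The sketch assumes Spanish VAT numbers of the form two-letter country prefix plus NIF, as described above, and it only pairs records that share a beneficiary and overlap in time. Whether two records concern the same project would still need a comparison of the grant subject.

```python
from datetime import date

def vat_to_nif(vat_number):
    """Drop the two-letter country prefix from a Spanish VAT number."""
    cleaned = vat_number.strip().upper().replace(" ", "")
    return cleaned[2:] if cleaned[:2].isalpha() else cleaned

def periods_overlap(start_a, end_a, start_b, end_b):
    """True when two closed date intervals share at least one day."""
    return start_a <= end_b and start_b <= end_a

def double_funding_candidates(bdns_grants, eu_projects):
    """Pair BDNS grants with EU projects of the same NIF and overlapping dates."""
    pairs = []
    for g in bdns_grants:
        for p in eu_projects:
            if vat_to_nif(p["vat"]) == g["nif"] and periods_overlap(
                g["start"], g["end"], p["start"], p["end"]
            ):
                pairs.append((g["grant_id"], p["project_id"]))
    return pairs

# Hypothetical records for illustration only.
bdns_grants = [
    {"grant_id": "G1", "nif": "B12345678",
     "start": date(2018, 1, 1), "end": date(2018, 12, 31)},
]
eu_projects = [
    {"project_id": "EU1", "vat": "ESB12345678",
     "start": date(2018, 6, 1), "end": date(2019, 5, 31)},
    {"project_id": "EU2", "vat": "ESB12345678",
     "start": date(2020, 1, 1), "end": date(2020, 12, 31)},
]

candidates = double_funding_candidates(bdns_grants, eu_projects)
```

The nested loop is adequate for a sketch. With full datasets, indexing the EU records by NIF first would avoid comparing every pair.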

The last data source the IGAE could consider matching with its datasets is national public procurement data. The opentender.eu portal contains this data collected from two official government sources (Ministerio de Hacienda y Función Pública and Plataforma de Contratación), as well as Tenders Electronic Daily (TED), a European online public procurement portal. The data contains all the publicly available information on tenders, contracts, bidders, buyers and suppliers necessary for calculating the Corruption Risk Indicator (see Box 3.1). The list of relevant variables is presented in Table 3.9.

Supplier IDs are the same as the grantees’ NIFs, therefore this ID can be used for linking data. Alternatively, names of organisations as well as grantors’ names can be matched to the buyers or suppliers from the procurement dataset. To assess if the procurement contracts won by bidding firms or tenders run by public sector grantors are prone to corruption, information on corruption proxies can be used. Examples include single bidding on competitive markets, procedure type used, publication of the call for tenders, length of bid advertisement and decision period, as well as connections between supplier and procurement authority. Collating corruption risks in the procurement activities of grantees or grantors can shed additional light on grants fraud risks, as it is expected that organisations that are risky in one domain will also be risky in a related domain. This logic of analysis is empirically demonstrated in Box 3.1.
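One way to carry procurement red flags over to grantees is to score each contract and then average by supplier NIF. The sketch below uses a plain mean of the available binary flags, which is an assumption made here for illustration and not necessarily how the Corruption Risk Indicator in Box 3.1 is computed. The flag names are hypothetical.

```python
def contract_risk_score(flags):
    """Average the available 0/1 red flags for one contract; None if none are known."""
    known = [v for v in flags.values() if v is not None]
    return sum(known) / len(known) if known else None

def supplier_risk(contracts):
    """Mean contract risk score per supplier NIF, skipping unscored contracts."""
    scores = {}
    for c in contracts:
        score = contract_risk_score(c["flags"])
        if score is not None:
            scores.setdefault(c["supplier_nif"], []).append(score)
    return {nif: sum(s) / len(s) for nif, s in scores.items()}

# Hypothetical contracts; None marks a flag that could not be computed.
contracts = [
    {"supplier_nif": "B1", "flags": {"single_bid": 1, "no_call_published": 0}},
    {"supplier_nif": "B1", "flags": {"single_bid": 1, "no_call_published": 1}},
    {"supplier_nif": "C2", "flags": {"single_bid": None, "no_call_published": None}},
]

risk_by_supplier = supplier_risk(contracts)
```

The resulting supplier-level score could then be attached to grant records through the NIF, in the same way as other beneficiary-level attributes.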

This chapter offered a detailed account of how and why different datasets can be linked to existing IGAE datasets, with particular attention to promising fraud risk indicators enabled by the new data. These new indicators principally capture actor behaviour rather than simple background characteristics, allowing for a far more precise risk assessment. Moreover, data linking allows not only for calculating new indicators in individual databases and linking them to each other, but also for creating new indicators by drawing on multiple datasets. Such complex indicators offer additional insights on relevant risk dimensions. They also represent a more robust measure of actor behaviour, because multiple sources pointing at the same behaviour carry greater validity than a single dataset.

Drawing on multiple datasets is crucial for comprehensively mapping complex fraud behaviours, as well as for reducing the rate of false positives that are common in simple models (Fazekas, Ugale and Zhao, 2019[2]). Combining multiple indicators stemming from different datasets is considered good practice in risk measurement because it allows for measurement triangulation; in other words, it increases convergent validity. False positives are pervasive in simple risk assessments, as many indicators merely point at potential wrongdoing rather than actual wrongdoing. For example, widely used indicators of conflicts of interest typically indicate the presence of a potential conflict rather than an actual conflict that represents abuse of a situation for undue personal gain. However, when conflict-of-interest information is combined with data on outcomes, such as double funding of grants or anomalous financial performance, the combination of indicators lends greater validity to the measurement approach.
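The triangulation logic can be sketched in a few lines of Python. The example assumes that binary indicators have already been calculated per grantee from the linked datasets; the indicator names and sample values are hypothetical. A grantee is flagged only when an indicator of potential wrongdoing coincides with at least one outcome indicator, which is one simple way of combining indicators and not necessarily the rule a production model would use.

```python
def count_flags(indicators):
    """Number of indicators that are raised for one grantee."""
    return sum(1 for value in indicators.values() if value)


def triangulated_flag(indicators, potential_keys, outcome_keys):
    """True only when a 'potential' indicator and an 'outcome'
    indicator are raised at the same time."""
    has_potential = any(indicators.get(k, False) for k in potential_keys)
    has_outcome = any(indicators.get(k, False) for k in outcome_keys)
    return has_potential and has_outcome


# Hypothetical indicator groups and values, for illustration only.
POTENTIAL = ["conflict_of_interest"]
OUTCOME = ["double_funding", "anomalous_financials"]

grantee_indicators = {
    "B12345678": {"conflict_of_interest": True,
                  "double_funding": False,
                  "anomalous_financials": False},
    "A87654321": {"conflict_of_interest": True,
                  "double_funding": True,
                  "anomalous_financials": False},
    "C11111111": {"conflict_of_interest": False,
                  "double_funding": False,
                  "anomalous_financials": True},
}

results = {
    nif: triangulated_flag(ind, POTENTIAL, OUTCOME)
    for nif, ind in grantee_indicators.items()
}
print(results)
```

In this sample, the first grantee would be flagged by a simple conflict-of-interest check alone, but not by the triangulated rule, which illustrates how combining indicators can reduce false positives.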

Matching datasets representing multiple dimensions of relationships can also power the use of advanced, multi-layer network analytics. Such multi-layered relationships can encompass connections between private companies and public grant-making organisations through a range of contractual relationships, or links between companies’ beneficial owners and politically exposed persons working in public sector bodies. Multiple network connections established through the use of large-scale, linked administrative datasets also allow for tracking temporal changes in connections across potentially risky entities and individuals, thereby increasing the analytical sophistication of risk modelling.
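A minimal Python sketch of a multi-layer network representation is given below. It assumes that each linked dataset contributes one layer of edges between entities; the layer names, entity names and years are hypothetical. The sketch only identifies pairs of entities that are connected in more than one layer, which is a very simple form of multi-layer analysis.

```python
from collections import defaultdict

# Each edge: (layer, entity_a, entity_b, year). Hypothetical sample data.
edges = [
    ("grant", "Ministry A", "Company X", 2019),
    ("procurement", "Ministry A", "Company X", 2020),
    ("ownership", "Person P", "Company X", 2018),
    ("grant", "Ministry B", "Company Y", 2020),
]


def layers_per_pair(edges):
    """Map each unordered pair of entities to the set of layers
    in which they are connected."""
    result = defaultdict(set)
    for layer, a, b, _year in edges:
        result[frozenset((a, b))].add(layer)
    return result


def multi_layer_pairs(edges, min_layers=2):
    """Pairs of entities connected in at least `min_layers` layers."""
    return {
        tuple(sorted(pair)): sorted(layers)
        for pair, layers in layers_per_pair(edges).items()
        if len(layers) >= min_layers
    }


connected = multi_layer_pairs(edges)
print(connected)
```

Because each edge carries a year, the same structure could be filtered by period to follow how connections between entities change over time.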

This chapter has reviewed a wide variety of potentially useful datasets that could complement the existing IGAE dataset. In doing so, it set out a roadmap of data capture and matching that maximises analytical value for the IGAE. Of the reviewed datasets, company information on registration, ownership and financials has the highest potential for further refining the fraud risk assessment model. These datasets can be readily matched to the IGAE’s internal data using company registry IDs. Moreover, matching public procurement data to grants data, as also demonstrated by analysing readily available datasets, can add great value because two sets of risk factors can be triangulated against each other, producing a more reliable risk assessment. Once these high-priority datasets are brought into the IGAE data pipeline, further datasets, such as the bankruptcy register, can also be considered.


[2] Fazekas, M., G. Ugale and A. Zhao (2019), Analytics for Integrity: Data-Driven Approaches for Enhancing Corruption and Fraud Risk Assessments, OECD Publishing, Paris, https://www.oecd.org/gov/ethics/analytics-for-integrity.pdf.

[1] Fazekas, M. and G. Kocsis (2017), “Uncovering High-Level Corruption: Cross-National Objective Corruption Risk Indicators Using Public Procurement Data”, British Journal of Political Science, Vol. 50/1, pp. 155-164, https://doi.org/10.1017/s0007123417000461.


1. In some cases, certain information is presumed to be present in the IGAE’s datasets; however, confirmation of this was not possible because of anonymisation of most of the databases.

2. Access to the dataset is restricted and requires paying a fee for each organisation and obtaining a digital certificate. Free access is only available for data aggregated per sector, year or business sector. The only company-level information available without additional restrictions is company status (i.e. operational or not). For the IGAE to use this data, it would need to gain full access to the complete and current dataset, either by paying the bulk access fee or by setting up a special arrangement with the government data provider. Easy-to-access public alternatives also exist, for example opencorporates.com, a private social enterprise that aims to make company data easily accessible around the world.

3. See https://docs.librebor.me/python/.

4. Because access to the source is restricted, it is not clear whether information on beneficial owners is included. However, this information is present in the company register; it is therefore reasonable to expect that LibreBOR also contains a corresponding variable. If it does not, the information can be obtained from the company register after receiving an electronic certificate.

© OECD 2021

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at http://www.oecd.org/termsandconditions.