YourSRI: Topic of the month January 2020 - ESG ratings: Why can't raters agree?
Sustainable finance, after years of advocacy to become mainstream, is now growing significantly. According to one measurement, at the end of 2018 there were already some 18 trillion US dollars invested according to ESG integration approaches, an increase of 69 % versus the end of 2016 [1].
With this tailwind, rating agencies that assess ESG factors to help investors make informed decisions on sustainable investing are booming, with more than 125 different agencies established worldwide [2]. These raters assess a number of different metrics, each adding its own proprietary magic for how to aggregate, weight, and come up with an overall number or grade. Akin to a credit rating score, this might give the impression of a consensus-drawn evaluation derived from hard facts and defensible figures, but these grades mask layers of subjectivity and hidden biases. In fact, the approaches of ESG raters, and therefore their results, differ widely, as the chart illustrates.
Source: Vontobel Asset Management as of November 15, 2019. Company universe based on rater 1 universe.
Recent academic research performed similar analysis more broadly, finding a correlation coefficient of around 0.49 [3] when comparing the scores of different leading ESG raters. To put this into context, this contrasts with a coefficient of 0.96 [4] (indicating strong agreement) for credit rating agencies, where the industry landscape and approaches are much more consolidated, partly because such ratings have a longer history.
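The average quoted above can be reproduced from the four paper-level mean correlations listed in the sources (0.59, 0.46, 0.61, and 0.30). The sketch below does that arithmetic and also shows one way an average pairwise correlation between raters could be computed. The three rater score lists and the function names are hypothetical and serve only as illustration; they are not data or methods from any rater or paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(scores_by_rater):
    """Average the Pearson correlation over every pair of raters."""
    pairs = list(combinations(scores_by_rater.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Mean correlations reported by the four papers cited in the sources.
paper_means = [0.59, 0.46, 0.61, 0.30]
average_of_means = sum(paper_means) / len(paper_means)
print(round(average_of_means, 2))  # 0.49

# Hypothetical scores (0-100) from three raters for the same five companies.
hypothetical_scores = {
    "rater_a": [72, 55, 40, 88, 61],
    "rater_b": [65, 70, 35, 60, 58],
    "rater_c": [50, 62, 45, 75, 80],
}
print(round(mean_pairwise_correlation(hypothetical_scores), 2))
```

Note that an average of averages weights each paper equally, regardless of how many raters or companies it covered.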
The research confirms that ESG rating agencies agree neither on what constitutes good ESG practice nor on who is good or bad at it. In particular, there was stark disagreement in the tails of the ratings (very good and very bad companies), which is notable as many investors use these results to create best-in-class portfolios or avoid worst-in-class performers.
One underlying problem is that ESG raters serve various responsible investing interests (see our white paper Navigating ESG [5] for the reasons for ESG investing and how to find the right ESG approach for your beliefs, and our white paper Evolution of Sustainable Investing and the case for integration [6] for deeper background on ESG investment strategies). In practice, the raters usually go about the rating process by developing proprietary methodologies to rank and score companies on the panoply of ESG issues.
As input, ESG raters take data from multiple sources and languages and use models to clean, organize, and weight these diverse data points to create comparability and to flag risks. The scoring models used by ESG raters have their merits in giving structure to decision making, but they also risk giving the impression of scientific rigor when in fact ESG practice is still an art. ESG ratings therefore come with many challenges.
10 challenges of ESG ratings
1. Material factors
This concerns which ESG topics should be included in the model. While greenhouse gas emissions will be commonly assessed, for example, indigenous rights, employee organizations, or lobbying might be more niche topics that only a few raters score. The number of data points evaluated by raters varies from 10 to more than 400, although there is good evidence that counting too much merely weakens the real signal aimed for [7].
2. Choice of metrics
Raters use different metrics to evaluate the same topic. To evaluate employee health and safety, for example, raters choose from 20 different data points [8]. Some research found this to be the dominant reason for rater divergence [9]. Peeling back the layers of what gets measured, the raw underlying data is more inconsistent than you might think.
3. Data quality
Related questions are: how defensible is the ESG data? Is it pure marketing information, given that non-financial information is not required to be certifiable or defensible in the same way that financial statements are? Frequently, metrics supplied by companies are patchy, inherently backward looking, and tend to fall into “good news” storytelling. Some raters exclude data provided by the company itself, even though this can be a rich data source. Similarly, as ESG metrics are frequently qualitative, raters must choose how they interpret and score descriptive matters.
4. Gaps treatment
It is common for companies not to report on all indicators (let alone provide industry comparable metrics). Different statistical tools can be used to fill the gaps with widely different outcomes [10]. Interestingly, a few studies found larger firms experience more disagreement in their scores, suggesting again that more data points can lead to more disagreement between raters. An active investor with good relations with the firm can sometimes overcome data gaps by direct dialogue.
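To make the effect of gap treatment concrete, the following minimal sketch applies three of the options named in the source notes (assign a peer-group average, assign the lowest score, or do not score the indicator at all) to the same company. The indicator values, the 0-10 scale, the peer average of 5.0, and the function names are all hypothetical assumptions for illustration.

```python
def impute(scores, rule, peer_average=None):
    """Fill missing indicator scores (None) according to one gap-treatment rule."""
    if rule == "peer_average":
        return [peer_average if s is None else s for s in scores]
    if rule == "lowest_score":
        return [0.0 if s is None else s for s in scores]
    if rule == "skip":
        # Unreported indicators are simply left out of the average.
        return [s for s in scores if s is not None]
    raise ValueError(f"unknown rule: {rule}")


def overall(scores, rule, peer_average=None):
    """Unweighted mean of the indicator scores after gap treatment."""
    filled = impute(scores, rule, peer_average)
    return sum(filled) / len(filled)


# Hypothetical company: five indicators on a 0-10 scale, two not reported.
reported = [8.0, None, 6.0, None, 7.0]

for rule in ("peer_average", "lowest_score", "skip"):
    print(rule, round(overall(reported, rule, peer_average=5.0), 2))
```

With these made-up inputs the same company lands at 6.2, 4.2, or 7.0 depending solely on the rule chosen, which is the kind of divergence the paragraph above describes.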
5. Timing aspects
The frequency with which raters evaluate a company can have a material bearing on discrepancies between scores. An annual review is not uncommon, but gaps of two years between the latest updates of different raters can also occur.
6. Rater bias
The rating houses have a natural (sometimes outspoken) slant, e.g., a focus on best-in-class, risk, momentum, or climate. It has been observed that raters based in civil-law countries (e.g., Germany and France) are more focused on social issues, whereas raters in common-law countries (e.g., the UK and US) take a shareholder-centric approach and therefore focus more on governance issues [11]. In addition to explicit biases (which are reflected in the materiality assessment), research has shown an unexplained or unconscious “rater effect”: when a rater is generally positive (or negative) on a company, this is reflected across the board, including on unconnected indicators. This could account for 14 to 18 % of rater disagreement [12].
7. Weighting methodology
Next, raters need to decide how much importance to give each indicator in their model. This is largely subjective and not always transparent. Most models have indicators with little to no statistical significance, meaning they are being scored without having any real impact on the overall ESG score (or any link to financial performance) [13].
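A small numerical sketch shows how weighting alone can reverse a ranking. The two companies, their pillar scores, and both weighting schemes below are hypothetical; they are not taken from any rater's methodology.

```python
def weighted_score(pillar_scores, weights):
    """Weighted average of pillar scores; weights are normalized by their sum."""
    total_weight = sum(weights.values())
    return sum(pillar_scores[p] * w for p, w in weights.items()) / total_weight


# Hypothetical pillar scores (0-100) for two companies.
company_x = {"E": 80, "S": 50, "G": 40}
company_y = {"E": 45, "S": 60, "G": 75}

# Two hypothetical raters that differ only in how they weight the pillars.
environment_heavy = {"E": 0.6, "S": 0.2, "G": 0.2}
governance_heavy = {"E": 0.2, "S": 0.2, "G": 0.6}

for name, weights in (("environment-heavy", environment_heavy),
                      ("governance-heavy", governance_heavy)):
    x = weighted_score(company_x, weights)
    y = weighted_score(company_y, weights)
    leader = "X" if x > y else "Y"
    print(f"{name}: X={x:.1f}, Y={y:.1f}, leader={leader}")
```

Both raters see identical underlying scores, yet the environment-heavy scheme ranks company X first and the governance-heavy scheme ranks company Y first.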
8. Controversy handling
Controversies show whether a company walks its sustainability talk, and for many raters they carry high prominence in scoring. To be comparable, controversial incidents have to be evaluated for their impact on society and on the business, which is once again an open field for subjectivity and disagreement.
9. Scoring perspective
As the rater translates the scoring into a final rating, an important input is also the perspective taken.
Relative scoring is commonly used to benchmark performance against peers. But this raises the question – what is the right peer group? Universal comparisons or against the industry peers (there are merits for both)? If the latter, again, raters choose from different industry classification systems, such as GICS, BICS, IVA industries, or perhaps an in-house division of industries. Then throw into the mix how to treat diversified companies, and no wonder a leader in one classification can be only average in another rater’s eyes. Additionally, relative scoring can of course miss the point on sustainability if the entire industry is not addressing the issue well enough.
Absolute scoring is the alternative approach and scores on preset ranges or optimal levels. Subjectivity creeps in on who sets the benchmark and then this leads to natural tilts away from certain industries or countries, which commonly underperform in certain areas, e.g., diversity in the financial sector or on Chinese boards.
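The difference between the two perspectives can be sketched with a single metric. In the example below, the metric (share of women on the board), the peer values, and the preset floor and target are hypothetical assumptions, as are the two scoring functions; real raters use their own scales and peer groups.

```python
def relative_score(value, peer_values):
    """Percentage of the peer group whose value is at or below this one."""
    at_or_below = sum(1 for v in peer_values if v <= value)
    return 100.0 * at_or_below / len(peer_values)


def absolute_score(value, floor, target):
    """Linear score from a preset floor (0) to a preset target (100), clipped."""
    scaled = 100.0 * (value - floor) / (target - floor)
    return max(0.0, min(100.0, scaled))


# Hypothetical metric: share of women on the board, in percent.
# The peer list includes the company being scored.
industry_peers = [5, 8, 10, 12, 15]
company_value = 15

print(relative_score(company_value, industry_peers))      # 100.0
print(absolute_score(company_value, floor=0, target=40))  # 37.5
```

The company is the best in its (hypothetical) industry and so tops the relative scale, while the absolute scale shows it is still far from the preset target. This is the case where relative scoring can miss the point because the whole industry lags.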
10. Aggregation of ratings
Portfolios are also scored on their average ESG rating. In truth, average fund scores tend to be tightly clustered in a narrow spread, so a top-rated fund may not have an average score notably ahead of a weak fund. At fund level the aggregated score is even further removed from the underlying raw data and is now in black-box territory in terms of what the score really ought to tell you: how exposed you are to risks and whether those risks have been adequately priced in.
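As a rough illustration of how much an average can hide, the sketch below computes a weight-averaged score for two hypothetical funds. The holdings, weights, and scores are invented for the example, and the helper names are assumptions rather than any provider's method.

```python
def portfolio_score(holdings):
    """Weight-averaged ESG score; holdings is a list of (weight, score) pairs."""
    total_weight = sum(w for w, _ in holdings)
    return sum(w * s for w, s in holdings) / total_weight


def weakest_holding(holdings):
    """Lowest single-company score in the portfolio."""
    return min(s for _, s in holdings)


# Hypothetical funds, equally weighted across four holdings each.
fund_a = [(0.25, 90), (0.25, 85), (0.25, 20), (0.25, 25)]  # leaders plus laggards
fund_b = [(0.25, 55), (0.25, 57), (0.25, 52), (0.25, 56)]  # all mid-range

for name, fund in (("fund A", fund_a), ("fund B", fund_b)):
    print(name, portfolio_score(fund), weakest_holding(fund))
```

Both funds average 55, yet fund A holds two companies scored 25 or below. A stock-level view keeps that concentration visible where the average does not.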
How to sail around these challenges?
A deafening demand across the ESG industry is for companies to supply better quality and more comparable data. This should address a major reason for disagreement amongst raters. There are various voluntary industry and legal initiatives [14] working to create a common set of metrics on which all companies should report. As another way to mitigate the problem, a new wave of artificial-intelligence-driven ESG ratings is being designed to overcome unconscious human biases and normalize for size and industry skews. Other major trends are the increasing use of unconventional data sources [15] to get more impartial risk insights, as well as consolidation within the rating industry. The major raters have been on a land grab in the last few years, buying up smaller, niche players, which suggests that a consolidation of ESG theorization may emerge. At the same time, however, sell-side analysts have entered the space, adding alternative views [16].
An active, high-conviction manager should look beyond aggregated ratings
For the thoughtful investor, this disillusion with ratings requires looking beyond frameworks and adopting a multi-layered approach. To start with, use informative data from the ESG raters to feed your own in-depth assessment and enrich fundamental equity analysis. A step-by-step process of investigation leads to a much more detailed and holistic understanding of a company, its flaws and beauty spots, while always focusing on the few issues that are really material to that company. This detailed appreciation of the top ESG risks that can impact performance is much more informative to an active investor than the specific score crunched out at the end of the rater’s model. The real goal is to use ESG information to understand whether the company in question has the ability to withstand its top risks in a one- to five-year time frame. Still, at some point you want to aggregate your findings at portfolio level, and this is when you have to make sure not to lose details when zooming out. One way to go about it is to visualize the stock-level findings in a tile chart, which is an aggregation of the more detailed company-by-company ESG risk assessment. This way, risk concentrations are easy to spot without losing the important details on where exactly those risks come from.
Sources and Comments:
1. Voorhes, 2018.
2. Voorhes, 2018.
3. This is the average of the mean correlation of the following four papers. Bender, et al., 2018 found correlation between four leading raters ranged from 0.47 to 0.76 with an average of 0.59. Gibson, et al., 2019 found average correlation between six prominent raters was 0.46. Berg, et al., 2019 found a correlation range of 0.42 to 0.73 with an average of 0.61 in their assessment of five leading ESG raters. Chatterji, et al., 2016 had the lowest mean correlation of 0.3 for six well-known raters (with a range from –.012 [indicating severe disagreement] to 0.67, and only a quarter of the correlations were higher than 0.5).
4. Berg, et al., 2019.
5. Plinke & Münstermann, 2019.
6. Hammerich & Kesterton, 2018.
7. The Sustainability Accounting Standards Board (SASB) is leading the charge on addressing this with its endeavor to create consensus on material ESG issues for each industry and sub-sector.
8. Kotsantonis & Serafeim, 2019.
9. Berg, et al., 2019; Chatterji, et al., 2016.
10. E.g., do you assign the industry average (or universal or home market peer group average), score with the lowest score, use some other statistical model, or not score at all? Kotsantonis & Serafeim, 2019 examines this in detail.
11. Gibson, et al., 2019.
12. Berg, et al., 2019.
13. Berg, et al., 2019.
14. EU Non-Financial Reporting Directive has required ~6,000 EU companies to publish ESG data since 2017 annual results. Plenty of other regulatory requirements come from stock exchanges (UNSSE, ESMA); international and domestic law (e.g., legislation in discussion under EU Action Plan, French Article 173, China mandatory ESG disclosure by 2020); principles frameworks (i.e., ICMM, TCFD, SDGs, GRI, UN Global Compact); or voluntary disclosure frameworks (SASB, GRI, CDSB). The alphabet soup is discussed further in Temple-West, 2019.
15. E.g., geographic information systems data (e.g., for real estate at risk), loyalty scores and customer reviews, independent product recall data, supply chain mapping, non-government organization reports, employee review sites, and many more.
16. Naumann, 2019.