This post considers a number of issues with the analytic product evaluation method described in the document ODNI Rating Scale for Evaluating Analytic Tradecraft Standards (Rating Scale), the most authoritative guidance for the evaluation of analytic products in the U.S. intelligence community. It considers the following questions:
- Is the method reliable?
- Is the method valid?
- Are the Analytic Standards adequately delineated?
- Should – and how should – an overall score be calculated?
- Is the method efficient?
- How does the method treat expertise?
But first, a brief description of the Rating Scale, since the document describing it (see references at end of this post) does not appear to be publicly available.
About the Rating Scale
As part of a larger effort to raise the quality of intelligence analysis, the U.S. Office of the Director of National Intelligence (ODNI) has provided Intelligence Community Directive 203: Analytic Standards (ICD 203). This document describes a set of Standards for the production and evaluation of analytic products. There are five primary Standards:
- Objectivity;
- Independence from political considerations;
- Timeliness;
- Being based on available sources of intelligence information; and
- Implementing and exhibiting the Analytic Tradecraft Standards.
The Analytic Tradecraft Standards are:
- Proper description of sources, data and methodologies;
- Proper expression of uncertainties;
- Proper distinguishing of information from assumptions and judgements;
- Incorporating analysis of alternatives;
- Demonstrating customer relevance and addressing implications;
- Use of clear and logical argumentation;
- Explanation of change to or consistency of judgements with previous reports;
- Making accurate judgements or assessments;
- Incorporating effective visual information where appropriate.
The Rating Scale was developed by the ODNI’s Analytic Integrity and Standards Group (AIS) to provide guidance for evaluators assessing the extent to which a product meets the Analytic Tradecraft Standards. It is, basically, a rubric, though intended for use in the workplace rather than in education. For each of the nine Analytic Standards, the Rating Scale provides:
- A definition of the Standard;
- Brief instructions for applying four rating levels: Poor (0), Fair (1), Good (2), and Excellent (3) – though for Standard 8, the levels are Unclear, Conditioned, and Unconditioned; and
- A page or so of further guidance.
It also provides some general guidance. Two aspects of that guidance are of note here. First, the document describes the process raters should use in evaluating a report. Key components are:
- Raters should provide “narrative comments” elaborating on their ratings for each of the Standards.
- Proper use of the scale is a team activity: each product is first assessed by two raters, who can discuss and agree on a consensus evaluation, and is then quality-checked by a third rater.
- Evaluations are entered into a database “used to generate statistical information and reports of various kinds.”
Second, like ICD 203, the Rating Scale says that the Standards should be applied in a manner appropriate to the “length, purpose, classification, and production time-frame of each product” and that such factors should be discussed in the narrative comments.
ICD 203, and the Rating Scale, reflect a great deal of wisdom accumulated over many decades about what matters in intelligence generally, and specifically in written products. The Scale was produced by people who know their domain, and manifestly has high “face validity.”
However, constructing a high quality assessment rubric and an associated assessment process is difficult. It generally requires expertise in evaluation itself, not just domain or “subject matter” expertise. I don’t know whether any evaluation experts were involved in the development of the Rating Scale. If none were involved, you would expect the Rating Scale method to have some serious problems. If evaluation experts were involved, you’d still expect the Scale to have some issues, because no perfect evaluation method exists in any complex domain, and such methods always have limitations and make compromises.
Is the Rating Scale method reliable?
A good evaluation method should be reliable. That is, it should give pretty much the same answer each time it is applied to a product. So for example, when the Rating Scale method is used by two distinct teams of raters, those two teams should get close to the same answer. This is called “inter-rater reliability” (IRR) and there are technical ways to measure it.
There appears to have been little research done on the reliability of the Rating Scale. None had been published prior to 2017, when our group, the SWARM Project, started to investigate the issue. Of course, some such research may have been done by the AIS or elsewhere behind the walls of the IC, and not published or otherwise released.
The research done by our group and reported in the article Better Together suggests that IRR is poor when the Rating Scale is used by a single person. Having products rated by individuals working independently, and then combining their ratings, increases IRR for the combined ratings. Interestingly, the research showed that a reasonable level of IRR is reached when you generate combined ratings from three independent raters. You’ll recall that three is the number of raters specified by the Rating Scale method, though the two approaches work a bit differently. There are a number of limitations in our team’s research, described in the article. Overall, it seems fair to say we (outsiders) don’t know how reliable the Rating Scale method is when deployed on real intelligence products in the IC, though the SWARM research does suggest cautious optimism.
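To make the idea of measuring IRR a little more concrete, here is a minimal Python sketch of one common agreement statistic, Cohen's kappa, for two raters scoring the same products on the 0-3 scale, together with simple averaging of independent raters' scores. The ratings are invented for illustration. Both kappa and plain averaging are assumptions chosen for simplicity; they are not necessarily the statistic or the combination method used in Better Together, and for ordinal scales a weighted kappa or an intraclass correlation may be a better fit.

```python
from collections import Counter
from statistics import mean


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Proportion of products on which the two raters gave the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own level frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[level] * freq_b[level] for level in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical level throughout
    return (observed - expected) / (1 - expected)


def pooled_ratings(*raters):
    """Average several independent raters' scores, product by product."""
    return [mean(scores) for scores in zip(*raters)]


# Hypothetical ratings (0 = Poor ... 3 = Excellent) of six products on one Standard.
rater_1 = [3, 2, 2, 1, 0, 3]
rater_2 = [3, 2, 1, 1, 0, 2]
rater_3 = [2, 2, 2, 1, 1, 3]

print(cohens_kappa(rater_1, rater_2))             # agreement between two raters
print(pooled_ratings(rater_1, rater_2, rater_3))  # combined rating per product
```

A kappa of 1 means perfect agreement and a kappa near 0 means agreement no better than chance.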
Is the Rating Scale method valid?
A good evaluation method should also be valid, i.e., it should measure what you want it to measure. As mentioned above, the Rating Scale method certainly has face validity; roughly, it seems to be talking about the right thing. However, when deployed on real products by real human raters in real work situations, does the method actually measure adherence to Analytic Standards, or does it end up reflecting other things?
Assessing validity is not easy. The obvious idea is that you should compare the results of applying the method with some gold standard, i.e. some independent measure of the quality of the products. Trouble is, there is no such gold standard. The Rating Scale method is the only measure we have, and so by default the best measure we have, but it can’t be its own yardstick.
Again, there appears to have been no published research on the validity problem prior to 2017, when SWARM started looking at such issues. Some findings from the study described in Better Together point to possible validity problems. The raters in that study, after they had completed their rating work, participated in a focus group and answered a survey. One theme that emerged was that the Rating Scale method appears to encourage a “box-ticking” approach where a rater – particularly, presumably, one who is tired or bored – checks for superficial signs of adherence rather than “deep quality.” For example, a rater is required to assign a rating level on Standard 3, proper distinguishing of information from assumptions and judgements. To do this well requires understanding what assumptions are being made, and this can be quite intellectually challenging. The rater may be tempted to just check that there is some such distinguishing activity, and perhaps award more points for more such activity, with little regard to how “proper” it really is. In other words, the rater may be rewarding form rather than substance. To the extent that this happens, the Rating Scale method would not really be assessing adherence to Analytic Standards, so much as the appearance of such adherence; and the method would be less than fully valid.
In other respects, our team’s initial investigations are more positive. In another study, reported in the article ODNI as an Analytic Ombudsman, evaluations generated using the Rating Scale method were compared with evaluations produced by members of our own team, who used a different approach to assess quality. Our team members were assumed, for the purposes of the study, to be experts and thus able to provide some sort of yardstick, even if not an ideal gold standard. The Rating Scale evaluations were broadly in line with the SWARM expert evaluations.
However, various limitations in this academic research, including the difficulty of extrapolating to real use, and the absence of any other research, mean that we really don’t know how valid the Rating Scale method is in practice.
Are the Analytic Standards adequately delineated?
Ideally the criteria on which products are assessed – in the current case, the Analytic Standards – would be clear, crisp, and distinct. The raters in the Better Together study pointed to a number of concerns about this:
- A number of the Standards encompass multiple different issues. For example, Standard 1 requires proper description of sources, data and methodologies. One can easily imagine a product that is meticulous in describing sources and data, but makes no mention of methodology. This can happen because describing sources and data, and describing methodology, seem quite distinct and separable activities. When they are lumped together in one Standard, it becomes difficult for a rater to assign a single overall rating adequately reflecting the level of adherence to each component. Perhaps proper description of methodology should be pulled out into a separate Standard. However, that strategy would take us down the path of a proliferation of Standards, a longer and more complex Rating Scale, and a more laborious method.
- A second concern is that the guidance provided in the Rating Scale document is, at least for some Standards, not enough to enable a rater to determine what rating level to assign. The Analytic Standards are broad and abstract, and cover very complex territory. When whole textbooks have been written on clear and logical argumentation, and how to assess it – and when those textbooks don’t always agree with each other – it is effectively impossible to provide, in a few pages, guidance sufficient to unambiguously determine which of four rating levels applies in any given case (for anyone who needs such guidance; see the discussion of expertise below).
- Third, matters are made more complex by overlap between Standards. (In the technical parlance, they don’t have full specificity.) For example, a great deal of intelligence is by nature abductive, or “inference to the best explanation.” Good abductive inference inherently involves assessing the merits of alternatives. It therefore seems that Standard 4 – Incorporating analysis of alternatives – is part and parcel of Standard 6, Using clear and logical argumentation. So interpreting and applying Standard 6 seems to involve interpreting and applying Standard 4. More broadly, this implies that, at least in some situations, these two Standards should be collapsed into one. This concern pulls in the opposite direction to the first point just above, which suggested breaking some Standards apart.
These apparent flaws, as well as being concerns in their own right, reinforce the previous concerns about the general reliability and validity of the Rating Scale method.
Note that this issue of delineation is as much a problem for ICD 203, which specifies the Analytic Standards, as it is for the Rating Scale method.
Should – and how should – an overall score be calculated?
The Rating Scale says that a rater should give a product a rating – Poor (0), Fair (1), Good (2), and Excellent (3) – on each of the Standards, and provide narrative comments. Because a number is provided for each rating level, it would be easy enough to add those numbers up to get a single overall quality score. The overall scores for each of the three raters in the rating process could then be (say) averaged to provide a single aggregate quality score.
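As a minimal sketch of the arithmetic just described, the following Python adds up each rater's numeric levels and then averages the three totals. The ratings are made up, and Standard 8 is left out because the Rating Scale gives its levels no numbers.

```python
from statistics import mean

# Standards with numeric levels in the Rating Scale (Standard 8 has none).
NUMERIC_STANDARDS = (1, 2, 3, 4, 5, 6, 7, 9)


def overall_score(ratings):
    """Sum one rater's 0-3 ratings across the numerically rated Standards."""
    return sum(ratings[s] for s in NUMERIC_STANDARDS)


def aggregate_score(all_ratings):
    """Average the overall scores of several raters."""
    return mean(overall_score(r) for r in all_ratings)


# Hypothetical ratings from three raters: {Standard number: level}.
raters = [
    {1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 2, 7: 1, 9: 2},
    {1: 2, 2: 2, 3: 1, 4: 1, 5: 3, 6: 2, 7: 1, 9: 2},
    {1: 3, 2: 3, 3: 2, 4: 2, 5: 3, 6: 3, 7: 1, 9: 1},
]

print([overall_score(r) for r in raters])  # one total per rater
print(aggregate_score(raters))             # mean of the three totals
```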
The Rating Scale document is silent as to whether these kinds of aggregations should be performed. I don’t know whether the AIS computes overall scores. If no overall scores are generated, then the Rating Scale delivers a multi-dimensional quality profile for a report, not a score. Delivering a profile rather than a single score can be a fine and appropriate thing to do, but managers and others will always be sorely tempted to “boil things down” to single scores. I’d be surprised if there wasn’t some kind of aggregation across Analytic Standards going on. Certainly when the SWARM Project was using the Rating Scale in various research exercises, we calculated single overall scores, and we have been reporting those to research participants and in publications.
If you do calculate overall scores, there are various options. The most obvious is to just add up the ratings. Each of Standards 1-7 and 9 has a maximum score of 3 points, so the overall maximum score would be 24. But this approach has a number of problems:
- Standard 8, which has no rating level numbers in the Rating Scale, is left out of the overall score. Yet Standard 8 goes to the very heart of good intelligence (accurate judgements). Indeed the fact that Standard 8 has no numbers can be taken to imply that overall scores should not be calculated. Of course, there is a workaround – just assign numbers to the three levels in Standard 8.
- Simple addition of scores treats all Analytic Standards as equally important, but this is implausible. To take one example, Standard 9 is about incorporating visual information (charts, etc.). While this can be a very good thing to do, it doesn’t seem right that this should be on par with using clear and logical argumentation, or making accurate assessments. More generally, how likely is it that all the Analytic Standards are of exactly – or even approximately – equal importance?
- An alternative is to do a weighted sum of ratings, with the numerical weights expressing the relative importance of the Standards. But this creates two more problems.
  - What should the weights be? How do we determine how much more or less important Standard 7 is than Standard 9?
  - The relative importance of the Standards will probably depend on the situation. As noted above, the Rating Scale recognises the Standards should be applied in a manner appropriate to the “length, purpose, classification, and production time-frame of each product”. So we would need to apply different weightings in different contexts. Who decides how to adjust the weightings, and what approach do they use? How can this be done without introducing so much noise that the desired aggregate scores and statistics become meaningless?
- Any simple additive approach effectively treats each of the Standards as optional. That is, a product could achieve a high overall score even if it gets a low rating on a particular Standard. To take this to an extreme, a product could make howling errors of logic and yet be excellent in all other regards, and so receive a high overall score. Surely at least an adequate level of logical reasoning should be a minimum requirement, not just another way to gather some points. As Mark Burgman has argued, this could be handled by making the aggregation rule-based, not additive. For example, the aggregation rule may specify that to be Excellent overall, a product has to be Excellent in logical reasoning, and at least Good in all other respects. This seems compelling, but it raises the problem of what the rules should be. The challenge of determining relative importance of Standards re-enters the picture here. (Note: interestingly, the Rating Scale does use a rule-based approach in its guidance for the rating level to assign for each Standard. It is silent on aggregation across Standards, and so doesn’t use a rule-based approach for such aggregation.)
- Yet another option is to make aggregation multiplicative. That is, instead of adding the rating numbers for each Standard, multiply them. If a report has terrible logic and gets a Poor (0), then the report gets 0 overall. Again, however, the weighting issue (and/or the need for some kind of rules) remains. Should a report really get 0 overall if it uses no visual techniques and so gets a 0 on Standard 9?
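The aggregation options in this list can be compared side by side. The Python below is a rough sketch only: the weights are arbitrary placeholders rather than recommended values, and the rule is just the single example given above (Excellent in logical reasoning, at least Good elsewhere). Standard 8 is again omitted because it has no numeric levels.

```python
from math import prod

EXCELLENT, GOOD = 3, 2
LOGIC = 6  # Standard 6: use of clear and logical argumentation


def weighted_sum(ratings, weights):
    """Additive score in which each Standard counts in proportion to its weight."""
    return sum(weights[s] * level for s, level in ratings.items())


def excellent_overall(ratings):
    """Example rule: Excellent on logic and at least Good on everything else."""
    return ratings[LOGIC] == EXCELLENT and all(
        level >= GOOD for s, level in ratings.items() if s != LOGIC
    )


def multiplicative(ratings):
    """Product of ratings: a Poor (0) on any Standard zeroes the whole score."""
    return prod(ratings.values())


# Hypothetical product: strong everywhere except for howling errors of logic.
ratings = {1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 0, 7: 3, 9: 3}
# Placeholder weights (assumed): logic counts triple, visuals count half.
weights = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 3, 7: 1, 9: 0.5}

print(sum(ratings.values()))           # simple sum
print(weighted_sum(ratings, weights))  # weighted sum
print(excellent_overall(ratings))      # rule-based verdict
print(multiplicative(ratings))         # multiplicative score
```

For this made-up product the simple sum is 21 out of 24 despite the failure on logic, while the rule and the product both flag it. None of the variants settles the question of how much each Standard should count.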
Is the Rating Scale method efficient?
Any evaluation method with humans in the loop involves some amount of intellectual work. Ideally, a method will be efficient, in the sense that it usually takes no more work than is really needed to get the desired result, i.e. to deliver an assessment of appropriate quality. So how efficient is the Rating Scale method?
The Rating Scale method is what I’ll call a positive approach. ICD 203 specifies all the major attributes a product should have by way of conforming to Analytic Standards. The Rating Scale method requires a rater to assess the extent to which all these attributes are present. Further, for many Standards, the Rating Scale specifies sub-attributes which should be checked. For example, for Standard 4, the guidance says “Analyst products should identify and assess plausible alternative hypotheses…In discussing alternatives, products should address factors such as associated assumptions, likelihood, or implications related to U.S. interests.” To assess whether a product has done this properly, a rater would need to think through for herself what the associated assumptions, likelihood(s), or implications are, and then check whether the product has addressed those adequately. It seems that the rater’s task is in some ways as large and difficult as the product writer’s task, if not more so. In short, the Rating Scale is constructed to require positive, comprehensive verification that all important qualities are present. This might require a huge amount of work.
A negative approach looks to find specific problems, and deems a product good to the extent that such problems can’t be found. It might be called a “one minus” approach. In the positive approach, a product starts at zero and accumulates points for each positive attribute identified. In the negative approach, a product starts at the maximum, and loses points for specific problems found.
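The contrast between the two approaches can be put in a few lines of Python. This is only an illustrative sketch: the findings, point values and penalties are invented, and neither function comes from the Rating Scale.

```python
MAX_SCORE = 3


def positive_score(attributes_found, points_per_attribute=1):
    """Start at zero and add points for each required attribute verified."""
    return min(MAX_SCORE, points_per_attribute * len(attributes_found))


def negative_score(problems_found, penalty_per_problem=1):
    """Start at the maximum and deduct points for each specific problem found."""
    return max(0, MAX_SCORE - penalty_per_problem * len(problems_found))


# Hypothetical findings for one product on one Standard.
verified = ["alternatives identified", "assumptions addressed"]
problems = ["likelihood of alternatives not discussed"]

print(positive_score(verified))  # built up from zero
print(negative_score(problems))  # deducted from MAX_SCORE
```

Under the positive scheme an empty list of findings means a score of zero, whereas under the negative scheme it means the maximum.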
Depending on what you are looking for in an evaluation, a negative approach might be much more efficient than a positive one. If you want the evaluation to provide feedback to the product author on the extent to which they have exhibited every aspect of the Analytic Standards, then a positive approach would be required. But you might just want an evaluation to help you decide the overall quality of a finished product, or to help you determine whether a product needs further revision. Relative to these kinds of objectives, a negative approach might involve much less work, because it doesn’t force the rater to consider every aspect of the product. A competent rater may be able to quickly identify potential problems in the product, and so determine that a product is substandard overall, or in particular respects, without needing to explicitly consider every attribute.
How does the Rating Scale method treat expertise?
This is closely related to the efficiency issue. A large research literature on human expertise indicates that novices and experts think in fundamentally different ways (H. Simon, Ericsson, Dreyfus, Klein, etc.). When experts have had extensive prior experience with suitable feedback, they acquire the ability to intuitively and holistically assess complex situations and quickly focus attention on relevant aspects. Novices, by contrast, have little immediate sense of what matters in a situation, and need to consciously follow explicit rules or guidelines. Since these rules had to be formulated for them in advance, and must be general and comprehensive enough to accommodate many variations on each type of situation, the rules can become quite lengthy and detailed.
The Rating Scale document currently weighs in at 26 pages, with the nine Analytic Standards getting about 2 pages each. This amounts to quite a bit of guidance, but as discussed above, it may not be enough for novice raters. The guidance for each Standard could in principle be further elaborated in much more detail, providing many more steps for such raters to follow. Taken to the extreme, the Rating Scale might be the size of a large textbook, or many textbooks. Arguably, however, the Rating Scale already provides too much guidance for expert raters. Depending on their objective, expert raters don’t need to pay attention to all the issues raised in the guidance, or to follow all the instructions. They can zero in on the critical issues and quickly determine whether and in what respects a product may be deficient, or that a product is good if such deficiencies can’t be found. (By analogy, inspired by Gary Klein’s well-known work, an experienced firefighting captain can judge a situation to be safe if nothing stands out as indicating that it might be unsafe, while a rookie would have to run through a checklist of potential issues.)
The fundamental problem is that the Rating Scale has not been designed to accommodate differing levels of expertise. It ends up compromising on a middle course, providing insufficient guidance for novices and demanding too much needless work, and imposing too much constraint, on true experts. It treats all raters as merely competent, which may be fine for many but presumes too much of novices, and interferes with the intuitive expertise of genuine experts.
References
Klein, G. A. (1998). Sources of power: how people make decisions. Cambridge, MA: MIT Press.
Marcoci, A., Burgman, M. A., Kruger, A., Silver, E., McBride, M., Singleton, T. G., … Vercammen, A. (2019). Better Together: Reliable Application of the Post-9/11 and Post-Iraq US Intelligence Tradecraft Standards Requires Collective Analysis. Frontiers in Psychology, 9, Article 2634.
Marcoci, A., Vercammen, A., & Burgman, M. A. (2018). ODNI as an analytic ombudsman: Is Intelligence Community Directive 203 up to the task? Intelligence and National Security.
Office of the Director of National Intelligence. (2015). Rating Scale for Evaluating Analytic Tradecraft Standards with Amplified Guidance for Evaluators (last revised on 6 November 2015).
Acknowledgements: this post draws heavily on the two SWARM articles cited, and discussions with SWARM colleagues, including Ashley Barnett and Mark Burgman.