In the previous post I raised various issues with the ODNI Rating Scale for intelligence products. At least some of these are serious problems. Some time in the future ODNI might put out a new and improved version of the Rating Scale, but it is hard to imagine that a new version would be a radical departure. Partly this is because in many ways the Rating Scale is very well suited to its task, and partly because change is usually slow and incremental anyway.

Still, it is worth asking what a radical alternative might look like. As it happens, my SWARM colleague, Ashley Barnett, has been looking into these issues. After reviewing many reasoning evaluation methods, he had the idea of a “stress test” approach, or more generally what we are now calling a “negative” method.

Most if not all existing evaluation methods, such as the ODNI Rating Scale, are “positive” in the sense that they present a set of general standards (attributes, criteria, virtues) which, ideally, will all be met by a product. A rater has to check the extent to which every one of these standards is met. A product accumulates merit (e.g., points) to the extent that each one is met.

In a negative method, a rater looks for specific problems rather than the positive satisfaction of criteria. A product is deemed good to the extent that problems cannot be identified, or equivalently, it is bad to the extent that problems are identified. A product loses merit in proportion to the number and severity of the problems. 

Our group is in the early stages of developing a negative method for evaluating the quality of reasoning (QofR) in written texts such as intelligence products. This method essentially asks raters to identify and describe specific types of reasoning flaws. The overall QofR in the product is then a function of the number and severity of the identified flaws.

The Flaws method, as we’re currently calling it, potentially has a number of advantages over positive methods.

  • Depending on what you expect a method to deliver, the Flaws method (or more generally a “negative” method) might be much more efficient than a positive method. Rather than checking conformity to every standard, a rater can “zero in” on specific problems, and quickly conclude that a product is poor. Also, it may be easier to confirm the presence of specific flaws than it is to verify the level of conformity to a general standard. This efficiency may have flow-on benefits such as reducing fatigue and boredom.
  • The Flaws method focuses rater attention on specific aspects of reasoning, and this may mitigate the impact of confounding factors such as clarity of communication or knowledge of the correct answer to a reasoning problem.
  • The Flaws method delivers a list of specific flaws in the reasoning, which can be translated into a set of actions for improving a product.

Generally speaking, the Flaws method consists of a rubric at the heart of a process supported by a cloud platform.

The rubric is a guide for a single rater to assess the QofR in a single product. It would consist of:

  • A list of specific types of reasoning flaws, such as selecting on the dependent variable, or mis-estimating the reliability of a source. These of course can be organised into categories, such as flaws concerned with the treatment of sources. Each flaw can have an extensive entry including a detailed description, a checklist for verifying the presence of the flaw, examples of how this flaw occurs in products of various kinds, etc.
  • Instructions for properly identifying a flaw – e.g. providing a description, a location in the product (if possible), and an estimate of the severity of the flaw.
  • Rules or guidelines for deriving an overall assessment of the QofR in the product in terms of the number and severity of identified flaws. For example, a single “fatal” flaw would be sufficient to result in an overall quality rating of Poor. The rubric would specify and explain various levels of severity.
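
To make the third element concrete, here is a minimal sketch of how a flaw record and a scoring rule might look. Only the "fatal" severity level and the "Poor" rating come from the description above; the other severity levels, the other rating labels, the penalty thresholds, and every name in the code are hypothetical placeholders, not part of any actual rubric.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Only FATAL is named in the post; the other levels are illustrative.
    MINOR = 1
    MODERATE = 2
    MAJOR = 3
    FATAL = 4


@dataclass(frozen=True)
class Flaw:
    flaw_type: str                  # e.g. "selecting on the dependent variable"
    description: str                # the rater's account of the problem
    severity: Severity
    location: Optional[str] = None  # where in the product, if it can be pinned down


def overall_rating(flaws):
    """Derive an overall QofR rating from the identified flaws."""
    # A single fatal flaw is sufficient for a rating of Poor.
    if any(f.severity == Severity.FATAL for f in flaws):
        return "Poor"
    # Otherwise the product loses merit with the number and severity of flaws.
    # The thresholds below are arbitrary and purely for illustration.
    penalty = sum(int(f.severity) for f in flaws)
    if penalty == 0:
        return "Excellent"
    if penalty <= 3:
        return "Good"
    if penalty <= 6:
        return "Fair"
    return "Poor"


flaws = [
    Flaw("mis-estimating the reliability of a source",
         "A single-source claim is treated as confirmed.",
         Severity.MODERATE, location="paragraph 3"),
    Flaw("selecting on the dependent variable",
         "Only cases where the outcome occurred are examined.",
         Severity.MINOR),
]
print(overall_rating(flaws))
```

A real rubric would presumably explain each severity level in prose and might not reduce to a simple sum at all; the point is only that the rating is a function of the number and severity of flaws, with a fatal flaw overriding everything else.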

The process specifies how raters, aided by the rubric, generate a QofR assessment for a product. The simplest process would just be a single rater following the rubric. However, a more complex process may deliver better results. So the process would specify things like:

  • The number of raters involved, and their roles
  • The qualifications and/or training required for raters
  • How raters interact, if at all
  • How raters’ contributions are aggregated into a single overall assessment
  • How raters are evaluated, if at all
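
As one illustration of the aggregation step, the sketch below combines several raters' overall ratings by taking the median on an ordinal scale. This is an assumption made for the sake of example: the post does not say how contributions would be aggregated, and the rating labels other than "Poor" and the function name are hypothetical.

```python
from statistics import median_low

# Hypothetical ordinal scale, worst to best. Only "Poor" appears in the post.
RATING_ORDER = ["Poor", "Fair", "Good", "Excellent"]


def aggregate_ratings(ratings):
    """Combine independent raters' overall ratings into a single rating.

    Takes the median rank. With an even number of raters, median_low picks
    the lower of the two middle ranks, so ties resolve toward the harsher rating.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    ranks = [RATING_ORDER.index(r) for r in ratings]
    return RATING_ORDER[median_low(ranks)]


print(aggregate_ratings(["Good", "Poor", "Fair"]))
```

A median is only one option. A process in which raters interact might instead pool the identified flaws first and derive a single rating from the merged list.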

The platform supports the process, particularly if the process allows raters to participate remotely or asynchronously. A good platform can:

  • Reduce effort for individual raters
  • Support a complex process, e.g. set up interactions among raters for a particular product
  • Reduce administrative burdens by e.g. automatically collating inputs
  • Use analytics, in the case of a very sophisticated platform, to further enhance the process. For example, certain types of products might typically have certain flaws. An adaptive platform can become better at guiding raters’ attention to such possibilities. Or it might have mechanisms to help incentivise raters to perform well.

Our group is not (at least not yet!) trying to develop a genuine alternative to the ODNI Rating Scale. That would be a much larger task. For one thing, quality in an analytic product has many more dimensions than quality of reasoning in our sense. Still, we are beginning to speculate that there might be something like a Flaws method (or more generally a negative method) which might, in some circumstances at least, be more useful than the ODNI Scale.

Photo by Miguel A. Amutio on Unsplash