Friday, April 19, 2024

Let’s Talk About Data Quality For a Moment

The recently released Cass Review Final Report (Cass Review) has criticized the absence of “high quality evidence” supporting the use of puberty blockers to treat transgender youth, and it makes similar criticisms in other areas of transgender research.

The systematic reviews performed as part of the Cass Review applied a “modified” version of something called the Newcastle-Ottawa Scale (NOS).  Several of these reviews mention “modifying” the NOS, but they do not disclose the nature of the modifications made.  Broadly speaking, they classify the vast majority of studies as “low quality”, while the final report spends quite a bit of time presenting “double blind” studies as the “gold standard” for high quality data.

Let’s talk about that a bit further, shall we?  (This will be one of several posts on the Cass Review Final Report.)

What Does Data Quality Mean?

The “quality” of research studies is a significant concern in all human-centred domains of research.  The reliability of studies is influenced by a wide range of factors, from study design to the number of participants to the statistical analysis applied to them.  The core issue is establishing whether or not the results can be generalized beyond the study itself.

This is the first point where the Cass Review is misleading.  They use the term “low quality” without actually providing a framework within which the casual reader might understand the concept.  “Low quality” is better understood as “difficult to generalize beyond the study group”, rather than “bad data”. 

Tiers of Study Design

In terms of study design, there is a wide range of clinical study types out there, and we have to understand each of them for the contribution it can make to the knowledge base in a given domain.  Some study designs are more generalizable than others; some do not generalize at all, but still provide important information that can guide future research.  When we are working with transgender people, the intersection of physical and mental health concerns weighs heavily, and study design has to deal carefully with this - it's not solely about the physiological effects of treatment.

Statistical Generalizability

A key aspect of studies is the ability to "generalize" them.  Generalizability refers to whether or not particular findings can be applied more broadly than the study group.  Like most things in statistics, it is a matter of degree, driven in large part by the number of participants (sample size), and to a lesser degree by the design of the study itself and the instruments used to assess outcomes.
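
To put rough numbers on that, here is a minimal sketch of how the uncertainty around a simple proportion shrinks as the sample grows.  The figures are purely illustrative - they aren't drawn from any of the studies under review - and the calculation is just the standard normal approximation for a proportion.

```python
# Rough sketch: 95% margin of error for an observed proportion, using the
# standard normal approximation z * sqrt(p * (1 - p) / n).  Illustrative only.
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed in n participants."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (20, 50, 100, 1000):
    moe = margin_of_error(0.5, n)          # p = 0.5 gives the widest interval
    print(f"n = {n:4d}: a 50% finding is really 'somewhere between "
          f"{50 - 100 * moe:.0f}% and {50 + 100 * moe:.0f}%'")
```

With 20 participants, a "50%" result could plausibly sit anywhere from roughly 28% to 72%; with 1,000 participants it narrows to about 47% to 53%.  That is the sense in which a finding does, or does not, project beyond the study group.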

At the very “bottom” of the data quality heap are what are called “Case Studies”.  These are studies that focus on the characteristics of an individual.  They are often the first place a unique situation is identified, and they include a discussion of the treatment path that was used.  Case Studies can be incredibly important in identifying gaps in existing research, which can inform future research efforts.  But a single Case Study really doesn't apply to the broader community, and even a large number of Case Studies isn't going to generalize very well.

Then we come to small sample studies.  These often come about because a clinic notices that a group of its clients all seem to be experiencing similar symptoms, and someone undertakes to start quantifying things.  With small sample sizes, the depth of statistical analysis possible is quite limited, and the results may or may not apply more broadly, but they can provide important insights into client groups that would get lost in larger samples.  These studies almost always have significant sampling issues, related to demographic characteristics among other factors, that make them nearly impossible to project onto a larger population.

Randomized Controlled Trials (RCTs) are a specific type of study design in which participants are randomly assigned either to the proposed treatment group or to an alternative (often referred to as "Treatment As Usual" (TAU), or a placebo).  Ideally, this blinds the participant to whether or not they are receiving the novel treatment.

The so-called "gold standard" of a double blind study where neither the participant nor the researcher knows who is receiving what are most commonly found in areas like pharmacology where it makes sense.  Then it becomes easier to ensure that the results are not biased by either the participant's expectations or those of the research team.  But you can't always apply a double blind (or even single blind) model for a variety of reasons (we'll come back to this point).  

Ethics and Study Design

But, you can always make a "gold standard" study, right?  Well ... no.  Not really.  

You cannot just treat the public like they are lab rats.  You need this little thing called consent - in other words, just because you have a client in front of you who fits your planned study, it doesn't mean that you can automatically include them as a participant.  That's the first problem study designers face.

Then we come to the ethics of study designs such as "novel treatment versus placebo".  In contexts where the placebo is literally "no treatment", we have to be careful ethically for several reasons.  First, are we denying access to treatment to the patients receiving the placebo?  If we are talking about a headache that will likely resolve itself, giving people a sugar pill instead of the novel treatment is probably a relatively neutral decision.  However, what if the condition being treated is more serious?  Let's say it's a novel medication for a chronic condition that, if left untreated, can be debilitating.  Can you ethically withhold treatment by providing a straight placebo as an alternative to the novel treatment?  Chances are the answer here is "no - not if there is a known effective alternative treatment".

Sometimes you are working with a situation where the effects of the treatment are so pronounced that the participant will be able to deduce quite easily whether or not they are receiving it.  This is absolutely the case when we are talking about medical interventions for transgender people.

For example, let's talk about puberty blockers.  The effects of these medications are so pronounced that the participant will know in fairly short order whether they are getting the treatment or the placebo.  Further, it's not like there is an alternative treatment here - the options are treatment or withholding treatment, which is potentially harmful in a number of ways.  This would make any RCT not only impractical, but arguably unethical.

Rating Scales

Typically, rating scales like the NOS look for a range of things: factors like study design, sample size, controls (a "control" here is often a characteristic used to normalize the data), the degree to which the sample is "representative", and so on.  The idea is that when you are working with data that isn't necessarily ideal, you want to rely on the "highest quality" (most clinically rigorous) studies possible.
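
To make the mechanics concrete, here is a deliberately simplified, hypothetical checklist scorer.  It is not the NOS (whose actual items, and the Cass team's modifications to them, are not reproduced here); it just illustrates how a fixed checklist mechanically penalizes a small sample, no matter how reasonable that sample is for the population being studied.

```python
# A toy, hypothetical quality checklist -- NOT the Newcastle-Ottawa Scale.
# It awards a point per criterion met, which is roughly how such checklists
# behave: a small sample loses a point regardless of context.
from dataclasses import dataclass

@dataclass
class Study:
    randomized: bool
    blinded: bool
    controlled: bool          # uses a comparison/control group
    representative: bool      # sample drawn to reflect the wider population
    sample_size: int

def toy_quality_score(study: Study, min_n: int = 300) -> int:
    """Count how many checklist items a study satisfies (max 5)."""
    checks = [
        study.randomized,
        study.blinded,
        study.controlled,
        study.representative,
        study.sample_size >= min_n,   # fixed threshold, blind to context
    ]
    return sum(checks)

# A careful clinic-based cohort study of a small population still scores "low":
cohort = Study(randomized=False, blinded=False, controlled=True,
               representative=True, sample_size=55)
print(toy_quality_score(cohort))      # 2 out of 5 -> "low quality" on paper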

These scales come out of criticisms, particularly in domains like psychology, where reproducibility of results has been repeatedly flagged as a problem.  To some extent, this is reasonable, but we should be somewhat cautious of the biases inherent in such scales.  In some respects, it's a bit like applying methodologies appropriate to mathematical proofs to a domain filled with random things like people.  Like all tools, it's really important to understand their purpose and applicability.  For example, while both a scalpel and a bow saw cut things, you don't want to use the bow saw when conducting surgery.

In looking at the NOS, I didn’t see anything in it that evaluates the study design against the characteristics of the study group or the ethical considerations involved.  This makes it very easy to conclude that a study is of “weak” data quality when, in fact, it is quite reasonable within the practical constraints of its context.

This is the first of my concerns with the Cass Review.  I don't see any evidence that its authors chose to use these tools appropriately.  What do I mean by "appropriately"?  Well, in a situation where you are working with a very small population, like the transgender community, you cannot simply apply the scale to a study in the absence of contextual analysis.

The Transgender Community and Studies

According to the 2021 Canadian Census, the transgender community is less than 1% of the population.  Of that, only a fraction is going to be seeking active treatment at any given moment in time.  This means that a "good sized sample" for a study may be around 50, and if you happen to be really lucky, you might get up to 100.  Naturally, scales like the NOS are going to flag this as an issue for generalizability.  It's one of the problems with blindly applying an instrument without appropriate analysis of what constitutes a "good sample".
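
As a hedged back-of-the-envelope illustration, consider what the recruitment pool actually looks like under those constraints.  The numbers below are assumptions made purely for the sake of the arithmetic - they are not census figures and not data from any actual clinic.

```python
# Back-of-the-envelope arithmetic with assumed, illustrative numbers --
# not census data and not figures from any real clinic.
catchment_population = 1_000_000   # people served by a hypothetical gender clinic
trans_fraction       = 0.005       # "less than 1%" of the population
in_active_care       = 0.10        # assumed fraction seeking treatment right now
willing_to_enrol     = 0.25        # assumed fraction who consent to participate

pool = catchment_population * trans_fraction * in_active_care * willing_to_enrol
print(f"Plausible recruitment pool: roughly {pool:.0f} people")   # ~125
```

Under even those fairly generous assumptions, a study that actually enrols 50 to 100 participants has done remarkably well - yet a checklist keyed to much larger samples will still flag it.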

Another aspect of the picture here is the long, often sordid, history of "research" involving the transgender community. To put it kindly, a lot of the community is very wary of becoming involved in studies which they perceive as invasive. 

The Cass Review

When looking at the Cass Review, we have to realize that its analysis effectively dismisses the bulk of the extant research on the transgender community by simply declaring much of it "low quality".  This does a great disservice to the validity of that work.

By emphasizing "gold standard" models in research, rather than putting it in a more complete context, the Cass Review's authors mislead readers into thinking that this "low quality" data somehow renders the findings invalid.  In effect, they are demanding a standard of testing that is impractical with the population.  

I’m going to go out on a limb here: I strongly suspect that the people tasked with performing the “systematic reviews” that underpin the Cass Review aren’t particularly experienced in working with the transgender population.  That would have made it fairly easy to convince them that using the NOS was perfectly reasonable, when in fact it is questionable at best.
