In the course of a few days last week, two prominent political figures from different parties, White House Press Secretary Tony Snow and Elizabeth Edwards, wife of Democratic Presidential candidate John Edwards, announced that their cancers (breast cancer in the case of Edwards and colon cancer in the case of Snow), after apparently having been successfully treated two years ago, had recurred and were now metastatic. One of the issues that comes up whenever famous people announce that they have cancer is the question of early detection and why we don’t detect tumors earlier. Indeed, Amy Alkon, a.k.a. The Advice Goddess, asked this very question in the comments of one of my posts linked above:
As for so many cancers — can you talk about why they get as far as they do? Are there any advances being made in detection?
It’s a common assumption (indeed, a seemingly common-sense assumption) that detecting cancer early is always a good thing. Why wouldn’t it be, after all? It turns out that this question is more complicated than you probably think, one that even many doctors have trouble with, and in this post and a followup, I’ll try to explain why.
As it turns out, the very week after learning of Mrs. Edwards’ announcement that her breast cancer had recurred in her bones, meaning that her cancer is now stage IV and incurable, we read for our journal club a rather old article that still has a lot of resonance today. The article, written by William C. Black and H. Gilbert Welch and entitled Advances in Diagnostic Imaging and Overestimations of Disease Prevalence and the Benefits of Therapy, appeared in the New England Journal of Medicine in 1993 but could easily have been written today. All you’d have to do is substitute some of the imaging modalities mentioned in the article, and it would be just as valid now, if not more so. Indeed, this article should be required reading for all physicians and medical students.
The article begins by setting the stage with the essential conflict, which is that increasing sensitivity leads to our detecting abnormalities that may never progress to disease:
Over the past two decades a vast new armamentarium of diagnostic techniques has revolutionized the practice of medicine. The entire human body can now be imaged in exquisite anatomical detail. Computed tomography (CT), magnetic resonance imaging (MRI), and ultrasonography routinely “section” patients into slices less than a centimeter thick. Abnormalities can be detected well before they produce any clinical signs or symptoms. Undoubtedly, these technological advances have enhanced the physician’s potential for understanding disease and treating patients.
Unfortunately, these technological advances also create confusion that may ultimately be harmful to patients. Consider the case of prostate cancer. Although the prevalence of clinically apparent prostate cancer in men 60 to 70 years of age is only about 1 percent, over 40 percent of men in their 60s with normal rectal examinations have been found to have histologic evidence of the disease. Consequently, because the prostate is studied increasingly by transrectal ultrasonography and MRI, which can detect tumors too small to palpate, the reported prevalence of prostate cancer increases. In addition, the increased detection afforded by imaging can confuse the evaluation of therapeutic effectiveness. As the spectrum of detected prostate cancer becomes broader with the addition of tumors too small to palpate, the reported survival from the time of diagnosis improves regardless of the actual effect of the new tests and treatments.
In this article, we explain how advances in diagnostic imaging create confusion in two crucial areas of medical decision making: establishing how much disease there is and defining how well treatment works. Although others have described these effects in the narrow context of mass screening and in a few clinical situations, such as the staging of lung cancer, these consequences of modern imaging increasingly pervade everyday medicine. Besides describing the misperceptions of disease prevalence and therapeutic effectiveness, we explain how the increasing use of sophisticated diagnostic imaging promotes a cycle of increasing intervention that often confers little or no benefit. Finally, we offer suggestions that may minimize these problems.
Note that CT scans and MRI now routinely “section” people into “slices” much thinner than 1 cm, making our imaging sensitivity considerably higher than it was 14 years ago. The essential conflict, at least in the case of cancer, is that far more people develop malignant changes in various organs as they get older than ever develop clinically apparent cancer. The example of prostate cancer is perhaps the most appropriate. If you look at autopsy series of men who died at an age greater than 80, the vast majority (60-80%) of them will have detectable microscopic areas of prostate cancer if their prostates are examined closely enough. Yet prostate cancer obviously didn’t kill them. They lived to a ripe old age and died either of old age or of a cause other than prostate cancer.
Now imagine, if you will, that a test were invented that was nearly 100% sensitive and specific for prostate cancer cells and, moreover, could detect microscopic foci less than 1 mm in diameter. Now imagine applying this test to every 60-year-old man. Somewhere around 40% of them would register a positive result, even though only around 1/40 of those apparent positives would actually have disease that needs any treatment. Yet they would all get biopsies. Many of them would get radiation and/or surgery, simply because we can’t take the chance or because, in our medical-legal climate, watchful waiting and observation to see whether a potential cancer will ever grow at a rate that would make it clinically apparent are a very hard sell, even when they’re the correct approach. After all, we don’t know which of them has disease that will actually threaten their lives. It may well be that expression profiling (a.k.a. gene chip) testing, something that did not exist in 1993, will eventually allow us to sort this question out, but in the meantime we have no way of doing so.
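To put rough numbers on this thought experiment, here’s a minimal back-of-the-envelope sketch in Python. The 40% histologic prevalence and the roughly 1-in-40 figure come from the discussion above; the cohort size and the assumption of an essentially perfect test are mine, purely for illustration:

```python
# Back-of-the-envelope arithmetic for the hypothetical "perfect" prostate test.
# Assumptions (illustrative only): a cohort of 100,000 60-year-old men, a test
# that is essentially 100% sensitive/specific for prostate cancer *cells*,
# ~40% histologic prevalence, and ~1% prevalence of clinically apparent disease.

cohort = 100_000
histologic_prevalence = 0.40      # microscopic foci detectable by the test
clinical_prevalence = 0.01        # disease that would ever need treatment

test_positive = cohort * histologic_prevalence
ever_clinically_significant = cohort * clinical_prevalence
overdetected = test_positive - ever_clinically_significant

print(f"Test positive:             {test_positive:,.0f}")
print(f"Would ever need treatment: {ever_clinically_significant:,.0f}")
print(f"Overdetected (worked up, maybe treated, for nothing): {overdetected:,.0f}")
print(f"Fraction of positives that matter clinically: "
      f"{ever_clinically_significant / test_positive:.1%}")
```

On these admittedly made-up numbers, even a “perfect” test for prostate cancer cells generates roughly 39 workups that can only cause harm for every one that could plausibly help.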
Of the most common diseases, cancer is clearly the one most likely to be overdiagnosed as our detection abilities, whether through increasingly detailed imaging tests or through blood tests, become ever more sensitive. Breast cancer is the other big example besides prostate cancer, but I plan on holding off on that one until Part 2 of this series, simply because the MRI screening guidelines released last week, plus a NEJM paper on that topic, plus the recent case of Elizabeth Edwards, make it an opportune time to give the topic its own post and to discuss the new MRI recommendations in light of the considerations of this post. So instead I’ll look at another example from the article: thyroid cancer. Thyroid cancer is fairly uncommon (although certainly not rare) among cancers, with a prevalence of around 0.1% for clinically apparent cancer in adults between the ages of 50 and 70. Finnish investigators performed an autopsy study in which they sliced thyroid glands at 2.5 mm intervals and found at least one papillary thyroid cancer in 36% of Finnish adults. Doing some calculations, they estimated that, if they were to decrease the width of the “slices,” at a certain point they could “find” papillary cancer in nearly 100% of people between the ages of 50 and 70. This is not such an issue for thyroid cancer, which is uncommon enough that mass screening other than routine physical examination to detect masses is impractical, but for more common tumors it becomes a big consideration, which is why I will turn to breast cancer in the next post.
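A crude way to see why ever-thinner sections “find” ever more cancer is a toy geometric model. This is entirely my own illustration, not the Finnish investigators’ actual calculation: if small occult foci are scattered through the gland, a focus is found only when a section plane happens to pass through it, so narrowing the section interval raises the chance of hitting any given focus. The focus size and number of foci below are assumptions, chosen so that the 2.5 mm case lands near the study’s 36% figure:

```python
# Toy geometric model (mine, not the Finnish investigators' actual method):
# if occult papillary carcinomas roughly 0.5 mm across are scattered through
# the gland, a given focus is transected by section planes spaced s mm apart
# with probability ~ min(0.5 / s, 1), so thinner sections find more of them.

focus_diameter_mm = 0.5          # assumed size of an occult focus
foci_per_affected_gland = 2      # assumed average number of occult foci

for section_interval_mm in (2.5, 1.0, 0.5):
    p_hit_one = min(focus_diameter_mm / section_interval_mm, 1.0)
    p_detect_gland = 1 - (1 - p_hit_one) ** foci_per_affected_gland
    print(f"{section_interval_mm} mm sections: "
          f"~{p_detect_gland:.0%} of affected glands called 'positive'")
```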
The bottom line is that the ever-earlier detection of many diseases, particularly cancer, is not necessarily an unalloyed good. As the detection threshold moves ever earlier in the course of a disease or abnormality, the apparent prevalence of the disease increases, and abnormalities that may never progress to disease are detected with increasing frequency. In other words, the signal-to-noise ratio falls precipitously. This has consequences. At the very least, it leads to more testing, and it may lead us to treat abnormalities that would never have affected the patient, which at best produces needless anxiety and at worst exposes the patient to the risks and complications of treatments that do no good.
This earlier detection can also lead to an overestimation of the efficacy of treatment. The reasons for this are two types of bias in studies known as lead time bias and length bias. In the case of cancer, survival is measured from the time of diagnosis. Consequently, if the tumor is diagnosed at an earlier time in its course through the use of a new advanced screening detection test, the patient’s survival will appear to be longer, even if earlier detection has no real effect on the overall length of survival, as illustrated below:
Unless the rate of progression from a screen-detected abnormality to a clinically detected abnormality is known, it is very difficult to figure out whether treatment of the screen-detected tumor actually improves survival compared to tumors detected later. To do so, the lead time needs to be known and subtracted from the survival of the group with test-based diagnoses. The problem is that the use of more sensitive detection tests usually precedes such knowledge of the true lead time by several years. The adjustment for lead time also assumes that screen-detected tumors will progress at the same rate as those detected later clinically. In reality, the lead time is stochastic: it will be different for different patients, with some tumors progressing rapidly and some slowly. This variability is responsible for a second type of bias, known as length bias.
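A toy simulation makes lead time bias concrete. The tumor biology and the numbers below are invented purely for illustration; the point is only that measuring survival from the date of diagnosis flatters the earlier test even when the date of death does not change by a single day:

```python
import random

random.seed(0)

# Toy model of lead-time bias. Each "patient" has a tumor that becomes
# detectable by an advanced test some years before it becomes clinically
# apparent, and dies at a fixed time after clinical presentation.
# Earlier detection changes nothing about the outcome; it only starts
# the survival clock sooner.

N = 10_000
survival_old, survival_new = [], []
for _ in range(N):
    lead_time = random.uniform(1, 4)                        # years gained by the new test (assumed)
    clinical_dx_to_death = random.uniform(2, 6)             # unaffected by screening
    survival_old.append(clinical_dx_to_death)               # clock starts at clinical diagnosis
    survival_new.append(lead_time + clinical_dx_to_death)   # clock starts at screen detection

print(f"Mean survival from clinical diagnosis: {sum(survival_old) / N:.1f} years")
print(f"Mean survival from screen detection:   {sum(survival_new) / N:.1f} years")
# The second number is ~2.5 years longer even though nobody lives a day longer.
```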
Length bias refers to comparisons that are not adjusted for rate of progression of the disease. The probability of detecting a cancer before it becomes clinically detectable is directly proportional to the length of its preclinical phase, which is inversely proportional to its rate of progression. In other words, slower-progressing tumors have a longer preclinical phase and a better chance of being detected by a screening test before reaching clinical detectability, leading to the disproportionate identification of slowly progressing tumors by screening with newer, more sensitive tests. This concept is illustrated below:
The length of the arrows above represents the length of the detectable preclinical phase, from the time of detectability by the test to clinical detectability. Of six cases of rapidly progressive disease, testing at any single point in time in this hypothetical example would detect only 2 of the 6 tumors, whereas 4 of the 6 slowly progressive tumors would be detected. Worse, the effect of length bias increases as the detection threshold of the test is lowered and the disease spectrum is broadened to include the cases that are progressing most slowly, as shown below:
The top image represents an idealized example of disease developing in a cohort of patients, as detected by two different hypothetical tests, the first being the less sensitive standard test and the second being the “advanced” test, which has a lower threshold of detection. The cases detected by the more sensitive advanced test are represented in the stippled area. The standard test detects only the cases that are rapidly progressive. The new test, however, detects all cases, including the ones that are slowly progressive and, if left alone, would not have killed the patient, who would have died from other causes before the tumor became clinically detectable by the “standard” test. If subjected to the more sensitive test, these latter two patients would be at risk for medical or surgical interventions that would not prolong their lives and that carry the risk of morbidity or even mortality. This is one reason why “screening CT scans” are usually not a good idea, as I wrote about a long time ago.
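Length bias can be sketched the same way. Again, the distributions below are invented; the point is only that a screen performed at a random moment preferentially catches tumors with long preclinical phases, i.e., the slow ones:

```python
import random

random.seed(0)

# Toy model of length bias: tumors with a long preclinical (screen-detectable
# but asymptomatic) phase are more likely to be "caught" by a screen performed
# at a random point in time than fast tumors with a short preclinical phase.

def fraction_caught(preclinical_years, window=10.0, trials=100_000):
    """Chance that a one-time screen during a `window`-year period lands inside
    a tumor's preclinical phase, assuming onset occurs uniformly in that window."""
    caught = 0
    for _ in range(trials):
        onset = random.uniform(0, window)
        screen = random.uniform(0, window)
        if onset <= screen <= onset + preclinical_years:
            caught += 1
    return caught / trials

print(f"Fast tumors (1-year preclinical phase): {fraction_caught(1.0):.0%} caught")
print(f"Slow tumors (5-year preclinical phase): {fraction_caught(5.0):.0%} caught")
# The screen-detected group ends up enriched for slow tumors, which would have
# done well regardless, making the screening test look better than it is.
```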
As the authors state:
Unless one can follow a cohort over time, there is no way of accurately estimating the probability that a subclinically detected abnormality will naturally progress to an adverse outcome. The probability of such an outcome is mathematically constrained, however, by the prevalence of the detected abnormality. The upper limit of this probability can be derived from reasoning that dates to the 17th century, when vital statistics were first collected. If the number of persons dying from a specific disease is fixed, then the probability that a person with the disease will eventually die from it is inversely related to the prevalence of the disease. Therefore, given fixed mortality rates, an increase in the detection of a potentially fatal disease decreases the likelihood that the disease detected in any one person will be fatal.
In other words, early detection makes it appear that fewer people die of the disease, even if treatment has no effect on the progression of the disease. It will also make new treatments introduced after the lower detection threshold takes hold appear more effective:
Lead-time and length biases pertain not only to changes that lower the threshold for detecting disease, but also to new treatments that are applied at the same time. Whether or not new therapy is more effective than old therapy, patients given diagnoses with the use of lower detection thresholds will appear to have better outcomes than their historical controls because of these biases. Consequently, new therapies often appear promising and could even replace older therapies that are more effective or have fewer side effects. Because the decision to treat or to investigate the need for treatment further is increasingly influenced by the results of diagnostic imaging, lead-time and length biases increasingly pervade medical practice.
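The arithmetic behind that “mathematical constraint” is simple enough to spell out. In the sketch below the number of deaths is held fixed while a more sensitive test pulls more cases into the denominator; the specific figures are purely hypothetical:

```python
# If the number of deaths from a disease is fixed, the apparent case fatality
# (deaths / diagnosed cases) must fall as detection pulls in more cases.
# Hypothetical numbers, per 100,000 people per year:

deaths = 30                     # fixed by the disease's actual biology
cases_old_threshold = 100       # cases found with the old, less sensitive test
cases_new_threshold = 400       # cases found once subclinical disease is detected

print(f"Apparent case fatality, old test: {deaths / cases_old_threshold:.0%}")
print(f"Apparent case fatality, new test: {deaths / cases_new_threshold:.1%}")
# 30% drops to 7.5% without a single death prevented; any therapy introduced
# alongside the new test inherits that apparent "improvement."
```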
There is another complication that this increased detection can lead to that wasn’t discussed in the paper, a phenomenon known as stage migration. This occurs when more sophisticated imaging studies or more aggressive surgery lead to the detection of tumor spread that wouldn’t have been noted using earlier means. It produces what is colloquially known in the cancer biz as the Will Rogers phenomenon, after Will Rogers’ famous joke: “When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.” In essence, better technology causes a migration of patients from one stage to another that does the same thing, except with prognosis instead of intelligence. Consider this example. Patients who would formerly have been classified as, say, stage II cancer (any cancer) have, thanks to better imaging or more aggressive surgery, additional disease or metastases detected that wouldn’t have been detected in the past. Under the new conditions they are now classified as stage III, even though in the past they would have been classified as stage II. This leads to the paradoxical statistical effect of making the survival of both groups (stage II and stage III) appear better, without any actual change in the overall survival of the group as a whole. The paradox comes about because the patients who “migrate” to stage III tend to have a lower volume of disease, or less aggressive disease, than the average stage III patient and thus have a better prognosis; adding them to the preexisting stage III patients improves the apparent survival of stage III patients as a group. The converse is that the patients with previously undetected additional disease tended to be the stage II patients who would have recurred, i.e., the worst-prognosis stage II patients. Now that they have “migrated” to stage III, the patients left behind in stage II truly do not have as advanced disease and thus in general have a better prognosis, so the prognosis of the stage II group also appears better.
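A concrete, and entirely hypothetical, set of numbers shows how both stages can “improve” at once without a single patient living longer:

```python
# Hypothetical illustration of stage migration (the Will Rogers phenomenon).
# 5-year survival is a simple mean over each group; all numbers are invented.

def mean(xs):
    return sum(xs) / len(xs)

# Before better imaging: stage II includes some patients with occult metastases
# (the 0.40s below), who drag stage II survival down.
stage_II_old = [0.90, 0.85, 0.80, 0.40, 0.40]   # 5-year survival probabilities
stage_III_old = [0.30, 0.25, 0.20]

# After better imaging: those occult-metastasis patients "migrate" to stage III.
# They are sicker than the old stage II average but healthier than the old
# stage III average, so BOTH groups look better.
stage_II_new = [0.90, 0.85, 0.80]
stage_III_new = [0.40, 0.40, 0.30, 0.25, 0.20]

print(f"Stage II:  {mean(stage_II_old):.0%} -> {mean(stage_II_new):.0%}")
print(f"Stage III: {mean(stage_III_old):.0%} -> {mean(stage_III_new):.0%}")

everyone_old = stage_II_old + stage_III_old
everyone_new = stage_II_new + stage_III_new
print(f"Everyone combined: {mean(everyone_old):.0%} -> {mean(everyone_new):.0%} (unchanged)")
```

Both stages’ apparent survival rises, yet the combined survival of the whole cohort is exactly the same before and after.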
Does all of this mean that we’re fooling ourselves into thinking that we’re doing better in treating cancer? Not at all. It simply means that sorting out “real” effects of new treatments from spurious effects due to these biases is more complicated than it at first seems. For one thing, it points to the importance of matching the experimental groups in clinical trials by stage as closely as possible, using similar tests and imaging modalities to diagnose and measure the disease. These factors are yet another reason why well-controlled clinical trials, with carefully matched groups and clear-cut diagnostic criteria, are critical to practicing evidence-based medicine. It also means that disentangling lead time bias, length bias, and the Will Rogers effect from any true benefit of new treatments can be a complex and messy business. If we as clinicians aren’t careful, it can lead to a cycle of increasing intervention for decreasing disease. If common sense doesn’t prevail (and in the present medical-legal climate, it’s pretty hard to argue against treating any detectable cancerous change), we can reach the point of diminishing returns, or even the point where the interventions cause patients more harm than good. The authors have similarly good advice for dealing with this:
Meanwhile, clinicians can heed the following advice. First, expect the incidence and prevalence of diseases detectable by imaging to increase in the future. Some increases may be predictable on the basis of autopsy studies or other intensive cross-sectional prevalence studies in sample populations. Others may not be so predictable. All types of increases should be expected. The temptation to act aggressively must be tempered by the knowledge that the natural history of a newly detectable disease is unknown. For many diseases, the overall mortality rate has not changed, and the increased prevalence means that the prognosis for any given patient with the diagnosis has actually improved.
Second, expect that advances in imaging will be accompanied by apparent improvements in therapeutic outcomes. The effect of lead-time and length biases may be potent, and clinicians should be skeptical of reported improvements that are based on historical and other comparisons not controlled for the anatomical extent of disease and the rate of progression. Clinicians may even consider that the opposite may be true — i.e., real outcomes may have worsened because of more aggressive interventions.
Finally, consider maintaining conventional clinical thresholds for treating disease until well-controlled trials prove the benefit of doing otherwise. This will require patience. A well-designed randomized clinical trial takes time. So does accumulating enough experience on outcomes from nonexperimental methods that can be used to control for the extent of disease and the rate of progression. From the point of view of both patients and policy, it is time well spent.
These words are just as relevant today as they were 14 years ago. Unfortunately, it is very difficult to convince patients, and even most physicians, that just because we can detect disease at ever lower thresholds doesn’t mean we should, and that just because we can treat cancer at ever earlier time points or ever smaller sizes doesn’t mean we should. For some tumors we clearly need to do better at early detection, but for others spending ever more money and effort to find disease at an earlier time point will yield ever-decreasing returns and may even lead to patient harm. It is likely that each individual tumor will have a different “sweet spot,” where the benefits of early detection most outweigh the risks of excessive intervention.
ADDENDUM: Here’s part 2: Early detection of cancer, part 2: Breast cancer and the never-ending confusion over screening.