
Bipolar Screening Tools (MDQ, BSDS): Validation Studies 

  1. Sensitivity and specificity in 4 studies
  2. Why these varying results? Does it matter?
  3. Which data were used in the analysis for primary care?
  4. Making minor changes in the MDQ test might improve its performance

Test Performance Data

1. Sensitivity and Specificity

Study                                                               Sensitivity   Specificity
MDQ #1   Adult outpatient psychiatry -- Hirschfeld et al 2000           0.73          0.90
MDQ #2   General adult population -- Hirschfeld et al 2003              0.28          0.97
MDQ #3   Adult outpatient psychiatry -- Miller et al 2004 (in press)    0.58          0.67
BSDS #1  Adult outpatient psychiatry -- Ghaemi et al 2004 (in press)    0.76          0.85

(There is a fifth study by Isometsa et al, discussed below, but the authors acknowledge several reasons why their sensitivity/specificity data should not be used to analyze the test overall). 

2. Why these widely varying results?  Does it matter which results one uses?  

The very low sensitivity in Study #2 has raised concern.  Indeed, the validity of the MDQ has been seriously challenged by Zimmerman and colleagues from Brown University, in a Commentary not available online; here are more details.  In brief, they argue that the 0.28 sensitivity is so low that the MDQ should not be used as a screening tool in clinical practice. 

However, it turns out that the low sensitivity has surprisingly little impact on predictive values.  It may not really matter which of the above numbers is the most accurate.  This is not intuitive, so let's look closely.  Here is an analysis of predictive values at low to high prevalence rates, as in the main essay on MDQ interpretation in primary care.  These calculations have been reviewed for accuracy by Dr. Zimmerman.
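
For readers who want to check these calculations themselves, predictive values follow directly from Bayes' rule.  Here is a minimal Python sketch (my own illustration -- the function name and the particular prevalence values swept are not from any of the studies):

    def predictive_values(sens, spec, prev):
        """Positive and negative predictive values via Bayes' rule."""
        tp = sens * prev              # true positives
        fp = (1 - spec) * (1 - prev)  # false positives
        fn = (1 - sens) * prev        # false negatives
        tn = spec * (1 - prev)        # true negatives
        return tp / (tp + fp), tn / (tn + fn)

    # Sweep prevalence from low to high for one dataset (Hirschfeld '00),
    # mirroring the graphs below:
    for prev in (0.05, 0.1, 0.2, 0.3, 0.5):
        ppv, npv = predictive_values(0.73, 0.90, prev)
        print(f"prevalence {prev:0.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")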

[Four graphs: positive and negative predictive value across prevalence rates, one panel per dataset -- MDQ #1: Hirschfeld '00 (sens 0.73, spec 0.90); MDQ #2: Hirschfeld '03 (sens 0.28, spec 0.97); MDQ #3: Miller (sens 0.58, spec 0.67); BSDS #1: Ghaemi (sens 0.76, spec 0.85).]

(Is this right?  How could #3, with better sensitivity than #2, look so much worse? Here's some help with that question, and a worked example below.)
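
Here is one way to see it, as a worked example at a prevalence of 0.1 (my own arithmetic, using the figures above): at low prevalence, positive predictive value is dominated by the false-positive rate (1 - specificity), not by sensitivity.

    # MDQ #2 (sens 0.28, spec 0.97) versus MDQ #3 (sens 0.58, spec 0.67)
    # at a prevalence of 0.1:
    prev = 0.1
    ppv2 = (0.28 * prev) / (0.28 * prev + (1 - 0.97) * (1 - prev))
    ppv3 = (0.58 * prev) / (0.58 * prev + (1 - 0.67) * (1 - prev))
    print(f"MDQ #2 PPV: {ppv2:.2f}")  # ~0.51
    print(f"MDQ #3 PPV: {ppv3:.2f}")  # ~0.16

Per 100 non-bipolar patients, #2's specificity of 0.97 produces only 3 false positives, while #3's 0.67 produces 33; at low prevalence those false positives swamp the true positives, so #3 looks much worse despite catching more true cases.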

Two points emerge from these graphs: 

A.  At low prevalence rates, negative predictive value is high regardless of which data one uses, or even which test.  To be precise: 

Test                                           NPV at prevalence 0.1
MDQ Hirschfeld '00  (sens 0.73, spec 0.90)             0.97
MDQ Hirschfeld '03  (sens 0.28, spec 0.97)             0.92
MDQ Miller          (sens 0.58, spec 0.67)             0.93
BSDS Ghaemi         (sens 0.76, spec 0.85)             0.97
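
These figures can be reproduced with the same arithmetic as in the sketch above (the labels are mine):

    # Negative predictive value at prevalence 0.1:
    # NPV = true negatives / (true negatives + false negatives)
    prev = 0.1
    for label, sens, spec in [("MDQ Hirschfeld '00", 0.73, 0.90),
                              ("MDQ Hirschfeld '03", 0.28, 0.97),
                              ("MDQ Miller",         0.58, 0.67),
                              ("BSDS Ghaemi",        0.76, 0.85)]:
        npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
        print(f"{label}: NPV {npv:.2f}")  # -> 0.97, 0.92, 0.93, 0.97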

B.  At low prevalence rates, positive predictive value is low in all cases, even when sensitivity and specificity are relatively high. 

Some conclusions which seem justified from these two points: 

(A) The MDQ and the BSDS will be useful and accurate for "ruling out" bipolar disorder in patients in whom the doctor did not suspect it anyway -- e.g. if used to address the FDA's call for bipolar screening prior to antidepressant prescription.   The likelihood of "missing" bipolar disorder is low when the doctor's clinical suspicion is low (presuming there is some basis for that opinion).  For low-prevalence "rule out" purposes, the sensitivity/specificity data are not worrisome. 

(B) Similarly, at low prevalence rates (i.e. low "prior probability" estimates, low clinical suspicion), false positives are going to be common -- regardless of which test or sensitivity/specificity data are used.   Primary care doctors are not likely to be able to distinguish between prior probabilities of 0.1 and 0.2.  That would require being able to tell that the patient came from a depressed population in which ten percent have bipolar disorder, versus one in which twenty percent do.  By contrast, distinguishing between "ten percent" and "fifty percent" is perhaps realistic, thus the main analysis for primary care.  Yet shifting prior probability from 0.1 to 0.2 will shift the positive predictive value by roughly 20 points -- as much as shifting from the worst sensitivity/specificity data to the best.  It appears that primary care doctors are going to need some help dealing with positive test results, at least those they did not anticipate --  regardless of which data set we decide is most accurate. 
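
To see the size of that shift, here is the same calculation at priors of 0.1 and 0.2, using the Hirschfeld '00 figures as an illustration (my own arithmetic):

    # PPV when the prior probability moves from 0.1 to 0.2
    # (Hirschfeld '00: sens 0.73, spec 0.90):
    sens, spec = 0.73, 0.90
    for prev in (0.1, 0.2):
        ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
        print(f"prior {prev}: PPV {ppv:.2f}")
    # -> roughly 0.47 at a prior of 0.1 and 0.65 at 0.2: close to the
    #    20-point shift described above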

3. Which data were used in the analysis for primary care?

The original Hirschfeld et al study from 2000 was used.  This study was thought closer in design to a primary care clinical practice than the other study available at that time, the general population screening by Hirschfeld et al 2003. The study by Miller et al was not available at that time.  Hopefully the analysis above demonstrates why re-working the primary care analysis using Miller et al, which may be the most reliable dataset in some ways, is largely moot.  Those who wish to use the Miller et al data should take note of their results when question #3 was ignored (sensitivity rose from 0.58 to 0.78, while specificity dropped from 0.67 to 0.64).  

4. Making minor changes in the MDQ test might improve its performance

Note how the predictive values which emerge in the Miller study, graph #3 above, are much worse than Hirschfeld '03, even though the sensitivity is much better.  The difference is due to the lower specificity.  But Miller et al note that comorbidities, particularly substance use, accounted for many of the false positives.  Several practitioners have suggested including "while not on alcohol or drugs" in question #1 of the MDQ.  This might improve specificity, while probably not lowering sensitivity.  Accordingly, the MDQ one can download from this site incorporates that suggestion.  

One other change in the use of the MDQ, which can be made without significantly altering the test itself (and thereby opening up an entirely new set of questions about its sensitivity and specificity properties), is simply to alter the scoring by accepting "minor" as well as "moderate" and "severe" answers on question three.  Indeed, three research groups have found that the test performs better if one ignores question 3 entirely. Miller et al note that this question accounted for half of their false negatives.  They found that ignoring question three raised the MDQ sensitivity from 0.58 to 0.78.  Further, this improvement in performance was most pronounced in their Bipolar II subgroup, a group likely to be overrepresented in primary care populations being screened prior to antidepressant treatment.  

Similarly, Isometsa et al lowered the cut-off so that "minor" problems count as a yes answer on question three (not just moderate or severe problems), and found this raised both sensitivity and specificity.   The modification also allowed use of an even higher cut-off of eight positives on question 1 without sacrificing sensitivity (see the row for eight "yes" answers in the table below: sensitivity 0.90 with the lowered cut-off, versus 0.85 with standard scoring, at the same specificity of 0.59).  A brief scoring sketch follows the table.  

Number of "yes"        Standard cut-off              Lowered cut-off on Question 3
answers on           Sensitivity  Specificity          Sensitivity  Specificity
question 1
     0                  1.00         0.00                 1.00         0.00
     1                  0.90         0.18                 1.00         0.06
     2                  0.90         0.18                 1.00         0.06
     3                  0.90         0.18                 1.00         0.06
     4                  0.85         0.18                 0.95         0.06
     5                  0.85         0.18                 0.95         0.11
     6                  0.85         0.29                 0.95         0.24
     7                  0.85         0.47                 0.90         0.41
     8                  0.85         0.59                 0.90         0.59
     9                  0.75         0.59                 0.80         0.59
    10                  0.50         0.77                 0.55         0.77
    11                  0.35         0.82                 0.40         0.88
    12                  0.15         0.88                 0.15         0.88
    13                  0.15         0.94                 0.15         0.94
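
To make the two scoring rules in the table concrete, here is a minimal sketch of MDQ scoring with the standard rule (question 3 must be at least "moderate") versus the lowered cut-off ("minor" also counts).  The function and its argument names are my own illustration of the rules described above, not an official scoring algorithm:

    def mdq_screen(q1_yes_count, q2_cooccurred, q3_severity,
                   q1_cutoff=7, accept_minor=False):
        """True for a positive MDQ screen under the chosen rule.

        q1_yes_count  -- "yes" answers among the 13 question-1 items
        q2_cooccurred -- question 2: did several symptoms co-occur?
        q3_severity   -- question 3: 'none', 'minor', 'moderate', 'serious'
        accept_minor  -- the lowered question-3 cut-off discussed above
        """
        accepted = {'moderate', 'serious'}
        if accept_minor:
            accepted.add('minor')
        return (q1_yes_count >= q1_cutoff
                and q2_cooccurred
                and q3_severity in accepted)

    # Standard scoring misses a patient reporting only "minor" problems:
    print(mdq_screen(8, True, 'minor'))                     # False
    # The lowered cut-off counts that patient as a positive screen:
    print(mdq_screen(8, True, 'minor', accept_minor=True))  # True
    # Isometsa's higher question-1 cut-off would simply be q1_cutoff=8.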

Indeed, Dr. Franco Benazzi, a prolific bipolar researcher, studied the sensitivity and specificity of the MDQ when question #3 was ignored entirely and found that the sensitivity was dramatically improved. In particular he found that sensitivity for Bipolar II was poor unless question #3 was omitted. Here is his online Letter to the Editor (Can J Psych).


A "big picture" look at sensitivity/specificity and the MDQ debate 

Remember that our "gold standard" for deriving sensitivity and specificity data is a clinical interview.  Some doctors in other specialties find it laughable (I have direct evidence of this) that we would spend time and energy fussing over validation data as above, when the gold standard from which these data derive is so subjective itself (compared, for example, to validation of a lab test to detect myocardial infarction where the gold standard is autopsy data).  But rather than shrinking from such derision with our statistical tail between our legs, we should remind ourselves about the point of validation studies. 

We just need to remember what the "test" is that we're validating.  The MDQ is a means of asking patients whether they meet the DSM criteria for bipolar disorder.  The main difference between the MDQ and a clinical interview is that no human is present to interpret the responses; otherwise the content of the queries is very similar.  So what we really need to ask ourselves is how to improve the performance of a paper/pencil form of interviewing.

One such improvement is noted above: add "while not on alcohol or drugs" to the lead-in for question #1 on the MDQ.  As clinicians we commonly search out some window in time when the patient was clearly abstinent, and use that period as the reference frame for our questions about hypo/manic symptoms.  Whether simply adding that phrase can "set the frame" as well on the paper/pencil form remains to be seen.

Another improvement for MDQ (or BSDS) sensitivity is similarly obvious from common clinical practice:  find another informant.  Drs. Miller and Ghaemi and colleagues are now incorporating a measure of "insight" into their MDQ/BSDS research.  They find, as one might predict, that patients' lack of insight into their own hypo/mania is a very significant limiting factor for accurate diagnosis, just as it is in a one-on-one patient interview.  Clinicians commonly seek "collateral data", some other source of information on the patient, to circumvent this potential limitation. Thus, one other way to improve MDQ performance is to use it to gather collateral data as well.  One can ask patients' significant others to complete the same form "as though they were the patient".  This has been very illuminating in my own practice on several occasions.  

At the "big picture" level, gains in "sensitivity" by gathering input from significant others; and gains in specificity by being more precise about substance use as a confound;  may be greater than we can achieve from trying to get the right questions or format for a paper/pencil instrument.  In a primary care setting, the MDQ and BSDS appear to be sufficiently precise as screening tools, as long as doctors are wary of positives which appear when prior probability is low (low clinical suspicion), where the test has its lowest predictive value.  (If we wonder about doctors' ability to judge "low prior probability", this leads rather directly into the question of bipolar prevalence in patients with depression, with the attendant controversies over appropriate boundaries for bipolar diagnosis.  Might MDQ validity sometimes be proxy issue for critics whose real concern is the trend toward broad interpretations of "bipolar disorder"?)

In stark contrast, Zimmerman et al's concerns regarding the predictive values of the MDQ appear to be warranted when the test is used for widespread screening of the general population (e.g. Bush 2004), where prevalence will be low and thus the predictive value of positive tests (even using the most favorable sensitivity and specificity data published so far) will be very low.  The risk of doing more harm than good with the test, thus used, is worth considering.