Annex B: Validation File Analysis

B1.0 Introduction

In order to assess file reviewer performance and inter-reviewer agreement, the medical records and ancillary data of 13 randomly selected subjects were used as validation files.  Validation files were reviewed by all three file reviewers, which generated three data collection forms for each validation file (one for each reviewer).  A fourth data collection form for each validation file was generated by the OHS, and this form served as the standard for comparison purposes.   

B2.0 Methods

B2.1 Procedures

For each validation file, a spreadsheet was created that summarized each file reviewer’s performance in relation to the standard for each data field in the data collection form.  The file reviewer data collection forms were reviewed by the OHS, and the individual data fields for each file reviewer were coded with one of five possible “response scores” in the spreadsheet (an illustrative coding sketch follows the list):

  • “Correct”:  the information entered in the data field by the file reviewer matches the information entered in the corresponding data field in the standard;
  • “Data Error”:  the information entered in the data field by the file reviewer does not match the information entered in the corresponding data field in the standard;
  • “Created Data”:  information was entered in a data field by the file reviewer, but the corresponding data field in the standard is blank;
  • “Omitted Data”:  the file reviewer’s data field is blank but the corresponding data field in the standard has information entered in it; and
  • “Not Counted”:  the file reviewer’s data field is linked in some way to a related data field for which the file reviewer’s entry was incorrect.  For example, the results of the most recent pulmonary function test (PFT) just prior to 5 October 2004 were to be entered in the data form.  In addition to the date data field, the pre-fire PFT section of the form had 26 separate data fields to capture the different PFT parameters.  If the file reviewer selected the wrong pre-fire PFT (e.g., selected a PFT dated January 2002 instead of one dated June 2004), then even if the file reviewer abstracted the PFT parameter fields correctly, they would be coded as “data error” since they would not match the PFT parameter fields in the standard.  In these situations, the selection of the wrong item by the file reviewer was only counted as one error:  the date data field was coded as “data error” and the 26 PFT parameter data fields were coded as “not counted”.  In this manner, the file reviewer’s error was captured as a single error and not as 27 errors.
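The response-score coding amounts to a simple decision rule applied to each data field.  The sketch below is only an illustration of that rule, not the actual OHS spreadsheet logic: the function name, the use of simple equality as the match test, and the linked_field_error flag (standing in for the manual “not counted” determination described above) are all assumptions.

```python
# Illustrative sketch of the response-score coding for a single data field.
# None represents a blank field; the match test and flag are assumptions.

def response_score(reviewer_value, standard_value, linked_field_error=False):
    if linked_field_error:
        # e.g., the 26 PFT parameter fields when the wrong pre-fire PFT was selected
        return "not counted"
    if reviewer_value is not None and standard_value is None:
        return "created data"
    if reviewer_value is None and standard_value is not None:
        return "omitted data"
    # Fields blank in both forms fall through as "correct" here; in the study
    # they simply fall outside the accuracy denominator, which counts only
    # fields completed in the standard.
    return "correct" if reviewer_value == standard_value else "data error"

print(response_score("erythromycin", None))                  # created data
print(response_score(None, "erythromycin"))                  # omitted data
print(response_score("pneumonia", "pneumonia"))              # correct
print(response_score("2004-06-14", "2004-01-15",
                     linked_field_error=True))               # not counted
```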

All identified errors (i.e., “data error,” “created data,” and “omitted data”) were further coded as either “minor” or “major.”  This coding was dependent on the judgment of the OHS and was meant to distinguish inconsequential errors (errors that would not significantly impact any future analyses) from errors that could potentially affect future analyses.  The following are examples of distinctions between minor and major errors:

  • A missed diagnosis of a significant medical condition (e.g., Asthma, PTSD, etc.) would count as a major error.  A missed diagnosis of a minor self-limited condition (e.g., self-limited upper respiratory tract infection) would count as a minor error;
  • Any error with respect to the labelling of a result as normal or abnormal was considered a major error (e.g., a complete blood count value that is outside the normal reference range that the file reviewer coded as “normal”);
  • A measured pulmonary function test parameter value that was more than 1% higher or lower than the corresponding standard value was considered a major error;
  • Adding or omitting a significant MEL or SL (i.e., greater than two weeks in length) was considered a major error.  Omitting a three-day SL, for example, was considered a minor error.  Regardless of MEL duration, if “unfit sub” or “unfit sea” was incorrectly added to or omitted from the MEL text field, this counted as a major error; and
  • If the error was obvious and the correct value could easily be deduced from neighbouring fields, then it was counted as a minor error.  For example, a case where the SL reason (text field) was “laryngoscopy” but the corresponding SL diagnostic category (selected from a drop-down menu) was “cardiovascular” would be counted as a minor error.  In this situation, “cardiovascular” is adjacent to “ENT” (a more appropriate selection for a laryngoscopy) in the drop-down menu.  Since it was far more likely that the drop-down menu selection was made incorrectly (as opposed to typing in “laryngoscopy” when one meant to type in a cardiovascular-related reason for SL), the error is obvious and the correct value can easily be deduced.

Each data field in the data collection form was also assigned a “variable type” code in the validation file spreadsheet created by the Principal Investigator.  The different variable types, and the criteria for determining whether a file reviewer’s data field entry was “correct” or a “data error,” are described below; an illustrative sketch of the tolerance rules follows the list:

  • “Date”:  all date variables, whether entered in a specific date data field in the data collection form or in a text field as part of a medical history description.  If the date entered by the file reviewer was within two days of the corresponding date entry in the standard, then the file reviewer’s entry was coded as “correct”.  Otherwise, the file reviewer’s entry was coded as “data error”;
  • “Numeric”:  this variable type was assigned to all data fields that required the direct copying of a numeric value from a subject’s medical records into the data collection form.  Examples of numeric variables include values for laboratory results or pulmonary function test parameters.  File reviewer entries had to match the standard exactly to be coded as “correct”;
  • “Numeric – judgment”:  this variable type was assigned specifically to data fields that represented duration for SL, MELs, and medical categories.  In many cases, these values were not copied directly from a lab report, for example (as was the case for simple “numeric” data fields), but were taken from hand-written notes.  As well, the data collection form required that duration be entered in units of days, whereas duration for SL, MELs, and medical categories was frequently reported in medical records in units of weeks or months.  Because these numeric values required more thought and interpretation on the part of the file reviewer (as opposed to simply copying a value directly), they were coded as “numeric – judgment” in the validation file spreadsheet.  The file reviewer’s entry for a “numeric – judgment” variable had to be within one day of the corresponding standard entry to be coded as “correct”;
  • “Categorical”:  this variable type was assigned to all data fields that required the file reviewer to select a value from a fixed set of choices that existed in the data collection form.  Example data fields were drop-down lists and radio buttons.  The file reviewer’s entry had to match the standard exactly to be coded as “correct”;
  • “Free text – copy”:  this variable type was assigned to all data fields that required the file reviewer to copy the textual information found in a data source by typing it directly into a text data field in the data collection form.  Examples of these types of fields include posting location, medication name, investigation name, etc.  The file reviewer’s entry in a “free text – copy” field was coded as “correct” if one could reasonably determine that the file reviewer’s entry was equivalent to the corresponding standard entry.  In other words, differences between the file reviewer’s entry and the standard due to the use of abbreviations, spelling errors, or other nuances were taken into account and there did not need to be an “exact” match to be coded as “correct”; and
  • “Free text – judgment”:  this variable type was assigned to all text data fields that required the file reviewer to interpret information contained within a medical record and then summarize this information in the data collection form.  For example, medical history text fields required the file reviewer to provide a textual summary of the subject’s medical history.  From the medical history narrative, the Principal Investigator identified specific medical events, and each event was then divided into three distinct variables:  the date of the event was coded as a “date” variable, the diagnosis was coded as a “free text – judgment” variable, and a second “free text – judgment” variable captured additional information, such as treatment.  To illustrate, if the standard description of the event was “16 July 2003, pneumonia, Rx erythromycin” and the file reviewer’s entry was “July 16, 2003: cough, shortness of breath, abnormal x-ray, dx: pneumonia”, then that event would be coded as “correct” for the date variable, “correct” for the diagnosis “free text – judgment” variable, and “omitted data” for the additional information “free text – judgment” variable.  As with the “free text – copy” variables, abbreviations, spelling errors, and other nuances were taken into account when evaluating the file reviewer’s entries.  Other examples of “free text – judgment” variables were the reasons or descriptions for SL, MELs, and medical categories.
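The tolerances above for the mechanically checked variable types can be expressed as a small comparison rule.  The sketch below is an assumption-laden illustration rather than study code: the function name and the date/number representations are invented, and the two free-text variable types are deliberately omitted because they relied on the OHS’s judgment rather than a fixed rule.

```python
from datetime import date

# Illustrative sketch of the "correct" vs. "data error" tolerances described above.
def matches(variable_type, reviewer_value, standard_value):
    if variable_type == "date":
        # date entries within two days of the standard were accepted
        return abs((reviewer_value - standard_value).days) <= 2
    if variable_type == "numeric - judgment":
        # durations (in days) within one day of the standard were accepted
        return abs(reviewer_value - standard_value) <= 1
    if variable_type in ("numeric", "categorical"):
        # directly copied values and fixed-choice selections had to match exactly
        return reviewer_value == standard_value
    raise ValueError("free-text variable types required manual judgment")

print(matches("date", date(2004, 6, 14), date(2004, 6, 16)))   # True  (within two days)
print(matches("numeric - judgment", 14, 16))                   # False (off by two days)
print(matches("numeric", 3.21, 3.21))                          # True  (exact match)
```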

B2.2 Analyses

Each validation file spreadsheet created by the Principal Investigator, Dr. S. Tsekrekos, listed all data fields that were in the data collection form, and had the following columns with coding for each data field:

  • Variable type (e.g., date, numeric, etc.);
  • Present in standard (a binary code indicating if the data field was completed in the standard); and
  • Two columns for each of the three file reviewers:  “Response” (e.g., correct, data error, created data, etc.) and “Error severity” (i.e., minor, major).

In order to derive an accuracy score, the number of data fields scored as “correct” for a file reviewer served as that file reviewer’s numerator.  The denominator was the number of data fields that were present in the standard minus the number of data fields that were “not counted” (see discussion above) for that file reviewer.  To illustrate, suppose a hypothetical standard for a validation file had 500 completed data fields, and a file reviewer’s data collection form for the same validation file had 430 “correct” data fields.  Expanding on the example from the “not counted” discussion above, one of the errors made was on a pulmonary function test (PFT) entry:  the wrong PFT was selected from the medical record, and so the date and all the PFT parameter values did not match the standard.  The PFT date field was coded as “data error” but the 26 PFT parameter fields on the file reviewer’s data collection form were coded as “not counted” (so that the file reviewer’s single error in selecting the wrong PFT was not scored as 27 errors).  Assuming that there were no other “not counted” data fields for this file reviewer, their accuracy score would be 430 / (500 – 26) = 0.907, or 90.7%.

In situations where a file reviewer’s data field was scored as “created data”, the number of “created data” errors was subtracted from the number of “correct” data fields, with no change to the denominator.  Continuing with the preceding example, if the file reviewer also made five “created data” errors, then the accuracy calculation would be:  (430 – 5) / (500 – 26) = 0.897, or 89.7%.
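The accuracy calculation can be written out directly.  The short sketch below simply reproduces the worked example above; the function and argument names are illustrative and are not taken from the study spreadsheet.

```python
# Illustrative sketch of the accuracy score calculation described above.
def accuracy(correct, standard_fields, not_counted, created_data=0):
    numerator = correct - created_data            # "created data" errors reduce the numerator
    denominator = standard_fields - not_counted   # "not counted" fields reduce the denominator
    return numerator / denominator

print(round(accuracy(430, 500, 26) * 100, 1))                   # 90.7
print(round(accuracy(430, 500, 26, created_data=5) * 100, 1))   # 89.7
```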

In order to determine inter-reviewer agreement, response scores (e.g., “correct,” “data error”, etc.) were compared across file reviewers and a file agreement code was assigned:

  • All reviewers agree (all had the same response score);
  • Reviewer 1 and reviewer 2 agree only (reviewer 3 had a different response score);
  • Reviewer 1 and reviewer 3 agree only (reviewer 2 had a different response score);
  • Reviewer 2 and reviewer 3 agree only (reviewer 1 had a different response score); and
  • No reviewers agree (all three reviewers’ response scores differed).

Note that agreement scores were not necessarily directly related to accuracy (in comparison to the standard).  For example, if all three file reviewers had “omitted data” for a data field that was in the standard, then this would be coded as “all reviewers agree”.  In the vast majority of cases, however, all three reviewers were in agreement or two reviewers were in agreement because their data fields were “correct” in relation to the standard.
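Assigning the agreement code is a straightforward comparison of the three response scores.  The sketch below is illustrative only (the function name and label strings are paraphrased from the list above); it also shows the point made in the preceding paragraph, namely that agreement is assessed on the response scores themselves rather than on accuracy.

```python
# Illustrative sketch of the inter-reviewer agreement coding for one data field.
def agreement_code(r1, r2, r3):
    if r1 == r2 == r3:
        return "all reviewers agree"
    if r1 == r2:
        return "reviewer 1 and reviewer 2 agree only"
    if r1 == r3:
        return "reviewer 1 and reviewer 3 agree only"
    if r2 == r3:
        return "reviewer 2 and reviewer 3 agree only"
    return "no reviewers agree"

# Three identical "omitted data" scores still count as full agreement,
# even though none of the reviewers matched the standard.
print(agreement_code("omitted data", "omitted data", "omitted data"))  # all reviewers agree
print(agreement_code("correct", "data error", "omitted data"))         # no reviewers agree
```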

B3.0 Results

B3.1 File Reviewer Accuracy

Table B1 summarizes the spreadsheet coding for each validation file.  Unless indicated otherwise, the numbers in Table B1 represent the number of data fields.  Accuracy is expressed as a percentage and is simply the number of “correct” responses divided by the denominator.  Note that the denominators differ from the number of data fields in the standard because of the presence of “not counted” data fields, as discussed above.  The number of major errors is also provided in the table and ranged from 0 to 9 across all reviewers and all validation files.  When expressed as a percentage of total errors (i.e., total errors equals the denominator minus the number of “correct” responses), the proportion of errors that were “major” ranged from 0% to 17.2% for a given validation file.

With respect to average accuracy, when all validation files were equally weighted (i.e., the simple average of the 13 validation files’ accuracies), the average accuracies for Reviewers 1, 2, and 3 were 97.0%, 92.6%, and 90.9%, respectively.  Because the validation files had different numbers of completed data fields in the standard (from a low of 347 to a high of 872), a more appropriate measure of “total accuracy” is the total number of correct responses divided by the total of the denominators when all validation file results are added together.  With this approach, the total accuracies were 96.8%, 92.5%, and 90.7%, respectively.  When all three reviewers were combined using this “total accuracy” approach, the overall average accuracy was 93.3%.  Considering the substantial time required to review a subject’s medical record (up to two to three days in some cases), and the number of pieces of information that the file reviewer was required to abstract into the data collection forms, this level of accuracy was considered acceptable.
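The difference between the two averaging approaches is purely arithmetic.  The toy illustration below uses hypothetical numbers (not the study data) to show why pooling the counts weights larger files more heavily than a simple average of per-file accuracies.

```python
# Hypothetical example: one small and one large validation file.
correct = [340, 800]        # "correct" data fields per file
denominator = [350, 900]    # accuracy denominators per file

simple_average = sum(c / d for c, d in zip(correct, denominator)) / len(correct)
pooled_total = sum(correct) / sum(denominator)

print(round(simple_average * 100, 1))  # 93.0  (each file weighted equally)
print(round(pooled_total * 100, 1))    # 91.2  (the larger file carries more weight)
```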

Table B1: File Reviewer performance for all 13 validation files*
Validation File Number | Standard Data Fields | Performance | Reviewer 1 | Reviewer 2 | Reviewer 3
1 | 573 | Correct | 536 | 449 | 451
  |     | Denominator | 554 | 504 | 508
  |     | Accuracy (%) | 96.8% | 89.1% | 88.8%
  |     | Major Errors | 0 | 4 | 1
2 | 416 | Correct | 410 | 388 | 377
  |     | Denominator | 415 | 407 | 400
  |     | Accuracy (%) | 98.8% | 95.3% | 94.3%
  |     | Major Errors | 0 | 1 | 1
3 | 848 | Correct | 807 | 661 | 653
  |     | Denominator | 829 | 737 | 739
  |     | Accuracy (%) | 97.3% | 89.7% | 88.4%
  |     | Major Errors | 0 | 2 | 2
4 | 794 | Correct | 716 | 687 | 648
  |     | Denominator | 756 | 735 | 716
  |     | Accuracy (%) | 94.7% | 93.5% | 90.5%
  |     | Major Errors | 0 | 3 | 5
5 | 842 | Correct | 756 | 626 | 635
  |     | Denominator | 799 | 719 | 723
  |     | Accuracy (%) | 94.6% | 87.1% | 87.8%
  |     | Major Errors | 0 | 5 | 3
6 | 411 | Correct | 371 | 319 | 312
  |     | Denominator | 380 | 353 | 357
  |     | Accuracy (%) | 97.6% | 90.4% | 87.4%
  |     | Major Errors | 0 | 2 | 0
7 | 362 | Correct | 356 | 321 | 285
  |     | Denominator | 362 | 341 | 310
  |     | Accuracy (%) | 98.3% | 94.1% | 91.9%
  |     | Major Errors | 0 | 1 | 1
8 | 821 | Correct | 765 | 722 | 697
  |     | Denominator | 794 | 776 | 760
  |     | Accuracy (%) | 96.3% | 93.0% | 91.7%
  |     | Major Errors | 5 | 1 | 3
9 | 872 | Correct | 799 | 769 | 709
  |     | Denominator | 829 | 821 | 797
  |     | Accuracy (%) | 96.4% | 93.7% | 89.0%
  |     | Major Errors | 2 | 5 | 9
10 | 837 | Correct | 800 | 756 | 736
  |     | Denominator | 819 | 795 | 786
  |     | Accuracy (%) | 97.7% | 95.1% | 93.6%
  |     | Major Errors | 0 | 0 | 0
11 | 347 | Correct | 334 | 302 | 293
  |     | Denominator | 342 | 317 | 313
  |     | Accuracy (%) | 97.7% | 95.3% | 93.6%
  |     | Major Errors | 0 | 1 | 1
12 | 472 | Correct | 432 | 383 | 402
  |     | Denominator | 448 | 420 | 438
  |     | Accuracy (%) | 96.4% | 91.2% | 91.8%
  |     | Major Errors | 0 | 0 | 0
13 | 565 | Correct | 539 | 518 | 514
  |     | Denominator | 547 | 539 | 553
  |     | Accuracy (%) | 98.5% | 96.1% | 92.9%
  |     | Major Errors | 0 | 0 | 1

* Unless otherwise noted, the numbers in the table represent the number of data fields.

As mentioned in section 2.5.4, if file reviewers were second or third in line to review the validation file, they would be aware that their work would be evaluated.  This may have resulted in the second and third file reviewers completing the form with more care and precision than they normally would.  If such an effect were present, the validation file assessment of the file reviewer accuracy would overestimate the true accuracy of the file reviewers. 

In order to assess for this “accuracy bias”, the performance of the file reviewer when they were a first (“blinded”) reviewer of a validation file was compared to their performance when they were a second or third reviewer on a validation file (“unblinded”).  The results are summarized in Table B2.  If an accuracy bias was present, it seems to have had only a minimal effect and was largely restricted to Reviewer 2, who had a 3% higher average accuracy when “unblinded”, as compared to a 0.6% higher average accuracy for Reviewer 1 and a 1.9% lower average accuracy for Reviewer 3.  When the performance of all three file reviewers was considered together, the average accuracy differed by only 0.2% between the “blinded” first review (93.3%) and the “unblinded” second or third review (93.5%).  These results suggest that the validation file accuracy scores were a reasonable reflection of the accuracy of the file reviewers over the course of the entire study when reviewing non-validation files.

Table B2: Comparison of File Reviewer performance as a function of the order in which they reviewed the file:  files where the reviewer was the first to review (“blinded”) versus files where the reviewer was the second or third reviewer (“unblinded”)
Reviewer | Review Order | Validation File Numbers | Total Correct | Total Denominator | Accuracy Score (%)
Reviewer 1 | 1st (“blinded”) | 1, 2, 4, 9, 12 | 2893 | 3002 | 96.4%
Reviewer 1 | 2nd or 3rd (“unblinded”) | 3, 5, 6, 7, 8, 10, 11, 13 | 4728 | 4872 | 97.0%
Reviewer 2 | 1st (“blinded”) | 3, 5, 7, 11 | 1910 | 2114 | 90.4%
Reviewer 2 | 2nd or 3rd (“unblinded”) | 1, 2, 4, 6, 8, 9, 10, 12, 13 | 4991 | 5350 | 93.3%
Reviewer 3 | 1st (“blinded”) | 6, 8, 10, 13 | 2259 | 2456 | 92.0%
Reviewer 3 | 2nd or 3rd (“unblinded”) | 1, 2, 3, 4, 5, 7, 9, 11, 12 | 4453 | 4944 | 90.1%

Similar to the effect of validation file review order, the effect of file complexity was also assessed.  A marker of validation file complexity was the number of data fields that were completed in the standard.  The greater the number of data fields, the greater the amount of information that had to be abstracted from subject medical records and ancillary information.  This would have increased the time required to complete the data collection form, and fatigue or decreased concentration over time may have had a negative impact on file reviewer accuracy.

In order to assess the effect of file complexity, reviewer performance was compared between the three validation files with the fewest standard data fields and the three validation files with the most standard data fields.  The results are shown in Table B3.  As predicted, the average accuracy score for the files with the highest number of standard data fields was slightly less (1.8 to 2.9%) than the average accuracy score for the files with the lowest number of standard data fields, and this was a consistent finding for all three file reviewers.  When the performance of all three file reviewers was considered together, the average accuracy was 94.1% for the three validation files with the fewest standard data fields, which was 2.4% greater than the average accuracy of 91.7% for the three validation files with the most standard data fields.

Table B3: Comparison of File Reviewer performance as a function of the length/complexity of the validation file, based on the number of data fields
Reviewer | File Group | Validation File Numbers | Total Correct | Total Denominator | Accuracy Score (%)
Reviewer 1 | Fewest Data Fields | 6, 7, 11 | 1061 | 1084 | 97.9%
Reviewer 1 | Most Data Fields | 3, 5, 9 | 2362 | 2457 | 96.1%
Reviewer 2 | Fewest Data Fields | 6, 7, 11 | 942 | 1011 | 93.2%
Reviewer 2 | Most Data Fields | 3, 5, 9 | 2056 | 2277 | 90.3%
Reviewer 3 | Fewest Data Fields | 6, 7, 11 | 890 | 980 | 90.8%
Reviewer 3 | Most Data Fields | 3, 5, 9 | 1997 | 2259 | 88.4%

The possibility of learning effects on file reviewer accuracy was also assessed.  The average accuracy of the first three validation files that a file reviewer completed was compared to that of the last three validation files the reviewer completed.  Depending on the file reviewer, there was a seven- to twelve-month separation between the first and last validation files reviewed.

The results are shown in Table B4 and suggest a slight learning effect across all three file reviewers, most pronounced in Reviewer 2.  When the performance of all three file reviewers was considered together, the average accuracy was 92.2% for the first three validation files reviewed and 93.1% for the last three validation files reviewed.

Table B4: Comparison of File Reviewer performance over time:  first three validation files reviewed versus the last three validation files reviewed
Reviewer | File Group | Validation File Numbers | Total Correct | Total Denominator | Accuracy Score (%)
Reviewer 1 | First Files | 1, 2, 4 | 1662 | 1725 | 96.3%
Reviewer 1 | Last Files | 11, 12, 13 | 1305 | 1337 | 97.6%
Reviewer 2 | First Files | 1, 2, 3 | 1498 | 1648 | 90.9%
Reviewer 2 | Last Files | 11, 12, 13 | 1203 | 1276 | 94.3%
Reviewer 3 | First Files | 5, 6, 7 | 1232 | 1390 | 88.6%
Reviewer 3 | Last Files | 1, 3, 4 | 1752 | 1963 | 89.3%

File reviewer accuracy was also influenced by the type of variable (e.g., “date,” “numeric”, etc.).  This is summarized in Table B5.  Not unexpectedly, data fields that required more thought or interpretation on the part of the file reviewer (i.e., “numeric – judgment,” “text – judgment”) were somewhat more prone to error than data fields that required direct copying or selection from a fixed set of choices (i.e., “numeric,” “categorical,” “text – copy”), as indicated by the accuracy scores for the different variable types.

“Date” variables had the lowest accuracy scores, but this is not simply a reflection of inaccuracy on the part of the file reviewers in copying date information.  This accuracy score also takes into account omitted and created data errors that involved multiple related data fields (e.g., lab results such as complete blood counts, or investigations such as pulmonary function tests); if the wrong laboratory or investigation result was entered into the data collection form, that error was scored against the date variable.

Table B5: Overall File Reviewer performance for different variable types
Variable Type | Reviewer 1 (Average Accuracy Score, %) | Reviewer 2 (Average Accuracy Score, %) | Reviewer 3 (Average Accuracy Score, %) | All Reviewers Combined (Average Accuracy Score, %)
Date | 93.02 | 83.56 | 81.93 | 86.17
Numeric | 99.24 | 98.22 | 93.79 | 97.08
Numeric – judgment | 92.14 | 90.32 | 91.91 | 91.46
Categorical | 97.39 | 95.76 | 95.14 | 96.10
Text – copy | 98.87 | 93.18 | 93.25 | 95.10
Text – judgment | 96.83 | 91.04 | 90.44 | 92.77

B3.2 File Reviewer Agreement

File reviewer response score agreement for each of the validation files is summarized in Table B6.  On average, for just over 73% of data fields, all three file reviewers had the same response score.  For the vast majority of validation file data fields, this 100% agreement occurred because all three file reviewers had the correct data field entry as compared to the standard.  On rare occasions, 100% agreement occurred because all three file reviewers had the same type of error for a particular data field (e.g., “omitted data”).

Less than 1% of the total data fields for all validation files combined had no agreement between the three file reviewers (e.g., a situation where the three reviewers’ response scores for a particular data field were “correct,” “data error”, and “omitted data”).

The results of the agreement analyses suggest that, overall, the three file reviewers abstracted information from medical files and ancillary information in a similar manner.

Table B6: Summary of the File Reviewer agreement for all 13 validation files
File Reviewer Response Score Agreement (% of data fields)*
Validation File Number | 100% Agreement (3/3 with same response score) | 67% Agreement (2/3 with same response score) | No Agreement (different response score for all 3)
1 | 62.1% | 36.7% | 1.2%
2 | 86.1% | 14.0% | 0%
3 | 70.6% | 28.5% | 0.8%
4 | 71.8% | 27.9% | 0.3%
5 | 69.0% | 28.7% | 2.3%
6 | 58.3% | 41.4% | 0.2%
7 | 75.4% | 24.6% | 0%
8 | 76.7% | 22.3% | 1.0%
9 | 71.3% | 28.0% | 0.7%
10 | 81.5% | 18.0% | 0.6%
11 | 72.1% | 26.7% | 1.1%
12 | 71.7% | 26.6% | 1.7%
13 | 81.9% | 18.1% | 0%
All Combined | 73.1% | 26.1% | 0.8%

* Rows might not total to exactly 100% due to rounding
