Annex B: Validation File Analysis

B1.0 Introduction

In order to assess file reviewer performance and inter-reviewer agreement, the medical records and ancillary data of 13 randomly selected subjects were used as validation files.  Validation files were reviewed by all three file reviewers, which generated three data collection forms for each validation file (one for each reviewer).  A fourth data collection form for each validation file was generated by the OHS, and this form served as the standard for comparison purposes.   

B2.0 Methods

B2.1 Procedures

For each validation file, a spreadsheet was created that summarized each file reviewer’s performance in relation to the standard for each data field in the data collection form.  The file reviewer data collection forms were reviewed by the OHS, and the individual data fields for each file reviewer were coded with one of five possible “response scores” in the spreadsheet (an illustrative coding sketch follows the list):

  • “Correct”:  the information entered in the data field by the file reviewer matches the information entered in the corresponding data field in the standard;
  • “Data Error”:  the information entered in the data field by the file reviewer does not match the information entered in the corresponding data field in the standard;
  • “Created Data”:  information was entered in a data field by the file reviewer, but the corresponding data field in the standard is blank;
  • “Omitted Data”:  the file reviewer’s data field is blank but the corresponding data field in the standard has information entered in it; and
  • “Not Counted”:  the file reviewer’s data field is linked in some way to a related data field for which the file reviewer’s entry was incorrect.  For example, the results of the most recent pulmonary function test (PFT) just prior to 5 October 2004 were to be entered in the data form.  In addition to the date data field, the pre-fire PFT section of the form had 26 separate data fields to capture the different PFT parameters.  If the file reviewer selected the wrong pre-fire PFT (e.g., selected a PFT dated January 2002 instead of one dated June 2004), then even if the file reviewer abstracted the PFT parameter fields correctly, they would be coded as “data error” since they would not match the PFT parameter fields in the standard.  In these situations, the selection of the wrong item by the file reviewer was only counted as one error:  the date data field was coded as “data error” and the 26 PFT parameter data fields were coded as “not counted”.  In this manner, the file reviewer’s error was captured as a single error and not as 27 errors.
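The response-score coding amounts to a simple decision rule applied to each data field.  The sketch below is only an illustration of that rule, not the actual OHS spreadsheet logic: the function name, the use of simple equality as the match test, and the linked_field_error flag (standing in for the manual “not counted” determination described above) are all assumptions.

```python
# Illustrative sketch of the response-score coding for a single data field.
# None represents a blank field; the match test and flag are assumptions.

def response_score(reviewer_value, standard_value, linked_field_error=False):
    if linked_field_error:
        # e.g., the 26 PFT parameter fields when the wrong pre-fire PFT was selected
        return "not counted"
    if reviewer_value is not None and standard_value is None:
        return "created data"
    if reviewer_value is None and standard_value is not None:
        return "omitted data"
    # Fields blank in both forms fall through as "correct" here; in the study
    # they simply fall outside the accuracy denominator, which counts only
    # fields completed in the standard.
    return "correct" if reviewer_value == standard_value else "data error"

print(response_score("erythromycin", None))                  # created data
print(response_score(None, "erythromycin"))                  # omitted data
print(response_score("pneumonia", "pneumonia"))              # correct
print(response_score("2004-06-14", "2004-01-15",
                     linked_field_error=True))               # not counted
```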

All identified errors (i.e., “data error,” “created data,” and “omitted data”) were further coded as either “minor” or “major.”  This coding was dependent on the judgment of the OHS and was meant to distinguish inconsequential errors (errors that would not significantly impact any future analyses) from errors that could potentially affect future analyses.  The following are examples of distinctions between minor and major errors:

  • A missed diagnosis of a significant medical condition (e.g., Asthma, PTSD, etc.) would count as a major error.  A missed diagnosis of a minor self-limited condition (e.g., self-limited upper respiratory tract infection) would count as a minor error;
  • Any error with respect to the labelling of a result as normal or abnormal was considered a major error (e.g., a complete blood count value that is outside the normal reference range that the file reviewer coded as “normal”);
  • A measured pulmonary function test parameter value that was more than 1% higher or lower than the corresponding standard value was considered a major error;
  • Adding or omitting a significant MEL or SL (i.e., greater than two weeks in length) was considered a major error.  Omitting a three-day SL, for example, was considered a minor error.  Regardless of MEL duration, if “unfit sub” or “unfit sea” was incorrectly added to or omitted from the MEL text field, this counted as a major error; and
  • If the error was obvious and the correct value could easily be deduced from neighbouring fields, then it was counted as a minor error.  For example, a case where the SL reason (text field) was “laryngoscopy” but the corresponding SL diagnostic category (selected from a drop-down menu) was “cardiovascular” would be counted as a minor error.  In this situation, “cardiovascular” is adjacent to “ENT” (a more appropriate selection for a laryngoscopy) in the drop-down menu.  Since it was far more likely that the drop-down menu selection was made incorrectly (as opposed to typing in “laryngoscopy” when one meant to type in a cardiovascular-related reason for SL), the error is obvious and the correct value can easily be deduced.

Each data field in the data collection form was also assigned a “variable type” code in the validation file spreadsheet created by the Principal Investigator.  The different variable types, and the criteria for determining whether a file reviewer’s data field entry was “correct” or a “data error,” are described below; an illustrative sketch of the tolerance rules follows the list:

  • “Date”:  all date variables, whether entered in a specific date data field in the data collection form or in a text field as part of a medical history description.  If the date entered by the file reviewer was within two days of the corresponding date entry in the standard, then the file reviewer’s entry was coded as “correct”.  Otherwise, the file reviewer’s entry was coded as “data error”;
  • “Numeric”:  this variable type was assigned to all data fields that required the direct copying of a numeric value from a subject’s medical records into the data collection form.  Examples of numeric variables include values for laboratory results or pulmonary function test parameters.  File reviewer entries had to match the standard exactly to be coded as “correct”;
  • “Numeric – judgment”:  this variable type was assigned specifically to data fields that represented duration for SL, MELs, and medical categories.  In many cases, these values were not copied directly from a lab report, for example (as was the case for simple “numeric” data fields), but were taken from hand-written notes.  As well, the data collection form required that duration be entered in units of days, whereas duration for SL, MELs, and medical categories was frequently reported in medical records in units of weeks or months.  Because these numeric values required more thought and interpretation on the part of the file reviewer (as opposed to simply copying a value directly), they were coded as “numeric – judgment” in the validation file spreadsheet.  The file reviewer’s entry for a “numeric – judgment” variable had to be within one day of the corresponding standard entry to be coded as “correct”;
  • “Categorical”:  this variable type was assigned to all data fields that required the file reviewer to select a value from a fixed set of choices that existed in the data collection form.  Example data fields were drop-down lists and radio buttons.  The file reviewer’s entry had to match the standard exactly to be coded as “correct”;
  • “Free text – copy”:  this variable type was assigned to all data fields that required the file reviewer to copy the textual information found in a data source by typing it directly into a text data field in the data collection form.  Examples of these types of fields include posting location, medication name, investigation name, etc.  The file reviewer’s entry in a “free text – copy” field was coded as “correct” if one could reasonably determine that the file reviewer’s entry was equivalent to the corresponding standard entry.  In other words, differences between the file reviewer’s entry and the standard due to the use of abbreviations, spelling errors, or other nuances were taken into account and there did not need to be an “exact” match to be coded as “correct”; and
  • “Free text – judgment”:  this variable type was assigned to all text data fields that required the file reviewer to interpret information contained within a medical record and then summarize this information in the data collection form.  For example, medical history text fields required the file reviewer to provide a textual summary of the subject’s medical history.  From the medical history narrative, the Principal Investigator identified specific medical events, and each event was then divided into three distinct variables:  the date of the event was coded as a “date” variable, the diagnosis was coded as a “free text – judgment” variable, and a second “free text – judgment” variable captured additional information, such as treatment.  To illustrate, if the standard description of the event was “16 July 2003, pneumonia, Rx erythromycin” and the file reviewer’s entry was “July 16, 2003: cough, shortness of breath, abnormal x-ray, dx: pneumonia”, then that event would be coded as “correct” for the date variable, “correct” for the diagnosis “free text – judgment” variable, and “omitted data” for the additional information “free text – judgment” variable.  As with the “free text – copy” variables, abbreviations, spelling errors, and other nuances were taken into account when evaluating the file reviewer’s entries.  Other examples of “free text – judgment” variables were the reasons or descriptions for SL, MELs, and medical categories.
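The tolerances above for the mechanically checked variable types can be expressed as a small comparison rule.  The sketch below is an assumption-laden illustration rather than study code: the function name and the date/number representations are invented, and the two free-text variable types are deliberately omitted because they relied on the OHS’s judgment rather than a fixed rule.

```python
from datetime import date

# Illustrative sketch of the "correct" vs. "data error" tolerances described above.
def matches(variable_type, reviewer_value, standard_value):
    if variable_type == "date":
        # date entries within two days of the standard were accepted
        return abs((reviewer_value - standard_value).days) <= 2
    if variable_type == "numeric - judgment":
        # durations (in days) within one day of the standard were accepted
        return abs(reviewer_value - standard_value) <= 1
    if variable_type in ("numeric", "categorical"):
        # directly copied values and fixed-choice selections had to match exactly
        return reviewer_value == standard_value
    raise ValueError("free-text variable types required manual judgment")

print(matches("date", date(2004, 6, 14), date(2004, 6, 16)))   # True  (within two days)
print(matches("numeric - judgment", 14, 16))                   # False (off by two days)
print(matches("numeric", 3.21, 3.21))                          # True  (exact match)
```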

B2.2 Analyses

Each validation file spreadsheet created by the Principal Investigator, Dr. S. Tsekrekos, listed all data fields that were in the data collection form, and had the following columns with coding for each data field:

  • Variable type (e.g., date, numeric, etc.);
  • Present in standard (a binary code indicating if the data field was completed in the standard); and
  • Two columns for each of the three file reviewers:  “Response” (e.g., correct, data error, created data, etc.) and “Error severity” (i.e., minor, major).

In order to derive an accuracy score, the number of data fields scored as “correct” for a file reviewer served as that file reviewer’s numerator.  The denominator was the number of data fields that were present in the standard minus the number of data fields that were “not counted” (see discussion above) for that file reviewer.  To illustrate, suppose a hypothetical standard for a validation file had 500 completed data fields, and a file reviewer’s data collection form for the same validation file had 430 “correct” data fields.  Expanding on the example from the “not counted” discussion above, one of the errors made was on a pulmonary function test (PFT) entry:  the wrong PFT was selected from the medical record, and so the date and all the PFT parameter values did not match the standard.  The PFT date field was coded as “data error” but the 26 PFT parameter fields on the file reviewer’s data collection form were coded as “not counted” (so that the file reviewer’s single error in selecting the wrong PFT was not scored as 27 errors).  Assuming that there were no other “not counted” data fields for this file reviewer, their accuracy score would be 430 / (500 – 26) = 0.907, or 90.7%.

In situations where a file reviewer’s data field was scored as “created data”, the number of “created data” errors was subtracted from the number of “correct” data fields, with no change to the denominator.  Continuing with the preceding example, if the file reviewer also made five “created data” errors, then the accuracy calculation would be:  (430 – 5) / (500 – 26) = 0.897, or 89.7%.
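The accuracy calculation can be written out directly.  The short sketch below simply reproduces the worked example above; the function and argument names are illustrative and are not taken from the study spreadsheet.

```python
# Illustrative sketch of the accuracy score calculation described above.
def accuracy(correct, standard_fields, not_counted, created_data=0):
    numerator = correct - created_data            # "created data" errors reduce the numerator
    denominator = standard_fields - not_counted   # "not counted" fields reduce the denominator
    return numerator / denominator

print(round(accuracy(430, 500, 26) * 100, 1))                   # 90.7
print(round(accuracy(430, 500, 26, created_data=5) * 100, 1))   # 89.7
```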

In order to determine inter-reviewer agreement, response scores (e.g., “correct,” “data error”, etc.) were compared across file reviewers and a file agreement code was assigned:

  • All reviewers agree (all had the same response score);
  • Reviewer 1 and reviewer 2 agree only (reviewer 3 had a different response score);
  • Reviewer 1 and reviewer 3 agree only (reviewer 2 had a different response score);
  • Reviewer 2 and reviewer 3 agree only (reviewer 1 had a different response score); and
  • No reviewers agree (all three reviewers’ response scores differed).

Note that agreement scores were not necessarily directly related to accuracy (in comparison to the standard).  For example, if all three file reviewers had “omitted data” for a data field that was in the standard, then this would be coded as “all reviewers agree”.  In the vast majority of cases, however, all three reviewers were in agreement or two reviewers were in agreement because their data fields were “correct” in relation to the standard.
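Assigning the agreement code is a straightforward comparison of the three response scores.  The sketch below is illustrative only (the function name and label strings are paraphrased from the list above); it also shows the point made in the preceding paragraph, namely that agreement is assessed on the response scores themselves rather than on accuracy.

```python
# Illustrative sketch of the inter-reviewer agreement coding for one data field.
def agreement_code(r1, r2, r3):
    if r1 == r2 == r3:
        return "all reviewers agree"
    if r1 == r2:
        return "reviewer 1 and reviewer 2 agree only"
    if r1 == r3:
        return "reviewer 1 and reviewer 3 agree only"
    if r2 == r3:
        return "reviewer 2 and reviewer 3 agree only"
    return "no reviewers agree"

# Three identical "omitted data" scores still count as full agreement,
# even though none of the reviewers matched the standard.
print(agreement_code("omitted data", "omitted data", "omitted data"))  # all reviewers agree
print(agreement_code("correct", "data error", "omitted data"))         # no reviewers agree
```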

B3.0 Results

B3.1 File Reviewer Accuracy

Table B1 summarizes the spreadsheet coding for each validation file.  Unless indicated otherwise, the numbers in Table B1 represent the number of data fields.  Accuracy is expressed as a percentage and is simply the number of “correct” responses divided by the denominator.  Note that the denominators differ from the number of data fields in the standard because of the presence of “not counted” data fields, as discussed above.  The number of major errors is also provided in the table and ranged from 0 to 9 across all reviewers and all validation files.  When expressed as a percentage of total errors (i.e., total errors equals the denominator minus the number of “correct” responses), the proportion of errors that were “major” ranged from 0% to 17.2% for a given validation file.

With respect to average accuracy, when all validation files were equally weighted (i.e., the simple average of the 13 validation files’ accuracies), the average accuracies for Reviewers 1, 2, and 3 were 97.0%, 92.6%, and 90.9%, respectively.  Because the validation files had different numbers of completed data fields in the standard (from a low of 347 to a high of 872), a more appropriate measure of “total accuracy” is the total number of correct responses divided by the total of the denominators when all validation file results are added together.  With this approach, the total accuracies were 96.8%, 92.5%, and 90.7%, respectively.  When all three reviewers were combined using this “total accuracy” approach, the overall average accuracy was 93.3%.  Considering the substantial time required to review a subject’s medical record (up to two to three days in some cases), and the number of pieces of information that the file reviewer was required to abstract into the data collection forms, this level of accuracy was considered acceptable.
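The difference between the two averaging approaches is purely arithmetic.  The toy illustration below uses hypothetical numbers (not the study data) to show why pooling the counts weights larger files more heavily than a simple average of per-file accuracies.

```python
# Hypothetical example: one small and one large validation file.
correct = [340, 800]        # "correct" data fields per file
denominator = [350, 900]    # accuracy denominators per file

simple_average = sum(c / d for c, d in zip(correct, denominator)) / len(correct)
pooled_total = sum(correct) / sum(denominator)

print(round(simple_average * 100, 1))  # 93.0  (each file weighted equally)
print(round(pooled_total * 100, 1))    # 91.2  (the larger file carries more weight)
```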

Table B1: File Reviewer performance for all 13 validation files*
Validation File Number | Standard Data Fields | Performance | Reviewer 1 | Reviewer 2 | Reviewer 3
1 | 573 | Correct | 536 | 449 | 451
  |     | Denominator | 554 | 504 | 508
  |     | Accuracy (%) | 96.8% | 89.1% | 88.8%
  |     | Major Errors | 0 | 4 | 1
2 | 416 | Correct | 410 | 388 | 377
  |     | Denominator | 415 | 407 | 400
  |     | Accuracy (%) | 98.8% | 95.3% | 94.3%
  |     | Major Errors | 0 | 1 | 1
3 | 848 | Correct | 807 | 661 | 653
  |     | Denominator | 829 | 737 | 739
  |     | Accuracy (%) | 97.3% | 89.7% | 88.4%
  |     | Major Errors | 0 | 2 | 2
4 | 794 | Correct | 716 | 687 | 648
  |     | Denominator | 756 | 735 | 716
  |     | Accuracy (%) | 94.7% | 93.5% | 90.5%
  |     | Major Errors | 0 | 3 | 5
5 | 842 | Correct | 756 | 626 | 635
  |     | Denominator | 799 | 719 | 723
  |     | Accuracy (%) | 94.6% | 87.1% | 87.8%
  |     | Major Errors | 0 | 5 | 3
6 | 411 | Correct | 371 | 319 | 312
  |     | Denominator | 380 | 353 | 357
  |     | Accuracy (%) | 97.6% | 90.4% | 87.4%
  |     | Major Errors | 0 | 2 | 0
7 | 362 | Correct | 356 | 321 | 285
  |     | Denominator | 362 | 341 | 310
  |     | Accuracy (%) | 98.3% | 94.1% | 91.9%
  |     | Major Errors | 0 | 1 | 1
8 | 821 | Correct | 765 | 722 | 697
  |     | Denominator | 794 | 776 | 760
  |     | Accuracy (%) | 96.3% | 93.0% | 91.7%
  |     | Major Errors | 5 | 1 | 3
9 | 872 | Correct | 799 | 769 | 709
  |     | Denominator | 829 | 821 | 797
  |     | Accuracy (%) | 96.4% | 93.7% | 89.0%
  |     | Major Errors | 2 | 5 | 9
10 | 837 | Correct | 800 | 756 | 736
  |     | Denominator | 819 | 795 | 786
  |     | Accuracy (%) | 97.7% | 95.1% | 93.6%
  |     | Major Errors | 0 | 0 | 0
11 | 347 | Correct | 334 | 302 | 293
  |     | Denominator | 342 | 317 | 313
  |     | Accuracy (%) | 97.7% | 95.3% | 93.6%
  |     | Major Errors | 0 | 1 | 1
12 | 472 | Correct | 432 | 383 | 402
  |     | Denominator | 448 | 420 | 438
  |     | Accuracy (%) | 96.4% | 91.2% | 91.8%
  |     | Major Errors | 0 | 0 | 0
13 | 565 | Correct | 539 | 518 | 514
  |     | Denominator | 547 | 539 | 553
  |     | Accuracy (%) | 98.5% | 96.1% | 92.9%
  |     | Major Errors | 0 | 0 | 1

* Unless otherwise noted, the numbers in the table represent the number of data fields.

As mentioned in section 2.5.4, if file reviewers were second or third in line to review the validation file, they would be aware that their work would be evaluated.  This may have resulted in the second and third file reviewers completing the form with more care and precision than they normally would.  If such an effect were present, the validation file assessment of the file reviewer accuracy would overestimate the true accuracy of the file reviewers. 

In order to assess for this “accuracy bias”, the performance of the file reviewer when they were a first (“blinded”) reviewer of a validation file was compared to their performance when they were a second or third reviewer on a validation file (“unblinded”).  The results are summarized in Table B2.  If an accuracy bias was present, it seems to have had only a minimal effect and was largely restricted to Reviewer 2, who had a 3% higher average accuracy when “unblinded”, as compared to a 0.6% higher average accuracy for Reviewer 1 and a 1.9% lower average accuracy for Reviewer 3.  When the performance of all three file reviewers was considered together, the average accuracy differed by only 0.2% between the “blinded” first review (93.3%) and the “unblinded” second or third review (93.5%).  These results suggest that the validation file accuracy scores were a reasonable reflection of the accuracy of the file reviewers over the course of the entire study when reviewing non-validation files.

Table B2: Comparison of File Reviewer performance as a function of the order in which they reviewed the file:  files where the reviewer was the first to review (“blinded”) versus files where the reviewer was the second or third reviewer (“unblinded”)
Reviewer | Review Order | Validation File Numbers | Total Correct | Total Denominator | Accuracy Score (%)
Reviewer 1 | 1st (“blinded”) | 1, 2, 4, 9, 12 | 2893 | 3002 | 96.4%
Reviewer 1 | 2nd or 3rd (“unblinded”) | 3, 5, 6, 7, 8, 10, 11, 13 | 4728 | 4872 | 97.0%
Reviewer 2 | 1st (“blinded”) | 3, 5, 7, 11 | 1910 | 2114 | 90.4%
Reviewer 2 | 2nd or 3rd (“unblinded”) | 1, 2, 4, 6, 8, 9, 10, 12, 13 | 4991 | 5350 | 93.3%
Reviewer 3 | 1st (“blinded”) | 6, 8, 10, 13 | 2259 | 2456 | 92.0%
Reviewer 3 | 2nd or 3rd (“unblinded”) | 1, 2, 3, 4, 5, 7, 9, 11, 12 | 4453 | 4944 | 90.1%

Similar to the effect of validation file review order, the effect of file complexity was also assessed.  A marker of validation file complexity was the number of data fields that were completed in the standard.  The greater the number of data fields, the greater the amount of information that had to be abstracted from subject medical records and ancillary information.  This would have increased the time required to complete the data collection form, and fatigue or decreased concentration over time may have had a negative impact on file reviewer accuracy.

In order to assess the effect of file complexity, reviewer performance was compared between the three validation files with the fewest standard data fields and the three validation files with the most standard data fields.  The results are shown in Table B3.  As predicted, the average accuracy score for the files with the highest number of standard data fields was slightly less (1.8 to 2.9%) than the average accuracy score for the files with the lowest number of standard data fields, and this was a consistent finding for all three file reviewers.  When the performance of all three file reviewers was considered together, the average accuracy was 94.1% for the three validation files with the fewest standard data fields, which was 2.4% greater than the average accuracy of 91.7% for the three validation files with the most standard data fields.

Table B3: Comparison of File Reviewer performance as a function of the length/complexity of the validation file, based on the number of data fields
Reviewer | File Group | Validation File Numbers | Total Correct | Total Denominator | Accuracy Score (%)
Reviewer 1 | Fewest Data Fields | 6, 7, 11 | 1061 | 1084 | 97.9%
Reviewer 1 | Most Data Fields | 3, 5, 9 | 2362 | 2457 | 96.1%
Reviewer 2 | Fewest Data Fields | 6, 7, 11 | 942 | 1011 | 93.2%
Reviewer 2 | Most Data Fields | 3, 5, 9 | 2056 | 2277 | 90.3%
Reviewer 3 | Fewest Data Fields | 6, 7, 11 | 890 | 980 | 90.8%
Reviewer 3 | Most Data Fields | 3, 5, 9 | 1997 | 2259 | 88.4%

The possibility of learning effects on file reviewer accuracy was also assessed.  The average accuracy of the first three validation files that a file reviewer completed was compared to that of the last three validation files the reviewer completed.  Depending on the file reviewer, there was a seven- to twelve-month separation between the first and last validation files reviewed.

The results are shown in Table B4 and suggest a slight learning effect across all three file reviewers, most pronounced in Reviewer 2.  When the performance of all three file reviewers was considered together, the average accuracy was 92.2% for the first three validation files reviewed and 93.1% for the last three validation files reviewed.

Table B4: Comparison of File Reviewer performance over time:  first three validation files reviewed versus the last three validation files reviewed
Reviewer | File Group | Validation File Numbers | Total Correct | Total Denominator | Accuracy Score (%)
Reviewer 1 | First Files | 1, 2, 4 | 1662 | 1725 | 96.3%
Reviewer 1 | Last Files | 11, 12, 13 | 1305 | 1337 | 97.6%
Reviewer 2 | First Files | 1, 2, 3 | 1498 | 1648 | 90.9%
Reviewer 2 | Last Files | 11, 12, 13 | 1203 | 1276 | 94.3%
Reviewer 3 | First Files | 5, 6, 7 | 1232 | 1390 | 88.6%
Reviewer 3 | Last Files | 1, 3, 4 | 1752 | 1963 | 89.3%

File reviewer accuracy was also influenced by the type of variable (e.g., “date,” “numeric”, etc.).  This is summarized in Table B5.  Not unexpectedly, data fields that required more thought or interpretation on the part of the file reviewer (i.e., “numeric – judgment,” “text – judgment”) were somewhat more prone to error than data fields that required direct copying or selection from a fixed set of choices (i.e., “numeric,” “categorical,” “text – copy”), as indicated by the accuracy scores for the different variable types.

“Date” variables had the lowest accuracy scores, but this is not simply a reflection of inaccuracy on the part of the file reviewers in copying date information.  This accuracy score also takes into account omitted and created data errors that involved multiple related data fields (e.g., lab results such as complete blood counts, or investigations such as pulmonary function tests); if the wrong laboratory or investigation result was entered into the data collection form, that error was scored against the date variable.

Table B5: Overall File Reviewer performance for different variable types
Variable Type | Reviewer 1 (Average Accuracy Score, %) | Reviewer 2 (Average Accuracy Score, %) | Reviewer 3 (Average Accuracy Score, %) | All Reviewers Combined (Average Accuracy Score, %)
Date | 93.02 | 83.56 | 81.93 | 86.17
Numeric | 99.24 | 98.22 | 93.79 | 97.08
Numeric – judgment | 92.14 | 90.32 | 91.91 | 91.46
Categorical | 97.39 | 95.76 | 95.14 | 96.10
Text – copy | 98.87 | 93.18 | 93.25 | 95.10
Text – judgment | 96.83 | 91.04 | 90.44 | 92.77

B3.2 File Reviewer Agreement

File reviewer response score agreement for each of the validation files is summarized in Table B6.  On average, for just over 73% of data fields, all three file reviewers had the same response score.  For the vast majority of validation file data fields, this 100% agreement occurred because all three file reviewers had the correct data field entry as compared to the standard.  On rare occasions, 100% agreement occurred because all three file reviewers had the same type of error for a particular data field (e.g., “omitted data”).

Less than 1% of the total data fields for all validation files combined had no agreement between the three file reviewers (e.g., a situation where the three reviewers’ response scores for a particular data field were “correct,” “data error”, and “omitted data”).

The results of the agreement analyses suggest that, overall, the three file reviewers abstracted information from medical files and ancillary information in a similar manner.

Table B6: Summary of the File Reviewer agreement for all 13 validation files
File Reviewer Response Score Agreement (% of data fields)*
Validation File Number | 100% Agreement (3/3 with same response score) | 67% Agreement (2/3 with same response score) | No Agreement (different response score for all 3)
1 | 62.1% | 36.7% | 1.2%
2 | 86.1% | 14.0% | 0%
3 | 70.6% | 28.5% | 0.8%
4 | 71.8% | 27.9% | 0.3%
5 | 69.0% | 28.7% | 2.3%
6 | 58.3% | 41.4% | 0.2%
7 | 75.4% | 24.6% | 0%
8 | 76.7% | 22.3% | 1.0%
9 | 71.3% | 28.0% | 0.7%
10 | 81.5% | 18.0% | 0.6%
11 | 72.1% | 26.7% | 1.1%
12 | 71.7% | 26.6% | 1.7%
13 | 81.9% | 18.1% | 0%
All Combined | 73.1% | 26.1% | 0.8%

* Rows might not total to exactly 100% due to rounding
