ARCHIVED – An Examination of the Canadian Language Benchmark Data from the Citizenship Language Survey
Problems with the data set
The main problems that we have identified with the data set are the need for extensive data cleaning, flawed recording procedures, and shortcomings of the survey questions themselves. We will deal with each of these problems separately.
When we received the data in mid-January of 2007, we noted that the Toronto and Vancouver files had fewer questions than the files from the other cities, and that they were in SPSS format while the other files were in Excel. In order to make direct comparisons, all the data had to be merged into a single file, which required a great deal of data management. Considerable additional cleaning was also needed because of inconsistencies in data entry procedures. Furthermore, we recoded the occupation variables using the National Occupational Classification (NOC) system. These procedures were carried out between March and June of 2007. Statistical analyses were then performed on the resulting data set.
At the same time that we received the data, we also received documents entitled “CIC Language Surveys: Sample Development and Data Management” and “Description of Data Files.” Although these documents provided us with some background details, they were of limited value in helping us understand some of the data collection procedures.
We have resolved, to the best of our ability, some of the problems arising from inadequate data cleaning. The data from this project, now combined into a single file with consistent coding, are available to CIC for further perusal. Two of the larger problems encountered in the cleaning phase were as follows:
- The data included incompatible Excel and SPSS files, requiring extensive manipulations prior to merging
- Extensive recoding had to be carried out. For example, there were nine different codes for sex! This was a relatively simple irregularity to correct; other modifications, however, were far more complex. Much of the trouble with the data set was due to faulty data collection procedures that will be discussed in the next section.
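The kind of recoding described in the second point can be sketched in a few lines. This is only an illustration under stated assumptions: the report does not list the nine codes that were found, so the raw variants in `SEX_CODE_MAP`, the canonical values "M" and "F", and the function name `recode_sex` are all hypothetical.

```python
# Hypothetical raw variants mapped to one canonical code each.
# The report does not list the nine actual codes; these are placeholders.
SEX_CODE_MAP = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


def recode_sex(raw):
    """Return 'M' or 'F', or None when the value is missing or unrecognized."""
    if raw is None:
        return None
    # Normalize case and surrounding whitespace before looking the value up.
    key = str(raw).strip().lower()
    return SEX_CODE_MAP.get(key)


raw_values = ["M", "male", " F ", 1, "Female", "", "x"]
recoded = [recode_sex(value) for value in raw_values]
```

Unrecognized values are returned as `None` rather than guessed, so that they can be counted and reviewed by hand.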
There are high numbers of non-responses or missing data across many categories. Perhaps many of the participants did not understand particular questions and thus could not respond, but the numbers are so high that we suspect that there is assessor error involved here as well.
Another assessor-related problem was the inconsistent coding of data throughout. Although some data cleaning is inevitable in this type of work, assessor training in this area might have eliminated the high degree of variability in the coding of responses, and would have cut down on the number of hours required to clean the data. For instance, the many spelling irregularities made it impossible to automatically convert string to numeric variables, and inconsistent recording of dates had to be rectified manually. Furthermore, there were instances of mixing of string and numeric coding across cities, which meant that city files could not be merged until recoding was carried out.
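Where dates follow a small number of recognizable patterns, part of the manual work can in principle be automated. The sketch below is a minimal illustration, not the procedure we used: the formats in `CANDIDATE_FORMATS` and the function name `normalize_date` are assumptions, and a purely numeric date such as 03/11/2006 remains ambiguous (day first or month first) unless the recording convention is known.

```python
from datetime import datetime

# Assumed formats, tried in order. The day/month order for numeric dates
# is an assumption and must be confirmed against the recording convention.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]


def normalize_date(raw):
    """Return the date as an ISO string (YYYY-MM-DD), or None if no format fits."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match; try the next one
    return None  # left for manual review


examples = ["2006-11-03", "03/11/2006", "November 3, 2006", "3 Nov 2006", "last fall"]
normalized = [normalize_date(value) for value in examples]
```

Entries that match none of the formats come back as `None`, which keeps the manual step but limits it to the cases that really need it.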
A related problem we encountered was the assignment of the same ID numbers to individuals in different cities. This apparently happened because participants were assigned IDs at the testing sites rather than through a more centralized process.
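One common way to work around ID collisions of this kind after the fact is to build a composite identifier from the city and the locally assigned ID. The following sketch assumes each record carries a `city` and an `id` field; the field names, the sample records, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical records: the same local ID was issued in two cities.
records = [
    {"city": "Toronto", "id": 101},
    {"city": "Vancouver", "id": 101},
    {"city": "Toronto", "id": 102},
]


def add_global_id(rows):
    """Return copies of the rows with a city-prefixed identifier added."""
    return [{**row, "global_id": f"{row['city']}-{row['id']}"} for row in rows]


def duplicated(rows, key):
    """Return the sorted values of `key` that occur in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


clashing_ids = duplicated(records, "id")
remaining = duplicated(add_global_id(records), "global_id")
```

Running the duplicate check before and after confirms that the composite key is unique. Centralized ID assignment at the time of data collection would make this step unnecessary.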
Some responses to questions indicated inadequate probing on the part of the assessors. For example, responses to the question about current occupation included such things as ‘works at the Bay’, ‘technician’, and ‘owner’. These vague descriptors cannot be interpreted or classified into the NOC system. For example, an employee at the Bay could work in maintenance, food service, clerical, sales and service, or in management. An example of another question that would have benefited from more careful probing was language – responses included non-existent languages, e.g., Swiss (Switzerland has four official languages, none of which is properly called ‘Swiss’), and multiple responses to questions that required a single answer, e.g., language used most often at work. Another type of inadequate probing was a failure to collect the full complement of information on questions with multiple parts. For example, many participants who reported having accessed language training did not provide information as to type of training.
It is unclear why, in a study of second language acquisition, native speakers of English and French were surveyed so extensively. It appears that considerable resources were devoted to collecting data that would be of little value.
In order to efficiently collect data that will be useful in planning better language programming, a focus should be placed on questions that are pertinent to the issues being investigated. In this study, it appeared as though a number of disparate areas were covered. Not only were demographic and language questions asked, but citizenship questions that appeared to have little relevance to matters of language learning were also included. It is unclear why data were collected on citizenship matters (e.g., name of judge) in a study that was ostensibly conducted to gauge language development, particularly when Citizenship and Immigration Canada has other records on citizenship pass rates, etc. It would be a better use of resources to restrict the survey to questions that are directly relevant to the purpose of the study. Adding more to the study increases time, demands on participants and assessors, and overall cost.
Another example of an insufficiently focused question concerns language at work. Participants were asked which language they use most frequently, but the question gives no indication of the nature of that language use, for instance the types of tasks required at work: using formulaic language (e.g., a waitress uses the same phrases over and over), answering the phone, reading on the job, interacting with co-workers, training others, making formal oral presentations, writing correspondence and reports, etc.
Because of the way the question about current employment status was worded, it is not possible to determine which participants were unemployed versus not working by choice. This question could have been worded differently to elicit more useful information.
Some questions were ambiguous, such that participants may have had difficulty knowing how to interpret them. For example, when queried about language training, participants were asked whether they took LINC, fee-based, or high school/college/university courses. The last of these categories and fee-based instruction are not mutually exclusive. Moreover, LINC is sometimes offered at institutions that identify themselves as colleges. Concepts such as full-time versus part-time training and ‘continuing education’ are also highly problematic because of the wide range of interpretations that can apply to these terms. The only useful measure of amount of language training is number of hours of contact. Part-time attendance could entail a very small number of hours, or a very large number of hours per week.
On the other hand, there were insufficient questions about participants’ experiences before coming to Canada, such as previous education and prior occupation. Furthermore, the participants who reported having received language training should have been asked about their CLBA score when they were originally tested. Although some individuals might not have accurately recalled their score, most would be able to provide helpful information. The lack of this information makes it impossible to assess actual language progress among the participant group after their arrival in Canada. Had these types of questions been included, the usefulness of the data set would have been greatly enhanced, and stronger conclusions could have been reached regarding the effectiveness of the language training the participants received.
The data collection in Toronto and Vancouver was completed prior to the data collection in the other four cities. Subsequently, changes were made to the questionnaire, resulting in incompatible data files. In order to make direct comparisons across all cities, it was necessary for us to exclude some of the information that was added midstream in the data collection process.
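The restriction to shared questions can be illustrated as follows. This is a sketch under stated assumptions rather than a record of the actual merge: the field names, the placeholder city label "CityX", and the helper names are hypothetical, and the real files first had to be converted from SPSS and Excel.

```python
def shared_fields(files):
    """Return the set of fields present in every row of every city file.

    `files` maps a city name to a non-empty list of row dictionaries.
    """
    field_sets = [set(row) for rows in files.values() for row in rows]
    return set.intersection(*field_sets)


def merge_on_shared(files):
    """Merge all city files, keeping only the fields common to all of them."""
    keep = sorted(shared_fields(files))
    merged = []
    for city, rows in files.items():
        for row in rows:
            merged.append({"city": city, **{field: row[field] for field in keep}})
    return merged


# Hypothetical data: the later questionnaire has one extra question.
files = {
    "Toronto": [{"id": 1, "sex": "F"}],
    "CityX": [{"id": 1, "sex": "M", "added_question": "yes"}],
}
merged = merge_on_shared(files)
```

The question added midstream is dropped from the merged file, which is the price of comparability across all cities.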