
Winner of the Marie-Schlei-Preis 2020

Dissertation without thematic restriction

Dr. Tanja Kutscher

Career

  • Since 1/2019: research associate in the Scaling and Test Design unit at the Leibniz-Institut für Bildungsverläufe (Leibniz Institute for Educational Trajectories) in Bamberg
  • 2011-2018: research associate in the Methods and Evaluation unit at Freie Universität Berlin
  • 2003-2010: studied psychology at Freie Universität Berlin

Dissertation: Measuring job satisfaction with rating scales: Problems and remedies


Job satisfaction is an aspect of cognitive well-being and one of the standard indicators of quality of life. A job satisfaction measure is included in several national panel surveys. Assessing job satisfaction with a precise and valid measure is a prerequisite for obtaining accurate analysis results and drawing valid conclusions. However, an inadequately designed response format can impair the way respondents answer the questions, and there is reason to suspect that the 11-point rating scale routinely used in national panel surveys for assessing cognitive well-being is problematic. Respondents may be overwhelmed by the large number of response categories and may therefore cope with the increased response burden by using response styles (e.g., overusing particular response categories) and other types of inappropriate category use (e.g., careless responding or ignoring irrelevant or unclear categories). Consequently, data provided by panel surveys may be of reduced quality. Thus, the research in the present dissertation aimed first to investigate whether an 11-point rating scale is adequate for a valid assessment of job satisfaction, one of the relevant life domains. Given the lack of existing evidence, the second aim was to examine the performance of mixed polytomous item response theory (IRT) models when applied to detect inappropriate category use under data conditions typical of panel surveys with a job satisfaction measure. The third aim was to study whether a rating scale with fewer response categories is better suited to measuring job satisfaction. The fourth aim was to describe the personal profiles of response-style users in terms of personality traits, cognitive ability, socio-demographic variables, and contextual factors. Identifying these profiles is important because a person’s use of a specific response style can occur consistently across different traits and rating scales and is therefore considered a type of disposition.
To examine the adequacy of an 11-point rating scale, we explored patterns of category use in the data on job satisfaction provided by the Household, Income and Labour Dynamics in Australia (HILDA) survey (first wave, n = 7,036). For this purpose, mixed polytomous IRT models were applied. The analyses showed that most respondents (60%) overused the extreme response categories (i.e., adopted an extreme response style [ERS]) or the two lowest and two highest categories (i.e., adopted a so-called semi-extreme response style [semi-ERS]), whereas the others demonstrated more appropriate response behavior (a so-called differential response style [DRS]). Moreover, all respondents ignored many response categories, especially those who exhibited the ERS and semi-ERS. These findings emphasize the limited adequacy of a long rating scale for assessing job satisfaction, given the high prevalence of inappropriate category use. Generally speaking, an 11-point rating scale does not allow one to assess fine-grained differences between respondents in their levels of job satisfaction, as intended by the developers of panel surveys. Instead, this rating scale seems to overburden respondents with superfluous response categories and to evoke response styles because respondents have difficulty determining the meaning of such fine-grained categories. To conclude, a rating scale with fewer response categories may be more suitable.
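To illustrate how a mixed polytomous IRT model can separate patterns of category use, the following minimal sketch computes category probabilities under the partial credit model for two hypothetical latent classes. The function name `pcm_probabilities`, the 5-category item, and all threshold values are assumptions chosen for illustration; they are not estimates from the HILDA data, where an 11-point scale would involve ten thresholds per item.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta: latent trait value of the respondent.
    thresholds: one threshold per step between adjacent categories.
    Returns a list with len(thresholds) + 1 probabilities.
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical class-specific thresholds (illustration only).
class_thresholds = {
    "ordered thresholds": [-1.5, -0.5, 0.5, 1.5],
    "reversed thresholds": [1.0, 0.3, -0.3, -1.0],
}

for label, thresholds in class_thresholds.items():
    probabilities = pcm_probabilities(0.0, thresholds)
    print(label, [round(p, 3) for p in probabilities])
```

With ordered thresholds, the middle category is the most likely response for an average respondent. With reversed thresholds, the two extreme categories dominate, which is the kind of pattern an ERS class would show. In a mixed model, such thresholds are estimated separately for each latent class.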
To address the second aim, a Monte Carlo simulation study was conducted. It included two models: the mixed partial credit model (mPCM; Rost, 1997) and the restricted mixed generalized partial credit model (rmGPCM; cf. the GPCM, Muraki, 1997, and the mGPCM, von Davier & Yamamoto, 2004). These models are suitable for detecting patterns of inappropriate category use. The latter model is more complex and includes freely estimated item discrimination parameters (which are, however, restricted to be class-invariant). In particular, the simulation study focused on identifying the sample size required for a proper application of these models. In addition, we investigated which information criteria (AIC, BIC, CAIC, AIC3, and SABIC) are effective for model selection. The analyses showed that both models performed appropriately with at least 2,500 observations. Increasing the sample size further yielded more accurate parameter and standard error estimates. Generally, the simulation study revealed that the mPCM performed slightly better than the rmGPCM. Both models, however, showed estimation problems when category frequencies were low, leading to inaccurate estimates. For the recommended sample size, the AIC3 and the SABIC were the most suitable criteria. For large sample sizes (at least 4,500 cases), the BIC and the CAIC were effective. The AIC, however, was insufficiently accurate.
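The information criteria named above all combine the model deviance with a penalty on the number of parameters. The following sketch computes them from their standard definitions. The function name `information_criteria` and the candidate models with their log-likelihoods and parameter counts are made up for illustration and are not results from the simulation study.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, AIC3, BIC, CAIC, and SABIC for a fitted model."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "AIC3": deviance + 3 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "CAIC": deviance + n_params * (math.log(n_obs) + 1),
        # Sample-size-adjusted BIC replaces n with (n + 2) / 24.
        "SABIC": deviance + n_params * math.log((n_obs + 2) / 24),
    }


# Hypothetical candidates: (log-likelihood, number of free parameters).
candidates = {
    "2 classes": (-10450.0, 41),
    "3 classes": (-10380.0, 62),
    "4 classes": (-10360.0, 83),
}
n_obs = 2500

for criterion in ["AIC", "AIC3", "BIC", "CAIC", "SABIC"]:
    best = min(
        candidates,
        key=lambda name: information_criteria(*candidates[name], n_obs)[criterion],
    )
    print(criterion, "prefers", best)
```

Lower values indicate a better trade-off between fit and complexity. Because the penalties differ, the criteria can favor different numbers of latent classes for the same data, which is why their accuracy was compared in the simulation study.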
For the third aim, a randomized experimental study with a between-subjects design was conducted to compare the performance of two short rating scales (with 4 and 6 response categories) with that of a long rating scale (11 response categories) with regard to the presence of inappropriate category use and reliability (N = 6,999 employees from the USA). For this purpose, the multidimensional mixed polytomous IRT model was applied. Notably, the results from the simulation study were used in preparing this study (e.g., regarding the minimum sample size required within an experimental condition). Overall, when the rating scale was short, both the proportion of respondents who used a specific response style and the number of ignored response categories were reduced, indicating less bias in data collected with short rating scales. This finding supported the assumption that some respondents use response styles as an adjustment strategy when the number of response categories offered is inadequately large. Interestingly, the same response styles were present regardless of rating scale length, suggesting that optimizing rating scale length can only partly prevent inappropriate category use. Apparently, a proportion of the respondents use a particular response style due to dispositions.
To attain the fourth aim, the personal profiles of respondents who used a particular response style were investigated with two datasets: (i) a small set of potential predictors that were available in the HILDA survey (socio-demographic variables and job-related factors); and (ii) several relevant scales and variables (personality traits, cognitive ability, socio-demographic variables, and job-related factors) that were collected in the experimental study specifically for this purpose. For both datasets, the assignment of respondents to latent classes indicating different response styles was the outcome variable. The analyses were conducted using multinomial logistic regressions. Accordingly, the findings obtained on the basis of the first dataset provided the response-format-specific characteristics of response-style users (for the 11-point rating scale). By contrast, the second analysis made it possible to distinguish general predictors, which explained the use of a particular response style regardless of rating scale length, from response-format-specific predictors, which explained the occurrence of a response style for a certain rating scale. Specifically, the general predictors found for ERS use included a high level of general self-efficacy and self-perceived job autonomy; for non-ERS use, a tendency to avoid extreme categories, a low need for cognition was the general predictor. This indicates that response styles can be caused by dispositions and can therefore hardly be prevented by optimizing the features of a rating scale. The predictors specific to a particular response format were socio-demographic variables, cognitive abilities, and certain job-related factors, suggesting that the profiles of respondents who used a particular response style vary depending on the rating scale administered to collect the data. Presumably, these groups of predictors primarily characterize respondents who are inclined to use response styles as an adjustment strategy due to an inadequately designed rating scale.
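As a rough sketch of how a multinomial logistic regression links predictors to latent class membership, the following example converts a respondent's predictor values into class probabilities. The function name `class_probabilities`, the choice of DRS as the reference class, the two standardized predictors, and all coefficient values are hypothetical. Only the signs follow the direction of the reported findings, and estimating the coefficients from data is not shown.

```python
import math


def class_probabilities(predictors, coefficients):
    """Multinomial logit probabilities of each class for one respondent.

    predictors: list of predictor values.
    coefficients: maps class label -> (intercept, list of weights);
    the reference class has an intercept and weights of zero.
    """
    scores = {}
    for label, (intercept, weights) in coefficients.items():
        scores[label] = intercept + sum(w * x for w, x in zip(weights, predictors))
    # Softmax over the class scores, shifted by the maximum for stability.
    peak = max(scores.values())
    exp_scores = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}


# Made-up coefficients; predictors are [self_efficacy_z, need_for_cognition_z].
coefficients = {
    "DRS": (0.0, [0.0, 0.0]),       # reference class (assumption)
    "ERS": (-0.2, [0.6, 0.0]),      # higher self-efficacy -> more likely ERS
    "non-ERS": (-0.5, [0.0, -0.4]),  # lower need for cognition -> more likely non-ERS
}

for self_efficacy in (-1.0, 0.0, 1.0):
    probabilities = class_probabilities([self_efficacy, 0.0], coefficients)
    print(self_efficacy, {k: round(v, 3) for k, v in probabilities.items()})
```

Each non-reference class has its own set of coefficients, so a predictor can raise the odds of one response style without affecting another.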
In sum, an 11-point rating scale was shown to have serious shortcomings, including a high proportion of respondents with response styles and many ignored response categories. Therefore, this rating scale is of limited adequacy for a valid assessment of job satisfaction (and other aspects of cognitive well-being). By contrast, the 4- and 6-point rating scales performed better with regard to the presence of inappropriate category use. These short rating scales were found to have fewer respondents using response styles and to include almost no redundant response categories. Thus, these shorter rating scales are more adequate for this purpose. Generally, shorter rating scales eliminated the inappropriate category use that is primarily measure-dependent. Nevertheless, the same response styles were present in the data regardless of rating scale length, suggesting that stable dispositions may be another major cause of response styles. Furthermore, some of these personal characteristics were identified (as general predictors). For example, ERS use could be explained by a high level of general self-efficacy and self-perceived job autonomy. Therefore, optimizing the rating scale alone may not be sufficient to eliminate effects caused by the consistent use of response styles. In this case, statistical approaches for controlling the effects of response styles should be applied. Mixed polytomous IRT models are a promising approach for dealing with inappropriate category use.

Muraki, E. (1997). A generalized partial credit model. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 153-164). New York: Springer.

Rost, J. (1997). Logistic mixture models. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 449-463). New York: Springer.

von Davier, M., & Yamamoto, K. (2004). Partially observed mixtures of IRT models: An extension of the generalized partial-credit model. Applied Psychological Measurement, 28, 389-406. doi: 10.1177/0146621604268734

Corresponding publications

  • Kutscher, T., Crayen, C., & Eid, M. (2017). Using a Mixed IRT Model to Assess the Scale Usage in the Measurement of Job Satisfaction. Frontiers in Psychology, 7, 1998. https://doi.org/10.3389/fpsyg.2016.01998
  • Kutscher, T., & Eid, M. (2020). The Effect of Rating Scale Length on the Occurrence of Inappropriate Category Use for the Assessment of Job Satisfaction: an Experimental Online Study. Journal of Well-Being Assessment, 4, 1-35. https://doi.org/10.1007/s41543-020-00024-2
  • Kutscher, T., Eid, M., & Crayen, C. (2019). Sample Size Requirements for Applying Mixed Polytomous Item Response Models: Results of a Monte Carlo Simulation Study. Frontiers in Psychology, 10, 2494. https://doi.org/10.3389/fpsyg.2019.02494