Standards for acceptable levels of reliability vary across experts, but some general guidelines can be provided (Clark & Watson, 1995; Nunnally & Bernstein, 1994):
The preceding standards for evaluating reliability are reasonable for multi-item scales that contain 8-10 items or more, particularly if each item is rated on a Likert-type scale with a range of 1-4, 1-5, or wider.
Scales with few items or with more restricted rating scales (e.g., dichotomous scales) will likely have fewer points of discrimination among participants, and this may lead to lower levels of reliability (e.g., reliability between .50 and .60). Such scales may still be usable for research purposes, but the “proof is in the pudding” (i.e., the usability of such scales will depend on whether they are strongly related to other measures as hypothesized).
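The link between the number of items and reliability can be illustrated with the Spearman-Brown prophecy formula. The text above does not name this formula; it is used here only as a rough sketch, and it assumes that the items are parallel (equally reliable and equally intercorrelated), which real scales rarely satisfy exactly. The function name `spearman_brown` and the starting reliability of .80 for a 10-item scale are hypothetical values chosen for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after multiplying scale length by length_factor.

    Assumes parallel items (an idealization made for this sketch).
    length_factor = new number of items / original number of items.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical example: a 10-item scale with reliability .80,
# shortened to 5 and then 3 items.
full_scale_reliability = 0.80
original_items = 10

for n_items in (10, 5, 3):
    predicted = spearman_brown(full_scale_reliability, n_items / original_items)
    print(f"{n_items:2d} items: predicted reliability = {predicted:.2f}")
```

Under these assumptions, cutting the hypothetical scale from 10 items to 3 moves the predicted reliability from .80 to roughly .55, which falls in the .50 to .60 range mentioned above for short scales.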
In addition, measuring devices to be used in high-stakes decision-making, such as deciding whether a person has mental retardation, should have very high levels of reliability, preferably above .95. Traditional individually administered tests of intelligence tend to attain this level of reliability.