User testing is qualitative research. Its main purpose is to find and remove problems. Given this, I have always assumed that removing negatives is good for the design, even if we cannot know the true frequency of those problems in the wider population. That is, we know difficulty in the lab will mean difficulty in the real world, even if we don't know whether 20% or 60% of people will have that problem.
I am less sure about applying this assumption to positive results. I have reported positive findings to show balance and to highlight what is working well, in order to encourage teams and preserve successful features (as www.usability.gov argues too). But I get increasingly uncomfortable when things such as colour preferences are extrapolated to the wider user population.
The literature on qualitative research identifies rich, specific content in a specific context as its key strength. The general argument is not that we seek to generalise to a wider population, but that we develop, or generalise to, a theory (Bryman, 2008; Creswell, 2009; Grbich, 2007).
I would argue that the temptation to generalise to a wider population is inherent. Indeed, clients will ask: why on earth should I do any research if I cannot generalise to my wider customers? Why run a focus group if we cannot generalise beyond the session?
Williams (2000, in Bryman, 2008) argues that 'moderatum generalizations' are allowable: linkages can be made to similar groups. For example, the behaviour of football hooligans at one club can be related to case studies of hooligans at other clubs.
At a broader level, I would assume that one of the points of creating theory in qualitative studies is that it is generalisable. I find the idea of 'theoretical sampling and saturation' interesting: you sample, collect data and analyse until only repeated information emerges (from Grounded Theory: Glaser & Strauss, 1967; Strauss & Corbin, 1988). Given this, if we have a consistent analysis of a positive interaction, we can assume it will work in general. We are not making a conclusion about frequency but a deeper, more abstract judgment; for example, the button 'affords' clicking through its visual design and is therefore a good design feature. However, to what level do we really theorise in user testing?
So where does this leave us? Feedback on interaction is difficult to get via other methodologies. I still want to report interactions that are working well, otherwise I fear the design will be paralysed in continual redesign from scratch. We do violate the principles of generalisation: we are assuming that if everyone in the session understands the check-out process, the wider user population will too.
However, with other types of feedback, such as preferences, perhaps we should not take a 'some information is better than none' approach. User testing can generate hypotheses to be examined further with other methods such as A/B testing, surveys or web analytics. For example, if 6 out of 8 people liked the content tone, that is an indication it could be working, but it is not a definitive 'yes'. This is where I'm a fan of triangulation: using multiple data points and methodologies to get a clearer picture of the state of the world.
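To make the '6 out of 8 is not a definitive yes' point concrete, here is a minimal sketch. It assumes (purely for illustration) that we treat the session as a tiny binomial sample and compute a Wilson score confidence interval for the underlying preference rate; the function name and the choice of interval are mine, not from the original post.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 6 of 8 participants liked the content tone: 75% observed...
lo, hi = wilson_interval(6, 8)
# ...but the plausible range for the wider population is roughly 41% to 93%
print(f"observed 75%, plausible range {lo:.0%} to {hi:.0%}")
```

The interval spans from well under half to nearly everyone, which is exactly why a small-session preference count is a hypothesis to triangulate with A/B tests or surveys, not a conclusion.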
I haven't decided what I think about this topic: how can we be pragmatic and helpful without misinforming? I definitely welcome discussion.
References:
Bryman, Alan (2008). Social Research Methods. 3rd edn. Oxford University Press.
Creswell, John W. (2009). Research Design. Sage Publications.
Grbich, Carol (2007). Qualitative data analysis. Sage Publications.