Extracting Configurations of Values Mixing Scores From Experts and Ignoramus Using Bayesian Modeling

The article proposes a method for producing configurations of values in firms. Values have an impact on the long-term survival of businesses and guide managerial decision-making. The method produces cross-comparable latent rates of configurations of values. Data come from a pool of 37 firms rated by both experts and ignoramus. By using Bayesian inference the researcher can tune the expert rater bias. This generates robust estimates using a clear, overt and systematic procedure. The model is compared with the mean of raters. It produces lower ratings for Economic-Pragmatic values and higher ratings for Ethical-Social values.


INTRODUCTION
Would you invest in a company that produces an environmentally friendly product following strict social and ethical standards, but whose profitability is weak? Would you work for a company that generates profits at the expense of using its employees as a mere labor force, with no long-term plans? Would you buy products of a company that sells them very cheap, but at the expense of environmental costs? The way in which companies internalize the motives, aspirations, and values of their environment (investors, employees, and consumers) reflects their own configuration of values and makes it possible to predict their long-term survival. A non-balanced configuration of values, namely, one that puts economic profit above all and discards any social concern or any respect for its employees, is problematic for the management of a long-term project. The rationale is that certain configurations of values can help predict the sustainability of businesses in the long run [1][2][3]. Extracting the configurations of values that are representative of a company is a useful task for any manager, both to address day-to-day issues and to project the mid- and long-term evolution of the firm.
The objective of the article is to present a system to extract configurations of values for several firms deeply analyzed in the context of a large-scale European research project on user-innovation and sustainability. The process, however, must deal with two features that are relevant to allow its generalization: cross-comparison and expert weight. First, in order to use configurations of values as explanatory variables in other contexts (say, to predict business performance, business survival, business alignment with societal values, etc.), it must be assured that cross-comparison between cases is possible and rigorous [4,5]. Second, when incorporating information from several sources, it must be possible to weight sources that are experts on specific cases against sources that have only superficial knowledge about them. In this case, it must be assured that expert elicitation and ignoramus are properly weighted [6].
The article proposes a method that extracts configurations of values from firms by combining assessments performed by several individuals. By doing so, the model makes an empirical contribution to measurement research by combining the following features: First, values are not directly measurable, but only guessed, so the model is explicit about the latent nature of such a construct. Second, the method overtly deals with information provided by experts, to whom more reliability is assigned a priori vis-à-vis information provided by ignoramus, whose input is also considered and weighted accordingly. Third, the method uses a "brute force" approach to extract the signal contained in the rates of several individuals. It relies on having rates from several sources (even if they are less reliable than experts) rather than from a single (expert) source. In this sense, the potential lack of precision, the subjectivity of the experts, and the cross-expert comparison are weighted using raters that, without being experts, provide both the anchor for the cross-comparison between cases and the necessary amount of data. To sum up, using lots of noisy data, conveniently and systematically treated using a model, provides a way to generate robust estimates for a measurement scenario, in this case finding configurations of values in a firm.
The paper proceeds as follows: The first section presents the case studies that have been rated and the second how the task was performed. Then follows a presentation of the data and the procedures to extract values. The results of the procedures follow, along with a discussion of their implications. Conclusions are provided at the end.

CASE STUDIES
The case studies that comprise the corpus of the EU-InnovatE project are businesses selected employing a multiple case study design [7] and theoretically sampled following several criteria:
• The innovation creates economic plus social and/or ecological value, thereby enhancing sustainable lifestyles.
• The process, which includes invention and commercialization, is driven by a single user or a group of users.
• All phases of the user sustainability innovation process are covered.

VALUE RATING AND ASSESSMENT
Data was collected using a web application (see the Appendix in Data Sheet 1 for a screenshot). Every rater encountered a webpage with a list of 30 values. The 30 values come from a theoretical model called Management by Values developed by Dolan et al. [1] and are organized in three groups (called axes), namely Economic-Pragmatic, Emotional-Developmental, and Ethical-Social. This division has been tested and validated using several empirical methods [8,9]. Recently, it has been used to assess differences in the culture and values of public service organizations in old and new EU member states [3]. The values and the axis they belong to are presented in the Appendix in Data Sheet 1.
The raters were asked to do the following: Pick values from the original list and give them integer numbers, assigning a total of 10 points. So if you give 8 points to value A you can only assign the rest of the 2 points either to a single other value (B) or to two different values (C and D, with 1 point each).
The raters did not know whether each specific value was in the Economic-Pragmatic, the Emotional-Developmental or the Ethical-Social axis. Given the long list of values (30 individual values grouped in 3 axes), the researchers made the values appear in random order on the screen every time the browser was refreshed, in order to avoid favoring or penalizing values at the top or the bottom of the list.

Combination of Raters
There were two types of raters for each case study:
• Ignoramus: 4 fixed raters (JB-08, RF-13, BN-02, and LT-10) rated all the cases and provided the anchor of baseline values for comparison between cases.
• Expert: one rater per case, namely the original author of the case if reachable or, if not, a colleague who participated in the writing of the original report. There were 13 expert raters of the original case studies, as some of them rated more than one case.

Structure
As a result of this data collection process, a dataset with a total of 1,305 observations has been obtained, comprising the rates for each of the 30 possible values, for different raters and case studies. With this data, the scores of the individual values have been aggregated into the three groups of values (Economic-Pragmatic, Emotional-Developmental, and Ethical-Social) by adding the score given to each value to its respective group. For instance, a rater that has given 2 to competitiveness, 1 to economic success and 7 to creative energy is aggregated into 0.3 for Economic-Pragmatic (from the addition of 2+1 for the two values in that group) and 0.7 for Emotional-Developmental. So the data to be treated comprise a total of 555 observations in 37 case studies, where each case study (c) contains 15 data points [5 raters (r) * 3 axes of values (g)], and where the shares of the axes of values within each case study and rater must add up to one. Table 2 shows a sample of observations for illustration purposes.
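The aggregation step can be illustrated with a minimal sketch. It assumes only the three values named in the example above; the dictionary `AXIS_OF_VALUE` and the function `aggregate_rating` are hypothetical names introduced here, and the full list of 30 values is not reproduced.

```python
# Hypothetical mapping limited to the three values named in the text;
# the complete list of 30 values lives in the paper's appendix.
AXIS_OF_VALUE = {
    "competitiveness": "Economic-Pragmatic",
    "economic success": "Economic-Pragmatic",
    "creative energy": "Emotional-Developmental",
}
AXES = ("Economic-Pragmatic", "Emotional-Developmental", "Ethical-Social")
TOTAL_POINTS = 10  # every rater distributes exactly 10 points per case


def aggregate_rating(points_by_value):
    """Turn one rater's point allocation for one case into axis shares."""
    if sum(points_by_value.values()) != TOTAL_POINTS:
        raise ValueError("a rating must distribute exactly 10 points")
    points_by_axis = {axis: 0 for axis in AXES}
    for value, points in points_by_value.items():
        points_by_axis[AXIS_OF_VALUE[value]] += points
    # Dividing by the total makes the three shares add up to one.
    return {axis: pts / TOTAL_POINTS for axis, pts in points_by_axis.items()}


example = aggregate_rating(
    {"competitiveness": 2, "economic success": 1, "creative energy": 7}
)
print(example)

# Size of the aggregated dataset: 37 cases * 5 raters * 3 axes.
n_observations = 37 * 5 * 3
print(n_observations)
```

Summing the integer points before dividing keeps the shares free of floating-point accumulation error, so the sum-to-one restriction holds exactly for this example.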

Descriptive Statistics
The most basic descriptive distribution of the data is shown in Figure 1. It shows the distribution of case means by axes of values. The means are calculated using the 5 different raters that rate each case (4 global raters and 1 variable rater, who is the original author of the case). SR-31 is the case with the highest rated axis, with almost 3/4 of the configuration of values in the Economic-Pragmatic pole. The lowest band is mainly occupied by Emotional-Developmental cases (ME-18, U-34, SG-28, SA-32), but also includes one very low configuration of Ethical-Social values at CY-3.

Inter-Rater Agreement
Inter-rater agreement cannot be assessed using traditional correlation-based approaches, as the cases and axes are not rated independently. That is, correlations assume that each observation is independent from the rest, but in this case the raters were asked to provide configurations, so giving more weight to one value meant that less weight was given to a different value. Therefore, an agreement rate is defined simply as one minus the standard deviation of the 5 rates between raters for each case study and axis of values, $\text{Agreement}_{c,g} = 1 - \mathrm{sd}(\text{rate}_{c,g})$. This measure is directly interpretable in terms of percent of agreement. So a 0.9 agreement means that, on average, for that case and that specific axis of values the configuration varies by 10 percent (1 − 0.9 = 0.1) between raters. Figure 2 presents the agreement rates of each case and axis of values. It shows that most of the cases have agreement rates above 75 percent, with some cases having one of the groups of rates between 65 and 75 percent. The lowest agreement happens on the Economic-Pragmatic configuration of ZL-36, where 60 percent of agreement is reached.
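As a minimal sketch of the agreement rate defined above, the following computes one minus the standard deviation of five rates. The rates are made-up numbers, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import stdev


def agreement(rates):
    """One minus the sample standard deviation of the rates for one case and axis."""
    return 1.0 - stdev(rates)


# Hypothetical shares given by the 5 raters to one axis of one case.
rates = [0.3, 0.4, 0.2, 0.3, 0.3]
print(round(agreement(rates), 3))
```

When all five raters give the same share the standard deviation is zero and the agreement is exactly one.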

ESTIMATION OF CONFIGURATIONS OF VALUES
The extraction of the concrete configuration of values for each case study is performed using a latent variable model. Equation (1) describes the model used to extract the distribution of rates for each axis of values in each case, where $y_{c,r,g}$ are the aggregated responses of each rater r to a case c and axis of values g (column "percent" in Table 2). A visual representation of the model and the relationships between parameters is also depicted in Figure 3.
The response (that is, each rater's sum of the values that belong to the Economic-Pragmatic axis, to the Emotional-Developmental axis and to the Ethical-Social axis, shown in Table 2) is modeled using a scaled and shifted t distribution (essentially, a t distribution whose centrality and dispersion parameters are allowed to depart from the location zero and scale one of the standard t distribution), which makes it possible to control for extreme cases. Therefore, this is a robust model that gives less weight to rates that depart substantially from the average. The mean of the response is the sum of the latent configuration of values (the main parameters of interest, θ, the unobserved true value of every case in each group of values) and a rater bias (accounted for by ω). The latent configuration of values θ is modeled using a Dirichlet distribution (the multivariate generalization of the beta distribution), a functional form that restricts the sum of its components to one. This is convenient because the configuration of values is restricted to sum to 1.
The ω parameters control the rater bias, which is extracted from the model by comparing what every rater does in general with what the rest of the raters do. However, it is expected a priori that the expert raters will be more accurate in their evaluations of the cases (accounted for with different accuracies captured by λ), so the prior standard deviation of the rater bias varies depending on whether the rater is an expert or an ignoramus.
The ρ parameter is the weight of the certainty of experts over ignoramus. It can also be seen as a tuning parameter to assess the validity of different model specifications that give different weights to the experts over ignoramus, with the following possibilities:
• If ρ is given the prior used in the model (its log is distributed uniformly), it acts as an evaluator of what the data tell about the strength of the quality of experts over ignoramus.
• ρ can also be given a more restrictive prior that incorporates specific knowledge about the subject in the specific setting in which the model is applied. For instance, a bounded normal centered above or below one gives more weight to one group of raters over the other.
• In the extreme, ρ can be fixed to any value that the researcher wants. For instance, when ρ = 1 the researcher assumes that no priority must be given to any specific rater. Running the model with different fixed values would also make it possible to validate different model specifications.
So, to sum up, the validity of the model can be judged by using an internal parameter to the model (ρ), that acts like a meta-parameter and depending on its prior specification it either estimates the relationship between experts and ignoramus (as shown hereafter) or evaluates it. Table 3 summarizes the meaning and interpretation of the parameters to be estimated.
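The generative structure described in this section can be sketched as a small simulation. This is not the estimation code used in the paper (that is JAGS code in the appendix) and it does not reproduce Equation (1). It is a sketch under stated assumptions: the function names are hypothetical, the values of `sigma` and `nu` are arbitrary, the bias is drawn for a single rater and a single case, and the expert prior standard deviation is assumed to be the ignoramus one divided by ρ.

```python
import math
import random


def draw_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_scaled_t(mean, scale, df, rng):
    """Draw from a shifted and scaled Student t: normal over sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return mean + scale * z / math.sqrt(chi2 / df)


def simulate_rating(expert, rho, rng, sd_ignoramus=0.069, sigma=0.05, nu=5.0):
    """Simulate the three axis shares one rater reports for one case.

    sigma and nu are arbitrary illustrative values (assumptions).
    """
    theta = draw_dirichlet([1.0, 1.0, 1.0], rng)  # latent configuration, sums to one
    # Assumption: rho shrinks the prior sd of the bias for expert raters.
    sd_bias = sd_ignoramus / rho if expert else sd_ignoramus
    omega = [rng.gauss(0.0, sd_bias) for _ in theta]  # rater bias per axis
    y = [draw_scaled_t(t + w, sigma, nu, rng) for t, w in zip(theta, omega)]
    return theta, omega, y


rng = random.Random(1)
theta, omega, y = simulate_rating(expert=True, rho=1.6, rng=rng)
print(theta, y)
```

Setting `rho=1.0` in this sketch gives experts and ignoramus the same prior dispersion, which corresponds to the case where no priority is given to any specific rater.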
The model specification implies performing inference for a total of 185 parameters: 37 * 3 = 111 latent configurations of values (θ), 17 * 3 = 51 rater biases (ω), 3 error components (σ), 3 degrees of freedom for the robust model specification (ν) and 17 for the error components of the rater biases ($\sigma_{\omega}$). Recall that the total number of data points is 555, which makes a ratio of three data points observed for every parameter to be estimated (555/185).

Table 3. Meaning and interpretation of the parameters.
$\omega_{r,g}$: Rater bias. It is the bias produced by the fact that every rater tends to over-rate or under-rate certain dimensions.
$\sigma_{g}$: Dispersion of the dimensions. Can be understood as the average variation that every dimension of values shows when the raters perform their task.
$\nu_{g}$: Degrees of freedom of the observed scores. Can be interpreted as the likelihood of observing extreme rates in the data.
$\sigma_{\omega_r}$: Dispersion of the raters. Can be understood as the average difference between raters' rates. It accounts for the fact that for some raters we are more or less certain about their bias.
λ: Dispersion of raters' types. It is the different dispersion that the two groups of raters (ignoramus vs. experts) have.
ρ: Experts' over ignoramus certainty. Parameter that assesses to what extent we can be more confident in the rates provided by the experts than in those provided by the ignoramus.
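The parameter count and the data-to-parameter ratio stated above can be checked with a few lines of arithmetic. The variable names are illustrative; the counts themselves (37 cases, 4 + 13 = 17 raters, 3 axes, 5 raters per case) come from the text.

```python
# Counts stated in the text: 37 cases, 17 raters (4 ignoramus + 13 experts), 3 axes.
n_cases, n_raters, n_axes = 37, 4 + 13, 3
n_raters_per_case = 5

parameter_counts = {
    "theta (latent configurations)": n_cases * n_axes,
    "omega (rater biases)": n_raters * n_axes,
    "sigma (error components)": n_axes,
    "nu (degrees of freedom)": n_axes,
    "sigma_omega (dispersion of rater biases)": n_raters,
}
n_parameters = sum(parameter_counts.values())
n_data_points = n_cases * n_raters_per_case * n_axes

print(n_parameters, n_data_points, n_data_points / n_parameters)  # 185 555 3.0
```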
A model with the previous specification (hierarchical structure of the hyper-parameters) is difficult to estimate using classical frequentist approaches, but fits naturally in the Bayesian paradigm, where prior distributions can account for the nested structure of the hyper-priors and the low ratio of data over parameters. Therefore, Bayesian inference is used to extract the posterior distributions of the parameters of interest. More specifically, the parameters are obtained using Markov chain Monte Carlo (MCMC) methods with a Gibbs sampler. JAGS [10] has been used for the estimation (code available in the Appendix in Data Sheet 1). There is no evidence of non-convergence of the series according to the Geweke test or the potential scale reduction factor [13,14] (see the online appendix for the full report of convergence diagnostics and visualization of posterior distributions of all the parameters).
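To illustrate what the potential scale reduction factor measures, the following sketch implements the classic Gelman-Rubin formula for chains of equal length. It is not the diagnostic code used for the paper, and the chains are simulated here rather than taken from the fitted model; the function name is a hypothetical one.

```python
import random
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Classic Gelman-Rubin statistic for several chains of equal length."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)                   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n           # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Simulated draws standing in for MCMC output of a single parameter.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
stuck = [mixed[0], [x + 5.0 for x in mixed[1]]]  # second chain sits elsewhere

print(potential_scale_reduction(mixed))
print(potential_scale_reduction(stuck))
```

Values close to one indicate that the chains are sampling from the same distribution, while values well above one, as in the `stuck` example, signal non-convergence.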
Bayesian models require the researcher to provide prior distributions of the parameters of the model. Equation (1) shows that in general weakly informative priors were assigned. The only priors worth mentioning are those assigned to the θ distribution. In this case, the prior D(1, 1, 1) implies that the researcher expects that the configuration of values will be equally spread across axes of values (1/3 for each of the 3 axes of values), with an interquartile range between 13 and 50 percent for each of the axes. The remaining parameters have Uniform (U), Normal (Gaussian, N), and Gamma (G) prior distributions.
Figure 5 shows the mean and credible intervals (90 and 95 percent, with thick and thin lines, respectively) of the rater biases, ω, which represent the amount of bias that is expected for each rater compared to the latent value in each of the cases and axes of values.
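The interquartile range implied by the D(1, 1, 1) prior on θ can be verified in closed form. Each component of a flat Dirichlet over k axes follows a Beta(1, k − 1) distribution, whose cumulative distribution function is 1 − (1 − x)^(k − 1). The function name below is a hypothetical one.

```python
def flat_dirichlet_marginal_quantile(p, k=3):
    """Quantile of one component of a Dirichlet(1, ..., 1) over k axes.

    The marginal is Beta(1, k - 1), with CDF 1 - (1 - x) ** (k - 1).
    """
    return 1.0 - (1.0 - p) ** (1.0 / (k - 1))


q25 = flat_dirichlet_marginal_quantile(0.25)
q75 = flat_dirichlet_marginal_quantile(0.75)
prior_mean = 1.0 / 3.0  # mean of Beta(1, 2)

print(round(q25, 3), round(q75, 3), round(prior_mean, 3))  # 0.134 0.5 0.333
```

The quartiles of about 13 and 50 percent match the interquartile range reported for the prior.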

Rater Bias
The figure shows that global raters (those who rate all cases) have substantially lower uncertainties in their biases, as their behavior is extracted from many more observations. On the other hand, raters such as MK-11, who only rated one case, show a much more uncertain bias.

Sensitivity Analysis
A sensitivity analysis of the role of priors is shown in Figure 6. The figure compares the model with the fixed prior against another model with hyper-priors, which lets the model decide whether the rates of the cases in the "Economic-Pragmatic" axis start from a higher or lower probability. If hyper-priors are used the model tends, not surprisingly, to shrink all rates toward the mean of the axis. This is a feature that may be useful for explanatory models, but in this case, where the objective is to measure the [...]
Compared to a model-based approach that takes into consideration rater bias and the restriction that configurations of values add up to one, the plain means tend to produce higher values for the Economic-Pragmatic axis of values and lower values for the Ethical-Social axis. Figure 7 shows the comparison of the cases estimated by the model (in the vertical axis) and the averages of raters (in the horizontal axis), for the three groups of raters.
Figure 8 shows a comparison of the rates between the model-based approach weighting several raters and the single value provided by the original authors of the report. In this case, the original author tends to underestimate the rates for each of the axes when the model-based approach also gives low rates, and to overestimate the rates when the case study has a high value in that axis of values. In other words, relying only on the original author of the case study would imply underrating cases with low values and overrating cases with high values. This effect is especially strong in the Emotional-Developmental axis when original authors have rated those cases as high. The implications of this finding are important, as they highlight the fact that when an expert is not accurate, the inaccuracy is systematic, not random.
That is, the ratings of the expert are higher in the Emotional-Developmental axis very likely because of the emotional involvement with the subject being rated, as this does not happen when rating analytical dimensions. Weighting by different raters allows us to overcome the problem that experts are also human and have feelings and emotions, so even when they act as experts they are not emotion-free.
Finally, a sensitivity analysis with regard to the initial model specification. Figure 9 shows the estimated ratio ρ, the tuning parameter for expert weight over ignoramus. It shows that although experts are assigned a stronger concentration of mass around zero bias, the difference with ignoramus is not especially acute. In fact, the prior standard deviation for the biases is 0.069 for ignoramus and 0.043 for experts, making a ratio of 1.6, which means that it takes a bit more than one and a half ignoramus giving the same rate to override the judgement of one expert. In other words, every expert is worth 1.6 ignoramus in this context. Based on this, the value of expert judgement is not especially high, which stresses again the idea that fewer than two ignoramus rating a case are enough to match the information provided by a single expert.
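The expert-to-ignoramus ratio quoted above follows directly from the two prior standard deviations reported in the text. The short check below only reproduces that division; the variable names are illustrative.

```python
# Prior standard deviations of the rater bias, as reported in the text.
sd_bias_ignoramus = 0.069
sd_bias_expert = 0.043

# Ratio read in the text as the number of ignoramus one expert is worth.
expert_worth = sd_bias_ignoramus / sd_bias_expert
print(round(expert_worth, 1))  # 1.6
```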

CONFIGURATIONS OF VALUES
Another way to look at the substantive results of the configurations of values is to extract the means and generate the culture of values. The tri-axial model of MSIV proposes [...] either a focus on sensitivity (upper-left area) or on innovation (upper-right area), but the survival area is the least variable. This is not surprising, but it highlights an important aspect for managers and decision-makers: for a firm to succeed in the long run, some combination of Economic-Pragmatic and Ethical-Social values is helpful. Only afterwards can the firm seek to position itself as a leading business in either innovation or sensitivity, but not in both dimensions at the same time.

Classification
Based on the configuration of cultures in the case studies, a cluster analysis has been performed. Figure 11 shows the dendrogram of the distribution of case studies using a hierarchical clustering method with Euclidean distances between the configurations of values in the cases. The figure shows that there are clearly two different groups according to their organizational culture. The dendrogram can be interpreted as follows: two cases that are very similar will be linked by a structure that has a very small vertical distance. For instance, MS-19 and OL-23 have a tie with a very small vertical distance. On the contrary, a large vertical distance indicates a large difference between the two units or groups of units. One example of this is at the left-hand side of the dendrogram, where the cluster formed by SR-31, CY-3, and JB-13 is quite different from the other group of cases (formed by a cluster of two big groups). Figure 12 shows the distribution of variables that identify the cases and variables that define the clusters when the number of clusters is set to the number suggested by a model-based approach, which compares several cluster analysis models fitted by EM for parameterized Gaussian mixture models [15]. Specifying 5 clusters maximizes the Bayesian Information Criterion.
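How a dendrogram links the most similar cases first can be illustrated with a minimal sketch of the first agglomeration step: compute the Euclidean distance between every pair of configurations and pick the closest pair. The three configurations below are made-up numbers, not estimates from the paper, and `closest_pair` is a hypothetical helper, not the clustering routine actually used.

```python
import math
from itertools import combinations

# Hypothetical configurations, in the order
# (Economic-Pragmatic, Emotional-Developmental, Ethical-Social).
configurations = {
    "case A": (0.50, 0.20, 0.30),
    "case B": (0.48, 0.22, 0.30),
    "case C": (0.20, 0.60, 0.20),
}


def closest_pair(configs):
    """Return the two cases with the smallest Euclidean distance between them."""
    return min(
        combinations(sorted(configs), 2),
        key=lambda pair: math.dist(configs[pair[0]], configs[pair[1]]),
    )


print(closest_pair(configurations))
```

In a full hierarchical clustering this pair would be merged first, at a height equal to its distance, and the procedure would then be repeated on the remaining units.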
According to the distribution of values and characteristics, the 5 clusters in which case studies are organized naturally are defined by the following:
• #1: High survival, mean sensitivity, and mean innovation, with no specific preference for a concrete domain or type of firm. Most of the cases belong to this cluster.
• #2: Low or mean survival, mean to high sensitivity, and mean innovation, mostly in the food and living domains but no preference for type of firm.
• #3: Few cases, with mean survival, low sensitivity, and very high innovation, either in the energy or mobility domain, and a slight preference toward entrepreneurship.
• #4: Similar to #1, but with low innovation; mostly energy or food cases, with no mobility, and mostly on entrepreneurship.
• #5: Similar to #3, but instead of low sensitivity and high innovation this one has very high sensitivity and very low innovation, with cases on the food and living domain and a slight inclination toward entrepreneurship.

CONCLUSIONS
Assessing which values are more present in a specific institution (in this case, firms and entrepreneurial projects that contain innovations addressed toward sustainability) is a challenging measurement problem that involves dealing with expert rating, subjectivity and cross-comparison. The proposed model-based approach takes into account several of those challenges by providing an overt and systematic set of assumptions, namely: by nature values are not obvious but latent, so a measurement model is better suited; rater bias may be modeled differently for experts than for individuals with no previous knowledge; it is possible for the researcher to tune the expert-to-ignoramus weight to accommodate the fact that raters do not always rate something that they are very familiar with. The concrete implementation of the model has led to a result that produces lower values for Economic-Pragmatic values and higher values for Ethical-Social ones, when compared to the plain mean of the raters. Also, it tends to give lower weight to extreme values given by the experts when those are not aligned with the ignoramus raters, especially in the Emotional-Developmental axis. The final configuration of cultures in each of the analyzed firms leads to five different clusters of configurations, of which the most persistent is the survival culture. In addition to this, some firms have very high values on innovation or on sensitivity, but those features do not go together, and most of those cases are present in entrepreneurship projects, not in big firms.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and has approved it for publication.

FUNDING
This project has received funding from the European Union's Seventh Framework Programme for research, technological development, and demonstration under grant agreement no 613194.