Research themes in big data analytics for policymaking: Insights from a mixed‐methods systematic literature review
2021; Wiley; Language: English
10.1002/poi3.258
ISSN: 1944-2866
Authors: Arho Suominen, Arash Hajikhani
Topic(s): Data Quality and Management
Abstract: The use of big data and data analytics is slowly emerging in public policy-making, and there are calls for systematic reviews and research agendas focusing on the impacts that big data and analytics have on policy processes. This paper examines the nascent field of big data and data analytics in public policy by reviewing the literature with bibliometric and qualitative analyses. The study encompassed scientific publications gathered from SCOPUS (N = 538). Nine bibliographically coupled clusters were identified, with the three largest clusters being big data's impact on the policy cycle, data-based decision-making, and productivity. Through the qualitative coding of the literature, our study highlights the core of the discussions and proposes a research agenda for further studies.

Big data and data analytics have been seen as augmenting knowledge, ultimately leading to better decision-making. Arguments that the broad-based use of big data and data analytics will lead to the end of theory speak volumes about our expectations of these technologies' transformative power. While industry has been leading the way in testing big data and analytics, public actors have been slower to engage (Poel et al., 2018), despite an equal opportunity for big data and data analytics to augment the public policy process. Utilizing big data and data analytics has become a near necessity due to our increasing capability for creating and collecting data at an extraordinary rate. The terms "big data" and "data analytics" have been among the buzzwords of recent years, leading to an upsurge in research, industry, and government applications (Zhou et al., 2014). The increased interest in big data in public policy can be seen in the scientific literature (Figure 1), which highlights the increase in literature related to big data and analytics. We also see public organizations increasingly engaging with big data analytics to solve challenges like the sustainability crisis and pandemics.1 Scholarly discourse has highlighted case studies and narratives on implementing big data and data analytics in the policy process. However, the literature lacks a systematic view of the current state of big data and data analytics in public policy, and there are identifiable research gaps (Desouza & Jacob, 2017).

RQ1.
What are the thematic communities of big data and data analytics literature concerning public policy-making?

RQ2. What are the research questions emerging under each of the thematic research communities?

Our study adopted a mixed-method systematic literature review approach, based on a robust empirical bibliometric analysis followed by a qualitative analysis of the core documents, to answer these questions. Using a well-established bibliometric method, bibliographic coupling, we identified thematic differences within the literature, and here we highlight points of departure from the extant literature. The bibliometric analysis was, in turn, used as a basis for the qualitative analysis of the core literature, which is used to propose a research agenda. We find nine contemporary research communities addressing different aspects of big data and data analytics in public policy. While these communities have significant overlap, our analysis identifies them as drawing from different theoretical foundations. Moreover, we demonstrate three larger research strands taking different vantage points, namely building strategic capability, data-based decision-making, and productivity increases. Finally, our work proposes a research agenda focusing on the role of strategic capability and data-based decision-making, on how to address expectations for better services while simultaneously increasing productivity, and on how to leverage policy analytics and empiricism.

Our results offer scholars in public policy a vantage point on the theoretical foundations of research in big data and data analytics in public policy-making. We also draw from the identified communities to highlight emerging research themes that can guide research forward. For policymakers, our results highlight the ongoing scholarly debate that focuses on addressing critical issues in the adoption of big data in public policy-making, namely capability building and the extent of data-based decision-making. This article will proceed as follows: next, we review the central elements of big data in policy-making. This is followed by a description of the data and our mixed-method approach. Finally, the empirical results are described, followed by a discussion to make sense of the research themes emerging from the analysis.

"Big data" is a general term used for the process of gathering massive amounts of data from different sources. Sources can include human-input data but also include data from sensors or different types of monitoring systems that create process data while running. It is clear that we are accumulating data at a never-before-seen rate. Already in 2014 the pace was staggering, with 90% of the world's data having been collected during the prior 2 years and 2.5 quintillion bytes of data added each day (Kim et al., 2014). Having access to massive amounts of data has enabled significant innovation in both the public and private domains. Looking at companies like Google and Amazon, with their innovation of new services for consumers, or at the recent ability of doctors to detect cancer cells more precisely thanks to massive training data on what a cancerous cell is, we can see that we are very much on the cusp of creating a broad utility of big data and analytics. This has been seen as a shift of the Industrial Revolution's magnitude (Richards & King, 2014) and has been widely hyped in business (Margetts & Sutcliffe, 2013).
That said, public policy is not at the forefront of the use of big data and data analytics in decision-making (Kaski et al., 2019; Poel et al., 2018). This nonadoption is due to multiple factors limiting these technologies' utility (Malomo & Sena, 2017). The ever-increasing amount of data offers possibilities for discovering new relationships and drawing inferences about a multitude of problems. However, this comes with new challenges involving reproducibility, complexity, security, and risks to privacy, as well as a need for new technology and human skills. This is very much the case in public policy, where we need to clearly identify where big data can add value in an ethical and trustworthy manner.

In a review, Giest (2017) highlighted three underlying factors to consider. First, institutional capacities have a significant role in the use of big data in public policy, producing solutions that can enable users to easily interact with data while also taking into account the siloed data structures in the public domain. However, we know from previous research that siloed structures are an important limiting factor for the utilization of big data in public policy (Malomo & Sena, 2017). Second, hand-in-hand with big data comes the broader digitalization of public services. Digitalization provides mediums for interacting with big data but also enables the creation of new data. There is, however, evidence that digitalization changes the interactions between citizens and public officials and requires new skills from both parties. Third, big data information will have an impact on the policy cycle. Studies have found that there has been limited progress in taking advantage of big data and analytics (Poel et al., 2018) because it requires a significant change in the policy cycle (Höchtl et al., 2016). Giest (2017) highlights two issues, the substantive role and the procedural role of big data in policy instruments. Procedural activities focus on regulatory activities, such as enabling open data, while substantive actions relate to collecting data for enhancing, for example, evidence-based policy-making.

Capacities, digitalization, and the role of big data in the (substantive and procedural) policy cycle are core to digital-era governance and evidence-based policy-making. Here, it is important to note that policy-makers are not a homogeneous group, and policy cycles vary. Thus, the objectives of analytics throughout the policy cycle vary significantly (Daniell et al., 2016), whether or not we approach the policy cycle as separate, discrete stages (Jann & Wegrich, 2007), and it has been shown that big data analytics is used more in some policy stages than in others, notably improving government transparency, policy evaluation, foresight, and agenda setting (Poel et al., 2018). This should be reflected against findings that data analytics have been politically significant in all policy cycle stages (Van der Voort et al., 2019).

To overcome these challenges, Poel et al. (2015) highlighted multiple topics that must be addressed to enable capacity building, digitalization, and data integration into the policy cycle. These are (1) a skills gap, (2) reduced transparency due to data analytics, (3) sources and tools, (4) standardization of methods and tools, (5) linking of policy experiments with impact assessments, and (6) enabling policy-makers to be informed about the tools that are developed and piloted. The highlighted themes give context to the issue of big data in policy.
While we see significant impacts being created by the use of big data in policy-making, along with the subsequent adaptation of data analytics, we need to better explain and make transparent the utility and complementarity of big data-driven analyses for the policy cycle (Vydra & Klievink, 2019). The challenge highlighted by Poel et al. (2015) and Giest (2017) is also reflected in Pencheva et al. (2020) and Ingram (2019). Both note that big data in public policy has focused more on the "techno-rational factors," dismissing the importance of interaction with the policy process. We know that technology adoption depends on the perceived usefulness to the user (Venkatesh & Davis, 2000) and that there is scepticism towards the use of big data and data analytics in public policy-making (Guenduez et al., 2020). This can be the result of a mismatch between practice and expectations. Durrant et al. (2018) show that there is an aspirational motivation for the use of big data and data analytics that is not reflected in their everyday utility.

While we know that data-driven approaches hold significant potential for anticipatory governance, interaction among stakeholders is the key to drawing value from big data and analytics (Maffei et al., 2020; Starke & Lünich, 2020). Increased stakeholder involvement, however, does not protect against big data and data analytics in policy-making creating hard-to-detect inequalities (Giest & Samuels, 2020; van Veenstra et al., in press). Moreover, engaging with a large pool of stakeholders in the public policy-making process will increase the complexities of adopting big data and data analytics (Janssen et al., 2017). In addition to stakeholder interaction, the ability to build capacity and also evaluate deficiencies is important (Okuyucu & Yavuz, 2020). Building capacity should not be seen merely as technical capacity, although that is also important (Poel et al., 2018), but as a more holistic capability to integrate big data and analytics into the policy cycle (Höchtl et al., 2016). Policy-making organizations, even when exceptionally technically capable, can be in a situation where the benefits from big data and data analytics remain small due to "applications" not fitting "their organizations and main statutory tasks" (Klievink et al., 2017). This is to say that when we talk about big data and data analytics capabilities in public policy-making, the literature focuses not only on technical issues but also on the ability of big data and data analytics to produce policy-relevant applications.

While we see an increasing and diverse set of research addressing different challenges of big data and data analytics, the current body of literature lacks holistic research agendas (Desouza & Jacob, 2017) addressing the issues highlighted from practice by Giest (2017) and Poel et al. (2015). While we can note emerging fields such as policy analytics (De Marchi et al., 2016; Tsoukias et al., 2013), there is a need to better understand the theoretical grounding and research gaps of big data and data analytics in public policy-making.

This study's methodological approach was based on a mixed method of quantitative and qualitative analyses of the bibliometric data and the publications' content. The selected four-step mixed-methods approach, described below, enables a holistic view of the current state of the art and allows us to propose an agenda for going forward.
The first phase focused on retrieving the sample of relevant articles and their bibliometric data for analysis. The second phase involved the bibliometric analysis of the retrieved data, which was performed by analysing descriptive statistics, bibliographical coupling, network analytics, and community detection. By gaining a comprehensive view of the more extensive body of literature, we could implement a further filtering step based on eigenvector centrality to obtain a shortlist of papers for the next phase. The third phase, qualitative analysis, continued the process with an in-depth review and coding of the articles' full text. Finally, in the fourth phase, synthesis, we drew insights from the MAXQDA coding for analysis and reporting. This four-phase process is shown in Figure 2.

The data used in this study were retrieved from the Scopus database. Scopus is Elsevier's abstract and citation database, which has over 1.7 billion cited references dating back to 1970. A central aspect of the quality of the results is that the query used to search for relevant articles is correctly designed. The study focuses on public policy-making and big data and data analytics. The study's scope is relatively narrow, focusing solely on publications that address policy-making and the policy process. This scoping decision excludes articles that focus on, for example, big data or data analytics but lack the specific aspect of policy-making. This was the key criterion for including articles in the data set. To focus on this specific scope, we used an iterative approach in which multiple search strings were tested, and after each search, the abstracts of the 10 most cited articles and the 10 most recent articles were reviewed to assess whether the query results reflected the objectives of the study. In practice, the process started with a seed query of "big data" or "data analytics" and "public policy." The query results were reviewed to estimate which articles focused on big data or data analytics and policy-making. These articles were reviewed to see if new terms emerged through the titles, abstracts, and keywords that needed to be included in the analysis. The process was adjusted based on a subjective evaluation of the number of false positives among the 10 most recent and 10 most cited publications and of the number of articles retrieved. This method of short-listing the important literature is known as the snowball method, and the process includes consulting the bibliographies of the key documents to find other relevant titles on the subject (Jalali & Wohlin, 2012). After multiple tests of a comprehensive query that also limited the number of false positives, we downloaded the metadata for 538 documents. These documents were retrieved using the query "public policy," "policy analysis," "policy making," or "public administration," with the terms "big data," "data analytics," or "automated decision-making" in the title, abstract, or keywords of the document (a sketch of one possible form of this query is given below).

To analyze the literature, we used the well-established bibliometric method of bibliographical coupling. Bibliographical coupling allows for analysis of the publications' shared intellectual background (Kessler, 1963), highlighting contemporary research (Youtie et al., 2013). It is an approach to analyzing the shared theoretical background of scientific publications in which the link between documents is calculated from the number of references the two documents share.
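As an illustration, the final search described above might be rendered in Scopus advanced-search syntax roughly as in the following sketch. This is a reconstruction from the terms reported in the text, held in a Python string for convenience; it is not the authors' verbatim query.

```python
# A reconstruction (not the authors' verbatim string) of the reported Scopus search:
# policy-related terms combined with big data / analytics terms, restricted to
# title, abstract, and keywords via the TITLE-ABS-KEY field code.
scopus_query = (
    'TITLE-ABS-KEY('
    '("public policy" OR "policy analysis" OR "policy making" OR "public administration") '
    'AND '
    '("big data" OR "data analytics" OR "automated decision-making")'
    ')'
)
print(scopus_query)
```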
Kessler (1963) elaborates, "A single item of reference shared by two documents is defined as a unit of coupling between them," and if multiple references are shared, the weight of the coupling increases. Bibliographical coupling is able to highlight hot topics (Glanzel & Czerwon, 1996) and links documents with a similar research focus (Jarneving, 2007), ultimately creating a "contemporaneous representation of knowledge" (Youtie et al., 2013). This approach has been used in several research papers to form the basis for research agenda building (Suominen et al., 2019; Yuan et al., 2015).

Using the retrieved publication metadata, the VOSviewer tool (van Eck & Waltman, 2009) was selected to calculate bibliographical coupling weights for all the documents in our data set. VOSviewer is a free tool used for bibliometrics and was selected due to its export options, which allow for deeper network analysis in Gephi. The SCOPUS data export was used as input to VOSviewer. During the analysis process, we selected documents as the unit of analysis, the minimum number of citations for a document was set to zero, and the full set was selected for the analysis. The full counting method, which assigns each researcher full credit for one publication rather than a fractional share based on the number of authors, was used. Finally, we accepted the VOSviewer default to keep the largest connected set of items, which limited the analysis to the largest subgraph (by node count) created by the bibliographical coupling analysis. This limited the analysis to 332 documents.

Bibliographical coupling analysis of a data set creates a graph G = (V, E), formed by V, a set of nodes, and E, a set of edges joining the nodes. As we calculated the link weight between each pair of publication nodes, we created a simple undirected graph. This graph data, produced with the VOSviewer tool, was imported into Gephi because it allows for more detailed visualization, network measure calculation, and community detection. Further network analyses, including descriptive network measures such as degree, were calculated for the graph G in Gephi. Communities were identified using the fast unfolding algorithm of Blondel et al. (2008). The fast unfolding algorithm is one of the most computationally efficient methods for finding high-modularity partitions of networks in a short time. The method starts by assigning each node to a separate community and thereafter calculates the gain in modularity from merging neighboring nodes into a community. This process is continued through the network, ultimately creating a new network with nodes assigned to communities (see Blondel et al., 2008, for a detailed explanation). In the Gephi software, the modularity algorithm can be tuned by a resolution variable that controls the number of communities the algorithm creates. This variable was changed to limit the number of tiny clusters. We increased the resolution value until even the smallest community had an approximately one percent share of the documents.

We also calculated the eigenvector centrality for each document in the data set. In graph theory, eigenvector centrality measures the influence a node has in G. A value is calculated for all nodes based on the idea that connections to important nodes contribute more to a node's score than connections to low-scoring nodes. This centrality value describes a publication's relevance to the overall network created by the bibliographical coupling analysis.
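To make the network analysis steps concrete, the following minimal sketch walks through the same pipeline (bibliographic coupling weights, restriction to the largest connected component, resolution-tuned community detection in the spirit of Blondel et al. (2008), and eigenvector centrality, including the selection of the five most central documents per community described below) using the Python networkx library instead of VOSviewer and Gephi. The input structure and document identifiers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the bibliometric network workflow described in the text,
# using networkx in place of VOSviewer/Gephi. Input data are hypothetical.
from itertools import combinations
import networkx as nx

# Hypothetical input: each document id mapped to the set of references it cites.
docs = {
    "doc1": {"refA", "refB", "refC"},
    "doc2": {"refB", "refC", "refD"},
    "doc3": {"refE"},
}

# Bibliographic coupling: an undirected edge whose weight is the number of
# references two documents share (Kessler, 1963).
G = nx.Graph()
G.add_nodes_from(docs)
for d1, d2 in combinations(docs, 2):
    shared = len(docs[d1] & docs[d2])
    if shared > 0:
        G.add_edge(d1, d2, weight=shared)

# Keep only the largest connected component (the "largest connected set of items").
largest = max(nx.connected_components(G), key=len)
G = G.subgraph(largest).copy()

# Louvain-style community detection with a tunable resolution parameter.
communities = nx.community.louvain_communities(
    G, weight="weight", resolution=1.4, seed=42
)

# Eigenvector centrality, then the five most central documents per community.
centrality = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)
for i, community in enumerate(communities):
    top5 = sorted(community, key=centrality.get, reverse=True)[:5]
    print(f"Community {i}: {top5}")
```

Note that Gephi's modularity routine and networkx's Louvain implementation may partition a network slightly differently; the sketch only illustrates the logic of the steps.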
Filtering by community, we identified the five most eigenvector-central publications from each community to be selected for systematic coding. Selecting the five most central publications kept the final sample of documents in the coding phase manageable, and selecting the most eigenvector-central publications was a way to take the most important publications from each community into the deeper analysis.

The framework comprises the following elements: research subthemes, primary variables, scope of study, context of study, conceptual or empirical study, theory(ies) employed, and key findings. The authors offer the framework as a template to be customized to the study's objective at hand. For the current study, we adopted the majority of the elements in the framework, creating an eight-point approach including (1) the context of the study, (2) primary variables, and (3) the scope, defined as a research gap. We merged "conceptual or empirical study" and "theories employed" into (4) a code for the theoretical or methodological framework. We also included a separate code for (5) the method used, to understand the specific approach used in the study. For the key findings, we also created codes: the first focused specifically on any concrete (6) results mentioned, and the second focused on essential (7) discussion and (8) conclusions. This was explicitly done to highlight other study results and to inform the content analysis of the scholarly debate around the research themes.

To add further rigor to the coding process, the qualitative analysis was conducted using MAXQDA software. The software allows for coding documents, directly annotating the documents with the codes, and thereafter drawing syntheses from the created codings. The use of the software increases the transparency and trustworthiness of the analysis (Costa et al., 2017; Sinkovics & Alfoldi, 2012). After the coding schema was created, the practical coding process was completed by one researcher, who read the five eigenvector-central full-text documents from each cluster and annotated the documents based on the framework. The second researcher had the role of validating the annotation results. The interaction between the researchers served to make sure that all of the items in the coding schema were identified. The second researcher went through the papers and codings made by the first researcher to make sure that each schema item, if available, was identified. However, as publications can repeat the same information, our approach was designed to ensure that we did not capture the same information multiple times per publication. This choice of approach made using, for example, intercoder agreement impractical.

The MAXQDA analysis software used in the coding automatically created cross-tabulations of the coded documents and created a synthesis document about the coded sections of text. These were used in the interpretation phase. The synthesis of coded sections provided by MAXQDA was used to interpret the results. The two researchers independently familiarized themselves with the MAXQDA summaries of the communities. After this independent review of the synthesis, the researchers discussed their findings. Thereafter, the authors worked jointly to draw insights from the coding results, working toward a synthesis of the core areas of future research.

The retrieved publications are recent, with the first publication in the data set published in 2009, as seen in Figure 3.
The data were retrieved in early June 2020; thus, publications for 2020 represent only the first 5 months, and one should expect a growing trend in publication volume. While Figure 3 shows growth, it is important to put this into context. Figure 1 plots the frequency of big data and data analytics topics in the public policy-related literature against the overall public policy literature for the same period. The publication volumes are plotted on a log10 scale so that they can be illustrated in one view. It is clear that the body of literature focused on big data and data analytics in public policy is experiencing a much sharper increase in interest than the overall public policy body of literature.

Over half of the publications are from computer science, social sciences, or engineering. As seen in Table 1, these three disciplinary areas each have over 100 publications. Table 1 highlights all disciplinary areas with over 20 publications in the data set. Notably, the smaller areas are case-study driven, highlighting the use of big data and data analytics in, for example, health care, environmental issues, and energy. The descriptive analyses of the data also highlight the different journals that are attracting manuscripts on the topic. As shown in Table 2, the major publication sources for the articles included in the data set are mostly computer and information science journals. When analysing Table 2, we should note our scope: the study focused solely on big data or data analytics and their use in public policy-making. In the listed journals, the number of articles focusing on any single area, for example, big data, is much higher. We should also note that the search in SCOPUS only covers the title, abstract, and keywords, so articles that discuss big data or data analytics and policy-making only in the full text are not discovered by the query. It is notable that the publication sources with at least five publications have, altogether, 105 publications, or approximately 19.5% of all publications. This implies that publications are scattered over many different publication sources, and an ongoing debate on the subject is hard to pin down to a specific outlet.

The division of the publications by country aligns with that of global scientific publication production, except for China, which, with 60 publications, lags significantly behind the United States, with 141 publications.2 The two largest science producers are followed by the United Kingdom (55 publications), Italy (37 publications), the Netherlands (29 publications), and India and South Korea (both with 28 publications). Looking at the affiliations of the publications, they are, again, scattered across different organizations. As seen in Table 3, only Delft University of Technology has a significant track record of publications in the data set, and from the third-highest publication count onward, there were only five publications per institution.

To understand the sample in more depth, we concatenated the titles and keywords of the publications in the sample. We then cleaned the concatenated title and keyword fields by removing punctuation and English stopwords.3 As seen in Figure 4, the terms extracted are as expected, focusing on big data and public policy. More thematic terms among the most frequent relate to management, health, cities, and media, giving insight into the articles' context.
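For illustration, the term-frequency preprocessing described above could be sketched as follows. The tokenization and the stopword list used here (scikit-learn's built-in English list) are assumptions for the sketch; the paper's own stopword list is specified in its footnote and is not reproduced here.

```python
# A minimal sketch (not the authors' code) of the term-frequency step behind
# Figure 4: concatenate titles and keywords, strip punctuation, drop English
# stopwords, and count the remaining terms.
import string
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical records: one (title, keywords) pair per publication.
records = [
    ("Big data for public policy", "big data; policy cycle"),
    ("Data analytics in government decision-making", "analytics; government"),
]

counts = Counter()
for title, keywords in records:
    text = f"{title} {keywords}".lower()
    # Remove punctuation, then drop stopwords before counting.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    counts.update(tokens)

print(counts.most_common(10))
```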
Overall, the descriptive analyses highlight that the data set contains only recent publications, divided among multiple sources and institutions, with relatively low volumes produced by each. The terms in the publications aligned with the search query.

The 538 publications' bibliometric data were analyzed for bibliographic coupling using the VOSviewer software. During the calculation process, the software first analyzed whether any of the publications were detached and isolated from the overall network emerging from the data. VOSviewer calculates the largest connected subnetwork in the full network and offers an option to use only the largest connected set. Selecting the largest connected set allows focusing on the core documents and removes the isolates. In our data set, the largest set of connected items forming a network comprised 332 documents. The outlier documents were dropped from the final sample, as these articles would not be included in the bibliographic coupling-based clusters but would, rather, remain isolates throughout the analysis.

Continuing with the cohesive sample of 332 articles, the VOSviewer-created network was imported into Gephi for further analysis. Network metrics were calculated, and the bibliographic coupling network had a weighted degree of 22.56, with a network density of 0.048. This is to say that there are significant linkages between nodes (documents) within the communities, as the weighted degree is high, but the full network is not strongly linked, as the density is low. Next, the communities were detected using the modularity algorithm (Blondel et al., 2008). The modularity algorithm's resolution was increased from its default value of one until the smallest community had an approximately 1% share of the documents. With a resolution of 1.4, the algorithm resulted in nine communities, with the smallest community holding only 0.9% of the publications. The largest community had 32.23% of the publications. A visual representation of the created graph can be seen in Figure 5.4 In this figure, the color shows the clusters created using the modularity algorithm. The size of a node reflects the number of citations of an article in the network. As the graph highlights, the network was essentially created by two large communities visuali