Article · Open access · Peer-reviewed

How does fake news spread? Understanding pathways of disinformation spread through APIs

2021; Wiley; Language: English

10.1002/poi3.268

ISSN

1944-2866

Authors

Lynnette Hui Xian Ng, Araz Taeihagh

Topic(s)

Advanced Malware Detection Techniques

Abstract

What are the pathways for spreading disinformation on social media platforms? This article addresses this question by collecting, categorising, and situating an extensive body of research on how application programming interfaces (APIs) provided by social media platforms facilitate the spread of disinformation. We first examine the landscape of official social media APIs, then perform quantitative research on the open-source code repositories GitHub and GitLab to understand the usage patterns of these APIs. By inspecting the code repositories, we classify developers' usage of the APIs as official or unofficial, and further develop a four-stage framework characterising pathways for spreading disinformation on social media platforms. We then highlight how the stages in the framework were activated during the 2016 US Presidential Elections, before providing policy recommendations on access to APIs, algorithmic content, and advertisements, and suggesting rapid responses to coordinated campaigns, the development of collaborative and participatory approaches, and government stewardship in the regulation of social media platforms.

"Fake news" is commonly used to refer to news that is false and that could mislead readers or viewers. Under this umbrella term, there are three common categories: "disinformation," "misinformation," and "malinformation" (Shu et al., 2020), distinguished by their intent. "Disinformation" intends to deceive; common techniques therefore involve targeting profiles and fabricating content. "Misinformation" carries no malicious intent; examples include urban legends and false information spread without ill intent. "Malinformation" intends to inflict personal harm; hate speech and harassment fall under this category. In this article, we examine "disinformation," which has the goal of deceiving people. We study the use of code repositories to access platforms through APIs for spreading disinformation and examine how actors with malicious intent can utilise the platforms for their purposes. We use our findings to inform platforms and governments about the pathways of disinformation spread and how to address the issues identified.

We consider the following goals of an actor intent on spreading disinformation in a network: (1) dissemination of a message across the network; (2) amplification of information on desired topics; and (3) planting or altering the views of large groups of users. To achieve these goals, an "actor" operating tools and technologies to spread disinformation on social media needs to: (1) join an organic network of users, so that the actor's message reaches real users; and (2) hide their traces to avoid detection and suspicion, which would reduce the effectiveness of the message.

Fake news on social media platforms has become a contentious public issue, as these platforms offer third parties various digital tools and strategies to spread disinformation in pursuit of self-serving economic and political interests, distorting and polarising public opinion. We study disinformation campaigns in the context of social media platforms.
While social media platforms are revenue-generating businesses that promote user account and content creation, they have inevitably also enabled malicious actors to spread fake news. To run a successful campaign, an actor must perform a series of actions on the platform, some of which depend on others. A sequential combination of these actions characterises a pathway. This study of the pathways for spreading disinformation seeks to identify specific methods and stages of spreading disinformation. This will facilitate the identification of new procedures to ensure the reliability and accuracy of disseminated information and significantly increase the transparency of artificial intelligence (AI)-driven data collection and algorithmic mechanisms in scenarios such as online content recommendation. The study also profiles the risks and threats of AI-curated and AI-generated content, such as the Generative Pre-trained Transformer 3 (GPT-3) (Brown et al., 2020) and generative adversarial networks (GANs) (Goodfellow et al., 2014). While revealing the ethical issues involved in curating and delivering online content, the study will help develop policy and legal responses to tackle online disinformation.

In the following sections, we define multiple pathways of disinformation within a framework. Application programming interfaces (APIs) can be used to obtain data from the platforms or to inject information into the platforms.1 We categorise social media APIs and seek to understand how APIs facilitate disinformation. We review the platforms most investigated in connection with disinformation campaigns. We then collect further information from open-source code repositories on GitHub and GitLab to understand how developers use APIs, and develop a framework describing how an actor may spread disinformation on social media platforms. Finally, we investigate a case study applying the framework to the 2016 US Presidential Elections, before providing recommendations for platforms and governments on access to APIs, algorithmic content, and advertisements, and suggesting rapid responses to coordinated campaigns, the development of collaborative and participatory approaches, and government stewardship in the regulation of social media platforms.

Appendix A1 of the Supporting Information Materials, along with Tables S1 and S2, details the literature review methodology. Our examination reveals that the existing body of work remains segmented in its focus on particular technologies used to spread disinformation on platforms (e.g., bots, APIs, tweet content). The studies focusing on bots analyse them predominantly in the context of spreading political disinformation. Empirical studies that analyse data from code repositories are primarily concerned with assessing the success of already executed disinformation campaigns, for example by analysing the effects and patterns of information propagation in response to a limited set of actions, and they focus on a single platform (e.g., Twitter) (Kollanyi, 2016; Shao et al., 2018; Zhou & Zafarani, 2018). While these studies are useful for understanding how different technologies operate on digital platforms and how they affect the spread of disinformation, there is a lack of integration of these methods that reflects how developers and platforms spread disinformation using a combination of tools and the different ways that APIs facilitate these processes.
To the best of our knowledge, no study has attempted to characterise the different actions deployed through APIs on different platforms to spread disinformation. An emerging body of research examines disinformation and the tools facilitating its spread on digital platforms. These include studies that provide frameworks of the types of fake news being spread (Jr et al., 2018; Machado et al., 2019) and in-depth analyses of the construction of messages and the credibility of their creators (Zhou & Zafarani, 2018). Many studies broadly review the various channels (both digital and nondigital) used to spread disinformation, for example examining how countries utilise television, social media, internet trolls, and bot networks to conduct political disinformation campaigns targeted at other states (Moore, 2019), and the role of algorithmic recommender systems in influencing user engagement and disinformation spread on social media (Valdez, 2019). Other works have examined the digital technologies available in the entire ecosystem of services, including both platforms and other service providers, that enable targeted political campaigns on a massive scale; in particular, they analyse the tools used for collecting and analysing behavioural data and how digital advertising platforms profile and customise messages targeted at different audience segments (Ghosh et al., 2018).

Many scholars focus on the role of bots in spreading disinformation, with a predominant focus on Twitter, whereas few studies analyse the specific role of APIs in the spread of disinformation on other platforms. Studies have examined the origins of bots (Kollanyi, 2016) and the pathways through which they function and the platforms they usually target (Assenmacher et al., 2020), and several have developed typologies of bots. Typologies of Twitter bots characterise their inputs and sources, outputs, algorithms, intent, and functions (Gorwa & Guilbeault, 2020; Lokot & Diakopoulos, 2016; Schlitzer, 2018), and distinguish between bots used to increase the reach of a message and those that amplify a political narrative in a certain direction (Bastos & Mercea, 2018). Other studies analyse the availability of bot code on the Dark Net and the mechanisms through which it is traded to facilitate malicious uses (Frischlich et al., 2020). These works focus on the digital infrastructure provided by APIs but do not analyse the different actions taken through these APIs to spread disinformation.

Various studies model and empirically analyse the diffusion of disinformation in response to specific actions and campaigns. For instance, Tambuscio et al. (2015), in the computing literature, developed a diffusion model of the spread of disinformation. Shao et al. (2018) analysed the messages spread by Twitter bots in response to the 2016 US presidential campaign and election, while Santini et al. (2020) conducted an empirical analysis of how Twitter bots amplify links to two Brazilian news sites, manipulating news media entities' online ratings and the perceived relevance of news.

An API is a "programming framework that implements a set of standard behaviours" (Puschmann & Ausserhofer, 2017). This article classifies APIs into two categories: (1) official APIs and (2) unofficial APIs. Official APIs involve a developer having a platform-issued developer key or an authentication secret.
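As an illustration of this official pattern, the minimal sketch below authenticates with a platform-issued credential and retrieves public posts. It uses Twitter's v2 recent-search endpoint; the query string and the environment variable name are our own illustrative choices, not part of the study's pipeline.

```python
import os
import requests

# A platform-issued credential (here, a Twitter API v2 bearer token) is the
# hallmark of official API access: the platform knows who the developer is.
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # assumed to be set

def search_recent_tweets(query: str, max_results: int = 10) -> dict:
    """Fetch public tweets matching `query` via the official v2 endpoint."""
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"query": query, "max_results": max_results},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = search_recent_tweets("elections lang:en")
    for tweet in data.get("data", []):
        print(tweet["id"], tweet["text"][:80])
```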
Social media platforms control the use of official API keys to varying degrees. With these keys, developers can gain access to two sets of data: (1) unrestricted data, that is, data that users choose to share publicly; and (2) data restricted to information about the developer's own account (see Appendix A2 and Table S3).

Unofficial APIs include APIs meant for internal purposes that are used by third parties for unintended purposes. For example, a developer can examine how an official app on a device exchanges data with the platform's remote server and attempt to mimic that communication to develop new applications (Russell, 2019). Another type of unofficial API is code that employs web scraping.2 Some scraping methods involve downloading the HyperText Markup Language (HTML) page, parsing it, and extracting elements that match specific texts before executing actions on those HTML elements (a generic sketch of this pattern appears at the end of this subsection). The ease of employing these methods in the absence of official APIs affects the number of developers who will harness the platform for their agenda.

Analysing the number and type of unofficial APIs relative to the official APIs used by developers is potentially valuable for understanding the primary channels used to spread disinformation and how usage evolves and differs across platforms. It can also provide key insights into the extent to which platform operators are aware of, or should be accountable for, how developers take advantage of these APIs to spread disinformation. We do this by examining social media APIs and the types of actions registered users can perform through them; we also analysed the API documentation pages and the literature documenting discontinued or undocumented APIs.

To understand the potential uses of social media APIs, we look to open-source code repositories. In this study, our primary data sources are the public-access code repositories GitHub and GitLab. We investigate only open-source code platforms where the code is readily available and that facilitate open collaboration and retrieval of code via search terms; we thus miss code in private repositories and on the Dark Web. GitHub is the largest online repository of shared computer code, with over 50 million users and 100 million repositories (Github, 2020b), and is the fastest-growing open-source platform (Kollanyi, 2016). While most GitHub projects are developed for benign purposes such as timely retrieval of updates or auto-liking a close friend's posts, the public accessibility of the repositories enables parties with malicious intent to adapt the available code and easily construct their own versions. Furthermore, GitLab has an integrated continuous integration/continuous deployment (CI/CD) pipeline, enabling developed code to be deployed to production instantly; if the code implements a bot, the bot can begin performing its programmed tasks as soon as the code passes through the pipeline.

The GitHub/GitLab code repositories were sampled using the GitHub/GitLab developer search APIs to identify codebases that perform tasks related to the constructed pipeline. This sampling was done by searching each platform name together with the word "bot." We then used the user search API to obtain a user item for each repository. This user item contains the user's profile, such as the number of followers, the number of users followed, the number of repositories, and the location (Github, 2020a). We extract the user's declared location from the user item.
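The scraping pattern described above can be illustrated with a deliberately generic, read-only sketch; the URL and the CSS class name below are placeholders rather than any real platform's markup.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical public profile page; "social.example.com" and the class
# name "post-body" are illustrative placeholders.
URL = "https://social.example.com/some_user"

html = requests.get(URL, timeout=30).text
page = BeautifulSoup(html, "html.parser")

# Extract every post element whose text mentions a target keyword,
# mirroring the "match specific texts" step described above.
for post in page.find_all("div", class_="post-body"):
    if "election" in post.get_text().lower():
        print(post.get_text(strip=True))
```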
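The repository sampling step can be approximated with GitHub's public REST endpoints, as in the sketch below. The query "twitter bot" mirrors the platform-name-plus-"bot" strategy described above; the small page size is for illustration, and real collection would page through results and respect rate limits.

```python
import requests

API = "https://api.github.com"

# Search repositories whose name/description mention a platform plus "bot".
repos = requests.get(
    f"{API}/search/repositories",
    params={"q": "twitter bot", "per_page": 5},
    timeout=30,
).json()

for repo in repos.get("items", []):
    owner = repo["owner"]["login"]
    # The user item exposes the profile fields used in the study,
    # including the free-text, self-declared location.
    user = requests.get(f"{API}/users/{owner}", timeout=30).json()
    print(repo["full_name"], "|", user.get("location"))
```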
To map the extracted location to a country, we queried OpenStreetMap using its Nominatim search engine API, which searches through a list of addresses to find the closest matching address and its country. OpenStreetMap is a community-driven map built by enthusiast mappers and geographical professionals using aerial imagery and GPS devices (OpenStreetMap Contributors, 2017). It contains detailed street data of places around the world, down to street-side cafes, that enthusiasts have manually entered.

Over the month of May 2020, we collected 69,372 code repositories, of which over 40,000 pushed content to social media platforms. Most code repositories cater to Telegram, followed by Twitter, then Facebook and Reddit. The distribution of code repositories is shown in Figure S1. Appendix A3 in the Supporting Information Material provides details of the data collected (distribution of code repositories across social media platforms, word cloud of repository descriptions, origin of the repositories collected, and distribution of programming languages used in API repositories, in Figures S2–S5, respectively). The number of repositories created per month increased exponentially from 2014 to 2018 before decreasing from 2018 to 2020. To contextualise this observation, we manually searched social media API documentation for API changes: sharp changes in the number of code repositories can be attributed to API changes on the social media platforms (see Appendix A4 and Figure S6a,b).

To understand how these code repositories perform their tasks, we sought to determine whether they use official or unofficial APIs; from this, we can infer how much the official APIs provided by the platforms facilitate the spread of disinformation. We first formulated a list of keywords relating to official and unofficial methods of accessing platform data and functionality. The keywords for official methods were collected through manual inspection of each social media platform's API documentation; initial analysis of repository content showed that repositories using official APIs typically contain an authentication string with the same name as stated in the platform's API documentation. Unofficial methods of accessing platform data and functionality typically comprise web scraping methods, many of which were originally developed for web user interface testing (e.g., Selenium); we examined the code content of several repositories to profile unofficial methods of performing actions on platforms.

To understand the distribution of methods used by repositories to perform specific actions on platforms, we queried the GitHub/GitLab code hosting sites, searching within code content for the keywords through the blob search functionality.3 Table 1 lists a sample of the specific content identifiers used with the blob search functionality to indicate the usage of official or unofficial APIs; for the complete table, refer to Table S4. After obtaining the code repositories through the blob search, we systematically deduplicated the results, as some repositories mention a content identifier more than once and would otherwise be counted multiple times. We then queried the code hosting sites to identify the countries of the users who created the repositories.
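A condensed sketch of this classification step is shown below, assuming GitHub's code ("blob") search endpoint, which requires an authenticated request and is subject to GitHub's search restrictions and rate limits. The two content identifiers are illustrative stand-ins for the fuller list in Table S4, and repositories are deduplicated by their full name.

```python
import os
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

# Illustrative content identifiers: an official-API credential name taken
# from platform documentation versus a browser-automation framework import.
IDENTIFIERS = {
    "official": "consumer_key",  # e.g., a Twitter OAuth credential field
    "unofficial": "from selenium import webdriver",
}

for api_type, identifier in IDENTIFIERS.items():
    hits = requests.get(
        f"{API}/search/code",
        headers=HEADERS,
        params={"q": f'"{identifier}" twitter bot'},
        timeout=30,
    ).json()
    # Deduplicate: a repository mentioning the identifier in several
    # files should be counted once.
    unique_repos = {hit["repository"]["full_name"] for hit in hits.get("items", [])}
    print(api_type, len(unique_repos))
```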
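The final country-mapping step can be sketched against Nominatim's public search endpoint, as below; the User-Agent string is an arbitrary identifier required by Nominatim's usage policy, and free-text locations that cannot be geocoded are simply left unmapped.

```python
import requests

def location_to_country(declared_location: str) -> str | None:
    """Map a free-text, self-declared location to a country via Nominatim."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": declared_location, "format": "json",
                "addressdetails": 1, "limit": 1},
        # Nominatim's usage policy asks clients to identify themselves.
        headers={"User-Agent": "api-disinfo-study-demo"},
        timeout=30,
    )
    matches = resp.json()
    if not matches:
        return None  # unparseable or fictional location strings stay unmapped
    return matches[0].get("address", {}).get("country")

print(location_to_country("San Francisco, CA"))  # e.g., "United States"
```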
We performed the search on four main platforms—Twitter, Facebook, Instagram, and Reddit—as these platforms draw the most repositories and are the most widely used in disinformation campaigns. We then explored the characteristics of the repositories through a timeline of repository activity relative to the release or discontinuation of social media APIs.

After building our knowledge base through a literature search, and drawing on our categorisation of social media APIs and the data collected on the open-source development landscape of APIs, we developed a theoretical framework of pathways that can be used to spread disinformation on platforms. As highlighted in the literature review section, the studies conducted so far are segmented: they focus on particular technologies used to spread disinformation (e.g., bots, APIs, tweet content), predominantly address a specific context such as political disinformation, assess the success of already executed campaigns, or are limited to a set of actions on a single platform. While these scholarly works are useful for understanding how different technologies operate on digital platforms and how they affect the spread of disinformation, they do not integrate these methods in a way that reflects how disinformation is spread using a combination of tools and the different ways in which APIs facilitate these processes. To the best of our knowledge, the theoretical framework presented in this article is the first to characterise the different actions deployed through APIs on different platforms to spread disinformation. The framework contributes directly to understanding pathways for content distribution and content collection on social media platforms and applies across platforms.

Drawing from research in the social sciences and computer science, we identify four key stages for the spread of fake news: network creation, profiling, content generation, and information dissemination. In the rest of this section, we examine these stages in depth, identify and group the actions that can be performed in each stage, and present the relationships between the stages. We further highlight how, through these pathways, the goals of actors intent on spreading disinformation on social media platforms may be achieved: dissemination of messages across wide networks, amplification and reiteration of information on desired topics, planting or altering the views of large groups of users, and using influential users to spread the actors' own messages.

This theoretical framework is visualised in Figure 1, and a full tabular breakdown can be found in Table S5. An actor can perform a singular action on a social media platform, such as following another created account. The actions are grouped into four stages—Stage 1: network creation; Stage 2: profiling; Stage 3: content generation; and Stage 4: information dissemination. A sequential combination of these actions through the linear stages is a "pathway," which ends at a goal. An example of a pathway could be Stage 1: "Create user account"; Stage 2: "Attribute-based selection of audiences"; Stage 3: "Text generation"; and Stage 4: "Engage with users"; this pathway ends at the goal "Join a human network." The framework illustrates various pathways, which may be activated in parallel over time to reduce time delays between steps and amplify the information dissemination effect.
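For concreteness, the pathway example above can be encoded as a small data structure. This is purely an illustrative rendering of the framework, with stage and action names taken from the text rather than from any implemented system.

```python
from dataclasses import dataclass
from enum import IntEnum

# The four stages of the framework, in their linear order.
class Stage(IntEnum):
    NETWORK_CREATION = 1
    PROFILING = 2
    CONTENT_GENERATION = 3
    INFORMATION_DISSEMINATION = 4

@dataclass(frozen=True)
class Action:
    stage: Stage
    name: str

# A "pathway" is a sequential combination of actions through the stages,
# ending at a goal; this one reproduces the example in the text.
pathway = [
    Action(Stage.NETWORK_CREATION, "Create user account"),
    Action(Stage.PROFILING, "Attribute-based selection of audiences"),
    Action(Stage.CONTENT_GENERATION, "Text generation"),
    Action(Stage.INFORMATION_DISSEMINATION, "Engage with users"),
]
goal = "Join a human network"

# This example pathway moves forward through the stages (actors may in
# general revisit earlier stages, as discussed below).
stages = [action.stage for action in pathway]
assert stages == sorted(stages)
print(" -> ".join(action.name for action in pathway), "->", goal)
```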
For example, creating user accounts takes time, but profiling users during an event need not wait. Further, actors occasionally return to a previous stage or substage during the activation of a pathway once they understand their selected audiences better and decide to perform further actions to enhance information dissemination to the target audience. We distinguish two main types of accounts: (1) "pseudo-accounts," which are accounts created by actors, such as bot accounts; and (2) "user-accounts," which are accounts created by real human users. By extension, "pseudo posts" refers to posts generated by pseudo-accounts, and "user posts" refers to posts written by human users.

From the literature collected in Section 3.1, we identified the different actions associated with the use of APIs to spread disinformation. Each action was distinguished by the mechanism through which it is executed and the platform(s) on which it is executed. We then associated each action with the data gathered on API usage in open-source code repositories, to further distinguish the actions by how APIs (official and/or unofficial) are used to execute them. As an example, consider the categorisation of the action "inflate retweet counts." Manual inspection of social media APIs shows that Twitter provides a mechanism to auto-like posts, and Instagram provides an API to auto-like all posts on a particular feed. Inspection of the collected data on API usage reveals code repositories that use both official and unofficial APIs to like posts containing certain keywords on Twitter. For Instagram, we found only code repositories that use an unofficial API to auto-like and auto-follow a particular feed, likely because developers cannot pass the strict Instagram app review processes required to obtain official API access and thus must resort to unofficial means.

Next, actions exhibiting similar objectives and characteristics were grouped and characterised according to the type of action employed. For instance, identifying users based on their follower counts and identifying users whose posts are associated with particular interests (more specifically, keywords) are both categorised as "attribute-based selection of audience," where user-accounts are selected based on whether they have (or do not have) certain attributes. We conducted this examination and categorisation of action types iteratively, with reference to the large base of scholarly literature collected and discussed in the preceding sections. Lastly, we grouped the actions into four overarching categories representing the different stages of curating and spreading disinformation on platforms.

Figure 1 presents a flowchart mapping the possible pathways of disinformation dissemination on platforms that are facilitated by APIs: (1) network creation; (2) profiling; (3) content generation; and (4) information dissemination. We examined the platforms commonly used in disinformation campaigns—Twitter, Facebook, Instagram, and Reddit. The Russian Internet Research Agency (RIRA) produced around 4234 Facebook posts, 5956 Instagram posts, and 59,634 Twitter posts (Howard et al., 2019), spreading disinformation by creating false personas and imitating activist groups. Operation Secondary Infektion (Ben Nimmo et al., 2020) used Reddit, among other social media platforms, in the 2016 US elections and the 2017 German and Swedish elections.
One key insight from the construction of the flowchart is that different social media platforms control the use of official API keys to varying degrees and hence play different roles in facilitating the spread of disinformation through different pathways; Table S5 summarises all the actions in the pathways to disinformation.

In the network creation stage, actors create a network of pseudo-accounts—each with a customised profile, identity, and purpose—that will subsequently automate the execution of actions. Table 2 summarises the three main classes of actions that can be performed in this stage. To begin a pathway, the actor needs to create a network of bots. Procedural account generation is the mass creation of individual accounts, where each account can have its own persona (i.e., age, gender, likes, dislikes). While official APIs do not provide this functionality, unofficial APIs, such as web browser automation frameworks, allow accounts to be created (Jr et al., 2018). However, this is becoming increasingly difficult as social media platforms seek to curb the creation of pseudo-accounts: Twitter requires a valid phone number, and Instagram requires solving a CAPTCHA. Some actors use a rotating virtual private network, which cycles through IP addresses to avoid detection by social media platforms.

Actors can also obtain existing accounts that are already active. This removes the need to obtain verification details such as phone numbers to create new accounts, and the actors inherit the accounts' purposes, traits, and networks. It is possible to use a session cookie hijacking attack to reset an account password (Ghasemisharif et al., 2018) or to exploit a flaw in the OAuth 2.0 security authentication protocol to obtain an access token (Hu et al., 2014). Actors can also obtain existing accounts through the Dark Net: bots, or semiautomated accounts that mimic human behaviour, are readily found in underground markets on the Dark Net, that is, forums and websites not indexed by search engines (Frischlich et al., 2020). Accessing these websites requires Tor, an anonymity-oriented browser. In general, the Dark Net offers bots for most social media platforms. Fake Facebook and Twitter accounts trade for around 5–9 Euros on average, and the highest price observed in a recent study was 42,435 Euros for a week's access to a botnet (Frischlich et al., 2020).

Upon obtaining a series of online profiles, the actor then needs to create a network of pseudo-accounts by following or friending their own created pseudo-accounts; political bots have used this technique of following one's own network of accounts to create a false impression of popularity across geographies (Woolley, 2016). The actor can then like and share posts from the pseudo-accounts to increase the attention given to those accounts, since social media algorithms elevate more popular posts. Since the goal is to disseminate information to real users, the actor also performs the same follow/friend actions on real accounts, hoping that a few will reciprocate, and attempts to build trust with real users by liking and sharing their posts.

In the profiling stage, APIs are used to track user engagement with digital content on the platform or on external websites, to track the location of the user's device, or to track the different devices used by the user.
The behavioural data collected here are used to extract knowledge about users, which is used in the next stage to profile them and tailor messages to their unique preferences (Ghosh et al., 2018). With tools provided by Google, actors can insert web beacons into web pages to track users' actions in real time. By studying users' mouse clicks, hand movements, and hovering cursors, individual profiles can be formulated for subsequent targeted information dissemination. A body of research builds preference models from the preferences users exhibit through their online content (Recalde & Baeza-Yates, 2020) or the content they express likes for (Matuszewski & Walecka, 2020). One indication of these models' effectiveness is their ability to detect suicidal tendencies (Ramírez-Cifuentes et al., 2020), which, by extension, suggests that models can be constructed to infer whether a user will believe and spread disinformation content. Table 3 summarises the three broad categories of actions under this stage.

To profile users' interests, actors may track their engagement with paid digital content on web pages. Using Google Analytics, actors can create first-party web cookies to track clicks on advertisements and users' search items and results. Actors can then directly collect user behavioural data and link it to an individual's personally identifiable information, such as email addresses or mobile phone numbers. Facebook's Audience Network API reports user engagement—such as likes on posts and the reach of advertising campaigns—to actors that run their own Facebook pages. While the API does not provide users' personal data, it is a tool that can be used to segment demographic groups so that actors can focus on individuals who are highly responsive to particular messages (Facebook, 2020a). Various actions can then be taken to select the desired group of users based on whether they exhibit (or do not exhibit) a certain attribute (Guilbeault, 2018). This method draws on digital marketing and advertising ideas, including identifying users through societal segments, such as topics
