Essay: Complex Social Systems – You’ll Need More Than Just Big Data

November 14, 2018 | February 7, 2019 | Tom Briggs | complexity, complexity science, computational modeling, computational social science, data science, measurement, systems

What makes a “complex system” so vexing is that its collective characteristics cannot easily be predicted from underlying components: the whole is greater than, and often significantly different from, the sum of its parts. A city is much more than its buildings and people. Our bodies are more than the totality of our cells. This quality, called emergent behavior, is characteristic of economies, financial markets, urban communities, companies, organisms, the Internet, galaxies and the health care system.

–Geoffrey West, Santa Fe Institute

The ability to collect and pin to a board all of the insects that live in the garden does little to lend insight into the ecosystem contained therein.

–John H. Miller and Scott E. Page

Defining and Understanding Complexity

Miller and Page (2007) initially punt on defining complexity by invoking Justice Stewart’s definition of pornography: “I know it when I see it.” Many definitions of complexity and complex systems have been suggested, yet none is widely accepted. Often, simply describing the features of complexity seems the modus operandi in explaining complexity.

Key features of a complex system include:

Constituent parts that interact in such ways as to give rise to non-linear and often unanticipated or unpredictable outcomes, outcomes particularly unexpected if the system were examined in a reductionist fashion
Feedback loops between parts and levels of the system and often the system and the environment (Simon, 1969)
Self-organization of constituent parts, adaptation, evolution
And, possibly unique to complex human social systems, the possibility for second-order emergence – emergence of reflexive social institutions based on human collective action (Gilbert & Troitzsch, 2005)

Gilbert and Troitzsch (2005) offer the idea that emergence requires new descriptions that are not required to describe the behavior of underlying components: an individual atom has no temperature, but the interaction of atoms in motion gives rise to temperature. Complexity scholars have also distinguished complication from complexity as a means of explaining complexity. Miller and Page (2007) suggest that removing a seat from a car makes it less complicated while removing the timing belt makes it less complex. Santa Fe Institute president David Krakauer noted in an August 2015 interview, “A watch is complicated…your family is complex,” suggesting that we understand how all of the constituent parts of a watch work together to make a functioning timepiece, but we do not fully understand the various forces that make a family function or not function. Removing a specific part from a watch has a predictable, known consequence. Removing–or adding–a family member changes the interactions in the family’s social system in unknown and unpredictable ways.

The crux of complexity in social systems, then, is how the interactions between individuals in the system give rise to new, emergent properties of the system that cannot be understood by studying each individual alone, as represented by the poetic if macabre Miller and Page (2007) quote about pinning insects to a board. Perhaps one of the most well-known examples in computational social science (CSS) of macro-level emergence from the interaction of agents in a complex social system is Thomas Schelling’s (1971) model of segregation. Schelling demonstrated that when individuals choose where to live based on even a very slight preference for having some neighbors who look similar to them, a tremendous degree of residential segregation, akin to that observed in many American cities, results without any governmental or other top-down organizing schema. Likewise, Simon (1969) recounts teaching urban land use to architectural students who had difficulty accepting that land-use patterns in medieval cities arose from cumulative individual decisions over time rather than top-down guidance from a central planner or designer.

Miller and Page (2007) suggest that innate features of social systems tend to produce complexity: social agents are “enmeshed in a web of connections with one another and, through a variety of adaptive processes, they must successfully navigate through their world” (p. 10). Part of agents’ navigation of the world necessarily involves making decisions and undertaking behaviors either in response to the decisions and behaviors of others, or, importantly, in anticipation of what others will do. The number and disparate types of connections result in non-linear behavior and an inability to reduce the system to its constituent parts without losing the emergent properties of the system (Miller & Page, 2007). Torrens (2010) notes that self-organization and the propagation of information back and forth across scales – notable features of human social systems – embody emergence, a hallmark of complexity.

“Big Data” and Data Science – Not Enough for a Science of Complex Social Systems

Like complexity, definitions of “big data” can seem difficult to pin down, particularly depending on perspective. Technical perspectives approach big data in terms of the “3Vs”: volume, velocity, and variety of data. This perspective is concerned with factors like storage space, transmission networks, and sensors. Another perspective is that of the scientist and researcher: instead of data collection as an expensive, painstaking, time-consuming process that nevertheless results in small samples and woefully inadequate statistical power, it is now possible in some disciplines to quite literally download data that can plausibly be used for research by writing just a small amount of code and tapping the API of a site like Twitter.

Cioffi-Revilla (2014) has described computational social science (CSS) as an “instrument-enabled discipline.” Inasmuch as CSS utilizes computation to investigate complex social systems, big data – and even bigger “computers” – are perhaps an extension of this paradigm: an improvement to our scientific instruments for the study of social complexity. A fascinating example is the controversial research on massive-scale emotional contagion through the social network Facebook. In their paper, the research team investigated a phenomenon in which individuals are affected by the emotional expressions of others and, in turn, affect others through their own expression or withholding of emotion; they noted that the minuscule but statistically significant effect size could only have been detected in a sample as large as that available to the Facebook Data Science team (Kramer, Guillory, & Hancock, 2014). In the context of complex social systems, then, big data represents improved measurement possibilities. At one time, measures of length were imprecise at best: the width of a man’s thumb, the length of his foot, and the breadth of his outstretched arms were the original, inconsistent measures of inch, foot, and yard. Measurement certainly became more precise as more accurate tools were propagated, but more or better data did not change the underlying construct of human height, though it may have helped improve the ability to study it.
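The claim that a minuscule effect can only be detected in a very large sample can be made concrete with a textbook power calculation. The sketch below is an illustration only: it uses the standard normal approximation for a two-group comparison of means, the function name `n_per_group` is hypothetical, and the effect sizes are illustrative values, not figures taken from Kramer et al. (2014), whose design and analysis differ.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    mean difference `d` in a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    # Illustrative effect sizes only (assumptions, not values from the study).
    for d in (0.5, 0.1, 0.02, 0.001):
        print(f"d = {d}: about {n_per_group(d):,} participants per group")
```

Because the required sample grows with 1/d², shrinking the effect by a factor of ten multiplies the needed sample by roughly a hundred, which is why effects this small are invisible in conventionally sized studies.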

Big data isn’t required to appreciate or understand social complexity, however. Returning to Schelling’s work on residential segregation, it is noteworthy that his initial investigation required little more than coins placed on a checkerboard that were then moved according to a series of simple rules. Schelling did not possess or even generate big data, but the modeled social system contained all of the features of a complex system: interacting agents, feedback, adaptation, and emergence. It is also the case that enormous datasets might reveal nothing about complexity; a computer is a complicated machine capable of generating enormous amounts of data on CPU and memory cycles as it operates, but this is not complexity: it is merely executing code, as designed and instructed. A computer, then, is a vastly more complicated watch.
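The checkerboard procedure is simple enough to sketch in a few lines of code. What follows is a minimal illustration in the spirit of Schelling's model, not a reproduction of his 1971 rules: the grid size, vacancy rate, tolerance threshold, random relocation rule, and all function names are assumptions chosen for brevity.

```python
import random


def neighbors(grid, r, c):
    """Return the occupants of the (up to eight) cells surrounding (r, c)."""
    size = len(grid)
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < size and 0 <= cc < size:
                if grid[rr][cc] is not None:
                    found.append(grid[rr][cc])
    return found


def like_share(grid, r, c):
    """Share of an agent's neighbors that match its type (None if isolated)."""
    near = neighbors(grid, r, c)
    if not near:
        return None
    return sum(1 for kind in near if kind == grid[r][c]) / len(near)


def mean_similarity(grid):
    """Average like-neighbor share over all agents that have neighbors."""
    shares = [like_share(grid, r, c)
              for r, row in enumerate(grid)
              for c, kind in enumerate(row) if kind is not None]
    shares = [s for s in shares if s is not None]
    return sum(shares) / len(shares)


def run_schelling(size=20, vacancy=0.1, threshold=0.3, sweeps=50, seed=1):
    """Move unhappy agents to random empty cells; return similarity before/after."""
    rng = random.Random(seed)
    n_empty = round(size * size * vacancy)
    n_agents = size * size - n_empty
    # "X" and "O" are arbitrary labels for the two kinds of coin.
    cells = ["X"] * (n_agents // 2) + ["O"] * (n_agents - n_agents // 2)
    cells += [None] * n_empty
    rng.shuffle(cells)
    grid = [cells[i * size:(i + 1) * size] for i in range(size)]
    empties = [(r, c) for r in range(size) for c in range(size)
               if grid[r][c] is None]
    before = mean_similarity(grid)

    def unhappy(r, c):
        share = like_share(grid, r, c)
        return share is not None and share < threshold

    for _ in range(sweeps):
        movers = [(r, c) for r in range(size) for c in range(size)
                  if grid[r][c] is not None and unhappy(r, c)]
        if not movers:
            break  # everyone is content; the pattern has settled
        rng.shuffle(movers)
        for r, c in movers:
            if not unhappy(r, c):
                continue  # earlier moves in this sweep may have fixed things
            i = rng.randrange(len(empties))
            rr, cc = empties[i]
            grid[rr][cc], grid[r][c] = grid[r][c], None
            empties[i] = (r, c)
    return before, mean_similarity(grid), grid


if __name__ == "__main__":
    before, after, _ = run_schelling()
    print(f"mean like-neighbor share: {before:.2f} -> {after:.2f}")
```

Each agent here only asks for 30 percent of its neighbors to be like itself, yet the average like-neighbor share typically ends up well above that tolerance, which is the emergent pattern the essay describes. No dataset is involved at any point.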

In 2008, WIRED Editor-in-Chief Chris Anderson proclaimed that the deluge of data spelled “the end of theory” and made the scientific method obsolete. Anderson argued that we’ve moved beyond needing to seek causation when we find correlation, that “correlation is enough” with big data. The question – perhaps best left to philosophers of science – is how to define “enough”? Anderson pointed to Google’s success at solving tasks algorithmically, by throwing more data at more computational power, without the need to even understand the underlying data. Surely one can think of examples in which “enough” might pass the sniff test for a profit-motivated entity, but perhaps not for the scientist driven by intellectual curiosity. An enormous dataset of measures of sky color all over the earth would establish a strong correlation with the sky being blue at midday, yet this tells us nothing about why the sky appears blue to us. Likewise, human beings, owing to our bounded rationality and limited cognition (Simon, 1969, 1976), are fairly terrible sensors in comparison to the satellites and robots NASA might send to Mars. Yet preparation and training for a manned Mars mission is earnestly underway. Why? Arguably, because human curiosity transcends merely knowing “good enough” correlation. Simon described “the vivid new perspective we gained of our place in the universe when we first viewed our own pale, fragile planet from space” (1969). Enormous data on the tremendous number of stars and planetary bodies hadn’t taught that lesson; it required space travel, an enormous feat of collective action in a complex society (Cioffi-Revilla, 2014).

While the “big data” buzzword declined in the first decade of this century, at least according to Google’s Ngram Viewer (see embedded chart at top of post), it is still a paradigm taken seriously by complexity scholars and computational scientists. SFI’s Geoffrey West sees a role for big data in enabling large-scale simulations and models of complex social systems – if, he asserts, we determine a “big theory” to guide which questions we ask and which data we use (2013). In the Manifesto of Computational Social Science, Conte et al. (2012) likewise suggest that big data will play an important role in investigating important questions of human social complexity, but only when coupled with the core principles and concepts of CSS: psychology and the human mind, uncertainty, social change and adaptation, networks, and non-linear and non-equilibrium dynamics, to name but a few.

Pietsch (2013) also takes a highly integrative perspective, using philosophy of science to answer the charge that big data spells “the end of science.” Calling big data “the new science of complexity,” he refutes the notion that big data is not concerned with causality in complex social systems, and in fact suggests that big data will allow for a “contextualization of science” at the level of complex systems rather than attempting to model causality by reducing a phenomenon through the “dubious simplifications” common in techniques like structural equation modeling used in social science (Pietsch, 2013).

There is little doubt that big data offers exciting new prospects for the study of complex social systems, perhaps in validating complex social system models like Robert Axtell’s 1:1 model of the U.S. economy (Axtell, 2016) or providing more reliable and robust datasets on agent interaction through the sensors contained in smartphones and other so-called “wearables.” Big data advocates who proclaim the end of the scientific method, however, would do well to keep the complexity hallmark of emergence in mind, since emergent behavior is by nature unpredictable. If the emergent property of a complex social system has not yet emerged, there may be nothing in the data – regardless of size – that can describe or predict what’s yet to come. Moreover, the adaptation to feedback that is characteristic of complex social systems also suggests the possibility that big data itself becomes part of the environmental landscape, feedback to which our existing complex social systems and the agents therein will adapt and evolve!

Conte et al. (2012) see a role for big data in the modeling stage when investigating complex social systems; that is, data can reveal statistical features of the system to be studied, and these features can be incorporated in a complex social system model, or the emergence of such features may become the object of study. Caution should be exercised in “forcing” big data into simulation models (Conte et al., 2012), and highly detailed predictions of complex social systems, even with big data, may never be possible (West, 2013).

In sum, complexity in social systems is present with or without “big data”; simply observing three preschoolers as they interact, communicating with each other via the linguistic symbol system that emerged to transcend individual human cognitive limitations and with each preschooler predicting and reacting to what each other says or does, can very well lead to highly unpredictable and emergent behavior! At the same time, enormous data can exist from very complicated machines that are not, themselves, complex because they fail the hallmark tests of complexity: self-organization, feedback, emergence. From a methodological perspective, big data technologies and techniques represent new possibilities for how complex social systems might be studied in the discipline of computational social science (e.g., Conte et al., 2012). The fact that computational social science is generative – i.e., can you grow it? (Epstein, 1999) – at times invites the dubious if well-meaning “But where did the data in your model come from?” question, as if actual data generated by human beings – regardless of how or under what circumstances – somehow trumps even the most elegant and effective model. CSS must continue to expand its interdisciplinary toolbox of scientific instruments (Cioffi-Revilla, 2014) and embrace big data as yet another tool to improve our models, our understanding, and our explanations of the complexity inherent in social systems.

REFERENCES

Axtell, R. L. (2016, May). 120 million agents self-organize into 6 million firms: a model of the US private sector. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems (pp. 806-816). International Foundation for Autonomous Agents and Multiagent Systems.

Cioffi-Revilla, C. (2014). Introduction to computational social science: principles and applications. London: Springer.

Conte, R., Gilbert, N., Bonelli, G., Cioffi-Revilla, C., Deffuant, G., Kertesz, J., … Helbing, D. (2012). Manifesto of computational social science. The European Physical Journal Special Topics, 214(1), 325–346. http://doi.org/10.1140/epjst/e2012-01697-8

Epstein, J. M. (1999). Agent-based computational models and generative social science. Generative Social Science: Studies in Agent-Based Computational Modeling, 4(5), 4–46.

Gilbert, G. N., & Troitzsch, K. G. (2005). Simulation for the social scientist (2nd ed.). Maidenhead, England; New York, NY: Open University Press.

Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24), 8788–8790. http://doi.org/10.1073/pnas.1320040111

Miller, J. H., & Page, S. E. (2007). Complex adaptive systems: an introduction to computational models of social life. Princeton, N.J: Princeton University Press.

Pietsch, W. (2013). Big Data–The New Science of Complexity. Retrieved from http://philsci-archive.pitt.edu/9944/

Schelling, T. C. (1971). Dynamic models of segregation. Journal of Mathematical Sociology, 1(2), 143–186.

Simon, H. A. (1969). The sciences of the artificial (3rd ed., reprint). Cambridge, Mass.: MIT Press.

Simon, H. A. (1976). Administrative behavior: a study of decision-making processes in administrative organization (3rd ed.). New York: Free Press.

Torrens, P. M. (2010). Geography and computational social science. GeoJournal, 75(2), 133–148.

West, G. (2013). Big data needs a big theory to go with it. Scientific American, May, 15.

What are you measuring?

March 25, 2017 | Tom Briggs | data science, measurement, performance, quote

The gross national product does not allow for the health of our children, the quality of their education or the joy of their play. It does not include the beauty of our poetry or the strength of our marriages, the intelligence of our public debate or the integrity of our public officials.

It measures neither our wit nor our courage, neither our wisdom nor our learning, neither our compassion nor our devotion to our country, it measures everything in short, except that which makes life worthwhile.

—Robert F. Kennedy

Data Science, Ethics, and Academics in Industry

June 2, 2016 | April 28, 2017 | Tom Briggs | big data, data science, ethics, privacy, psychology

I’ve been fielding more questions about research ethics and protecting individuals with regard to data science and big data. The topic warrants a much more in-depth discussion than this blog post, but I’ve noticed one trend that’s worth pointing out: academics who previously worked at research universities are leaving academia, either temporarily or permanently, for tech companies and industry.

Academic researchers are almost always required to submit their research proposals to their organization’s Institutional Review Board (IRB), an interdisciplinary group of researchers charged with protecting human subjects as outlined in the 1979 Belmont Report and overseeing research ethics training at most universities and research organizations. Private companies are under no such obligation, as the controversial Facebook study (PDF) of emotional contagion demonstrated. These companies rely on the permissions granted by users who consent to the Terms of Service agreements prior to signing up for the service.

For me, it remains an open question whether researchers in private industry are adhering to a “do no harm” maxim. The obvious tension is that profit-motivated entities like startups and publicly-traded tech companies are interested in maximizing investor or shareholder value and are not subject to the same research ethics requirements as publicly-funded research universities.

I’m encouraged that some academic researchers like Jessica Vitak are tackling these issues and looking for ways to increase transparency in big data use. Vitak’s Privacy + Security Internet Research Lab is tackling exactly these questions. I had the opportunity to hear Vitak speak at the recent Human-Computer Interaction Laboratory annual symposium at the University of Maryland, College Park. One of the potential solutions that Vitak suggests is that the peer review process for academic publications and conferences needs to fill gaps left by insufficient IRB expertise in some areas of data science. This won’t necessarily change what private companies do with individual data, but it’s certainly a start. The controversial Facebook study now includes an “Editorial Expression of Concern,” which appeared after the publication of the study. Had the editor and peer reviewers at PNAS been more attuned to research ethics and human subjects protection during the peer review process, the Facebook authors might have been asked to do a much better job of addressing the ethical implications in their research.

Of course, this raises the thornier question of rejecting research that does not adhere to accepted human subjects protections: in this case, we do not reward the authors for failing to conduct research in an ethical manner, but we prevent information about the research from entering the public domain. I don’t have a good answer to this issue.

I don’t specifically intend to pick on the tech companies here. Plenty of other industries have, in the name of profit-driven research, done harm. But tech companies also represent a particularly desirable organization in which to do research. Traditionally, researchers, especially in the social sciences, had to painstakingly collect their own experimental or correlational data. This was both time consuming and expensive, and perhaps too often resulted in non-significant findings because the research sample was too small. Tech companies, on the other hand, are awash in data that represents a potential intellectual gold mine for social scientists.

My hope is that those who leave academia for the bountiful data available at tech companies remember and abide by their research ethics training, even when they aren’t required to. I also hope that tech companies are engaging with experts in research ethics and taking any objections by those experts seriously.

A recent NPR Hidden Brain podcast episode, “This is Your Brain on Uber,” featured an interview with Keith Chen, who appears to be both Head of Economic Research at Uber and a tenured professor at Yale. If he indeed holds dual roles, it raises important ethical questions about the research he is conducting for Uber. Does Chen conform to the same human subjects protection protocols at Uber that he must when working “at” Yale? Or is there an artificial separation because Uber isn’t Yale and isn’t subject to the same requirements?

During the episode, Shankar Vedantam at one point asks Chen about the implications for individual users’ privacy in research projects based on users’ data. Chen seemed concerned about the implications Vedantam raised, but also somewhat dismissive, simply noting that Uber has a Privacy Officer, a hire that was made only after a user outcry when it was discovered that an Uber executive may have inappropriately used his access to track the movements of a reporter. Chen said he didn’t usually worry about his behavioral data being used by tech companies, but that Vedantam’s question was making him think more about it.

I am encouraged that reporters are challenging researchers and industry on their data and research practices and I certainly don’t believe we should throw the proverbial baby out with the bathwater here. There is much to be gained by using these first-ever datasets of human behavior that will add to what we know and understand about humans and social behavior.

It’s also the case that with great power comes great responsibility. Greater transparency, the involvement of research ethicists, and ensuring truly informed participants should be required not just for academic researchers, but also for researchers working in industry.

Look for a future post on the role of psychologists in the ethical conduct of research, and why I believe that a professional code of ethics is a vital component of protecting individuals.

A small pitfall in film actor social network analysis using IMDB data

March 1, 2016 | Tom Briggs | computational modeling, CSS, data science, network science, SNA, social network analysis

This is a short post on a minor but consequential pitfall of social network analyses of film actors.

One thing that has always bothered me about social network analysis of so-called “actor networks” using data from IMDB is the very simple fact that these analyses are based on the assumption that because two actors appear in the same film, they know each other.

This is simply not true.

Modern filmmaking techniques and the high cost of actor set time incentivize filmmakers not to have expensive actors on set at the same time unless absolutely necessary. Instead, stand-ins are often used in place of star actors–especially in dialogue scenes–and footage is later edited to put the two star actors together in the finished product.

So, in theory, two actors can appear in the same film and even in the same scenes but never actually be on set together. Extrapolating, two actors could appear in the same film and never actually meet.

I’ve been waiting to find a solid example and finally found one.

Robert Rodriguez (@Rodriguez), the writer-director-producer best known for his films Sin City, From Dusk till Dawn, Once Upon a Time in Mexico, and Spy Kids, was interviewed on the Tim Ferriss Show and described exactly this situation occurring during Sin City. Rodriguez describes Sin City as one of the most rapidly-executed projects he ever worked on, going from initial concept and collaboration with Frank Miller to actually shooting the film in a matter of months. In fact, Rodriguez describes shooting scenes for Sin City with actor Mickey Rourke in which Rodriguez or another crew member would stand in for the villain, who at that time hadn’t been cast. Rutger Hauer was later cast and the complementary footage was shot for the scenes. According to Rodriguez, Rourke and Hauer claim they never met, despite appearing together in a Sin City scene in which Rourke’s character appears to have his hands on Hauer’s throat.
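To see where the assumption enters the analysis, here is a minimal sketch of how a co-appearance network is usually built from cast lists, along with one possible correction. Everything in it is illustrative: the function names are hypothetical, the cast list contains only the two actors named above plus a placeholder, and the `never_met` list stands in for whatever a subject-matter expert can supply.

```python
from itertools import combinations


def co_appearance_edges(casts):
    """Link every pair of actors credited in the same film.

    This encodes the usual assumption: shared credit implies a social tie.
    Returns a dict mapping a sorted (actor, actor) pair to the set of films
    the pair shares.
    """
    edges = {}
    for film, cast in casts.items():
        for a, b in combinations(sorted(set(cast)), 2):
            edges.setdefault((a, b), set()).add(film)
    return edges


def verified_edges(edges, never_met):
    """Drop pairs that an expert source reports never actually met."""
    blocked = {tuple(sorted(pair)) for pair in never_met}
    return {pair: films for pair, films in edges.items()
            if pair not in blocked}


# Toy data: two actors from the anecdote plus a hypothetical placeholder.
casts = {
    "Sin City": ["Mickey Rourke", "Rutger Hauer", "Actor A (hypothetical)"],
}
# Expert knowledge (here, Rodriguez's account) that contradicts the assumption.
never_met = [("Rutger Hauer", "Mickey Rourke")]

if __name__ == "__main__":
    naive = co_appearance_edges(casts)
    checked = verified_edges(naive, never_met)
    print(len(naive), "naive edges;", len(checked), "after expert review")
```

The naive projection creates a Rourke-Hauer tie purely from the shared credit; only outside knowledge about the production removes it, and cast data alone cannot reveal which other ties are equally spurious.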

The lesson is what every good data scientist and computational modeler should always keep in mind: justify all assumptions and always include or at least consult subject-matter experts who know the system and data being studied!

Tom Briggs, PhD

Improve performance. Make work better.
