What is Data, Really?
Published March 6, 2015
What is data science and what do data scientists do? How will they enhance learning technology products and services of the future? Inquiring minds want to know. I believe we have begun to assemble an amazing team of data scientists in Learning Analytics at McGraw-Hill's Digital Platform Group. The talent pool throughout the entire organization truly makes this company special.
What is Data?
Let's ask the first basic question: What is Data? At first glance the question might seem too simple. We all know what data is. We use it all the time. It is the currency of everything that we do in a digital environment. But do we really understand what data is? The most common interpretation is that data is something that conveys information.
Definition 1: Data = Information
I think this first definition is right and it's useful. In fact, one of the most significant and fundamental contributions to computer science comes from formalizing this definition. The work was done by Claude Shannon of AT&T Bell Labs and MIT. Shannon's work is at the core of modern computer science, and it is also one of the most beautiful and dazzling results in all of science. The idea of measuring information in binary digits (bits), in 0s and 1s, was formalized in Shannon's classic 1948 paper on information theory; his earlier work had already connected George Boole's nineteenth-century algebra to digital circuits. While it takes a bit of mathematics to appreciate its power and reach, it is approachable even to a layperson. Shannon's seminal work also underlies how modern communication networks work: how data flows, how data is transformed, how data is encrypted, and how data is lost and regained on the Internet. More recently, Shannon's ideas have been helping to illuminate how the universe works, both in large-scale structures such as black holes and in small-scale processes such as quantum entanglement.
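To make "data = information" a little more concrete, here is a minimal sketch of Shannon's entropy formula, which measures the average information per symbol in bits. The function name `shannon_entropy` and the coin-flip strings are illustrative assumptions, not anything from Shannon's paper, and the estimate simply uses the symbol frequencies observed in the message.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per symbol, in bits, estimated from symbol frequencies.

    Hypothetical helper for illustration only.
    """
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    # H = sum over symbols of p * log2(1 / p), where p is the observed frequency.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A sequence that looks like a fair coin carries one bit per symbol.
print(shannon_entropy("HTHTHTHT"))
# A sequence with no surprises carries no information.
print(shannon_entropy("HHHHHHHH"))
```

The less predictable the symbols, the more bits each one carries, which is the sense in which data conveys information.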
Definition 2: Data = The Measurement of Something
Data understood as information is a good starting point. But for our purposes I prefer the informal definition by Jer Thorp: data is the measurement of something. For those of you who don't know Thorp's work, he is a pioneer in data visualization. (When he worked at The New York Times, his title was "Data Artist in Residence".)
Thorp's definition brings out several key ideas that are ordinarily hidden from our common-sense view of data. First, it makes explicit the idea that data arises from measurement. Why is this important? It's important because all measurements are error-prone. Scientists are trained to understand this from day one: where there is data, there is also error. The two are inseparable. This is so fundamental an idea that it's important to pause and think through some implications. A true scientist will tell you that any presentation or representation of data is useless without a clear understanding of the error range. If you ask a true scientist for data, they are likely to give you not only the measured value but also the confidence interval or error range. Depending on how the data was measured, derived, sampled, and calculated, there is always an accompanying error. So one of the things that data scientists do is try to understand, thoroughly and rigorously, the error associated with data. When you think of data, think also of the penumbra of error surrounding it. The penumbra is the blurry outermost layer of a shadow, and I find it a fitting image for the intrinsic error associated with any piece of data.
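As a rough illustration of reporting a value together with its error range, the sketch below computes the mean of repeated measurements and an approximate 95% margin of error. The readings are made up, the helper `mean_with_error` is hypothetical, and the multiplier 1.96 assumes a normal approximation; for a sample this small a t-distribution multiplier would give a somewhat wider interval.

```python
import math
import statistics


def mean_with_error(measurements, z=1.96):
    """Return (mean, margin) where margin is z times the standard error.

    Hypothetical helper; z=1.96 is the normal-approximation multiplier
    for a roughly 95% interval.
    """
    n = len(measurements)
    mean = statistics.mean(measurements)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(measurements) / math.sqrt(n)
    return mean, z * standard_error


# Made-up repeated readings of the same quantity.
readings = [9.8, 10.1, 10.0, 9.9, 10.2]
mean, margin = mean_with_error(readings)
print(f"{mean:.2f} ± {margin:.2f}")
```

The point is the shape of the answer: a measured value never travels alone, it comes with its penumbra.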
Another key idea implicit in the definition is that we seldom work with individual data elements. Data travels in groups. And this introduces another source of error or uncertainty. If each data element carries with it a penumbra of uncertainty, data elements in combination and in relation can generate further distortions. This is where statistics comes into play. I think of statistics as the science of error. Modern mathematical techniques can be viewed as powerful lenses that allow us to peer and project into the inner workings of the world. Predictive modeling allows us to forecast the future. Data mining allows us to discern new patterns in large data sets. But the statistician is the craftsman who knows how to remove the distortions in our lenses. Statisticians are also adept at recognizing that what appears to be a meaningful pattern is in fact illusory and could be due to randomness. The old Sherlock Holmes was well versed in scientific induction. The new Sherlock Holmes is also a master in statistics, able to decipher the false trails and avoid the false scents.
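The claim that an apparently meaningful pattern can be due to randomness is easy to demonstrate. The sketch below, with hypothetical names and arbitrary sizes (200 series of 20 points each), generates pure noise and then reports the strongest correlation found between one noise series and all the others. Nothing in the data is related to anything else, yet searching through enough candidates reliably turns up a correlation that looks impressive.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_series=200, n_points=20, seed=0):
    """Largest |correlation| between a random target and many random series.

    Every series is independent noise, so any correlation is illusory.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


print(f"strongest correlation found in pure noise: {strongest_chance_correlation():.2f}")
```

This is the false scent the statistician is trained to recognize: the more places you look, the more likely you are to find a pattern that is not really there.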
One final note about our definition of data as the measurement of something. The second part of the definition tells us that the data is not the thing. The anthropologist Gregory Bateson, echoing Alfred Korzybski's dictum that "the map is not the territory," warned against confusing the two. Data is part of the map and should not be confused with the territory or the reality. In data science (as in all science) there is an intervening layer, the map or "model" of reality. But any model of reality is always one possible model of reality. Models are also inherently approximate and probabilistic. This means that data is not only error-prone but goes hand-in-hand with a set of assumptions about how the world works. So another thing that data scientists do is build models. But great data scientists can also tell you the assumptions that accompany the models.
At McGraw-Hill we want to use data science to build great models that inform great products about learning. In fact, that is the chief activity of the data scientist: build models about some domain of knowledge, test them against the data, clarify our assumptions, eliminate errors, and continuously refine the models. Initially, the work of the data science team is more modest: What data can we surface, in the form of insights and interactive visualizations, that will empower the learner, the instructor, the administrator, and the parent?
I want to finish with a point not just about data but about everything that we do in the digital realm. McGraw-Hill's Chief Digital Officer, Stephen Laster, has observed that if Amazon's or Apple's data about me is incorrect, the consequences are likely to be shallow. If Amazon, for example, makes an "incorrect" recommendation about a product I might want to purchase, most of us, I think, can live with that. But in the world of personalized learning, what if our data, models, and assumptions about the learner and the context of learning are wrong? The stakes, of course, are much higher, and the consequences to the learner can be profound. When it comes to learner data, what is an acceptable margin of error, and how do we keep any margin of error as small and transparent as possible?