The size & scope of your data

IoT with Robert Welborn

The most fascinating thing about the scope and size of big data is realizing that the vast majority of the data that ended up in these big systems came from companies that automated a form.

Years ago you had a call center employee, somebody would call them, and they would fill out a piece of paper. Over time, the company invested in software and they basically turned that form into an application. It was a person talking to a person filling something out. There was never a point at which you were collecting so much information that somebody got tired of typing.

Along comes the sensor

Then along comes the sensor. The sensor collecting data doesn't care about the form. It doesn't care about anybody typing. It's just collecting everything. Somebody says at some point, "How frequently should we turn the thing on to collect data?" All engineers invariably ask, "Well, how fast can I turn it on?" If you can turn it on every millisecond or every hundred milliseconds, engineers will collect everything because they don't want to miss anything.

We engineered big data from the sensor point of view to be millions of times bigger than the data we had collected traditionally beforehand. It's not even on the same scale. Once you start writing that down and joining it with other sets of data, it is now millions times millions to the millionth power bigger, so these are tremendous data sets.

Data insight about behavior

The interesting thing from the engineering standpoint was the individual data point that was there. This engine is performing the way I wanted it to. It's too hot. It's too cold. The security sensor turned on. It turned off. What was really fascinating was when you actually start looking at those things and studying trends and understanding what it means when I look at the doors opening and the doors closing and the number of customers that are coming into my store. What does it mean when I correlate the doors opening and closing with the advertising that I'm doing? I can say if the door opens 15 times in a week and it opened 14 times last week, then it means that people are actually responding to my ad. That's an insight that never existed before. Now in the age of IoT, we can extend that silly, dumb sensor into an insight about the behavior of everyone around it.
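As a rough sketch of the door-sensor comparison, assuming a hypothetical log of door-open dates (the dates below are invented to match the 14 and 15 openings in the example, and `weekly_counts` is an illustrative helper, not part of any real system):

```python
from collections import Counter
from datetime import date, timedelta

def weekly_counts(open_dates):
    """Count door-open events per ISO (year, week) pair."""
    return Counter(d.isocalendar()[:2] for d in open_dates)

# Hypothetical sensor log: 14 openings the week before the ad ran,
# 15 openings the week the ad ran.
week_before = [date(2024, 1, 1) + timedelta(days=i % 7) for i in range(14)]
week_of_ad = [date(2024, 1, 8) + timedelta(days=i % 7) for i in range(15)]

counts = weekly_counts(week_before + week_of_ad)
before = counts[(2024, 1)]   # ISO week 1 of 2024
after = counts[(2024, 2)]    # ISO week 2 of 2024
change = after - before

print(f"before ad: {before}, during ad: {after}, change: {change:+d}")
```

The sensor only records that the door opened. The insight comes from grouping those events by week and lining them up against something else, here the week the ad ran.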

When we start getting into image, audio, and video data, the unstructured data, the scale is enormous. There's structured data, which comes in rows and columns. It's very big. I can't remember the actual stat on the number of cat videos that appear on the internet every second, but I want to say it's on the order of 130 cat videos every second that make it onto the internet. Every one of those things adds about 21 kilobytes. So 40% of the internet is essentially us sharing images of cats, and 30% of the world owns a cat, which is really funny. That vastly over-indexes dog people watching cat videos.

The massive amount of unstructured data around audio, video and text is just another sensor. The camera is just another sensor. Most cameras are shooting at 60 frames a second if they're high definition. It's a similar situation with text, which is usually just transcribed from one of those other media.

Talent to meet the growth of scale

Another sensor is collecting exabytes of data nonstop, and the only real reason you ever stop it, in the case of a camera or in the case of audio, is that you said this experience is over and it's time for the next one. The rate at which those grow is more problematic. Most sensor data, where it's sensing temperature, for example, is easy for me to break into a date timestamp and a temperature. It's indexed. It's keyed. It's easy for me to break apart. I understand the stuff that I don't care about in that feed, but in the audio and video and text, I don't. I actually have to run something over the top of that. The scale of those growing is more troublesome for scientists in IT departments and those sorts of things because of the translation layer that's needed to turn it into something that I can use, whereas temperature is ready to use immediately. The combination of those two things together means that I need to have people who are able to build a translation layer and I need to have another group of people who are able to build an insight layer.
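To make "ready to use immediately" concrete, here is a minimal sketch, assuming a made-up feed format of `timestamp,temperature` lines and an arbitrary 30-degree threshold (neither comes from a real device):

```python
from datetime import datetime

def parse_reading(line):
    """Split one 'ISO-timestamp,temperature' line into a keyed record."""
    stamp, temp = line.split(",")
    return {
        "timestamp": datetime.fromisoformat(stamp),
        "temperature_c": float(temp),
    }

# Hypothetical feed: one reading per second.
feed = [
    "2024-01-01T00:00:00,21.5",
    "2024-01-01T00:00:01,21.7",
    "2024-01-01T00:00:02,35.0",
]

readings = [parse_reading(line) for line in feed]

# Because every record is already keyed, filtering is one line.
too_hot = [r for r in readings if r["temperature_c"] > 30.0]
print(too_hot)
```

There is no equivalent one-line split for a video frame or an audio clip. Something has to turn pixels or sound into keyed records first, which is the translation layer described above.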

It makes this data boom the most expensive data boom we've ever had because it’s people with PhDs doing this, not necessarily what we've traditionally had in IT shops where there are mostly people who graduated with a computer science degree.

Universal data vs. every atom in the universe

Ten to the 80th is the number of atoms in the visible universe, which is what we can see from all of our telescopes. We are at a point where you can easily get ten to the 80th combinations in just the number of vehicles on the road today and the number of potholes that they're detecting. Every atom in the universe exists and it's stable. If I start collecting data on it, after the first three seconds there will be three times as much data on every atom in the universe as there was. Conceivably, if I collected for years, we could grow far faster. There can be far more data in the universe than there are atoms in the universe, which is ridiculous.
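One way to see how quickly combinations reach that number, under the simplifying assumption (mine, not the speaker's) that each observation is a yes/no fact such as "this vehicle detected this pothole", so that n observations allow 2^n possible combinations:

```python
ATOMS_IN_VISIBLE_UNIVERSE = 10 ** 80  # the figure quoted above

def min_binary_observations(target):
    """Smallest n such that 2**n exceeds target."""
    n = 0
    while 2 ** n <= target:
        n += 1
    return n

n = min_binary_observations(ATOMS_IN_VISIBLE_UNIVERSE)
print(n)  # 266
```

Under that assumption, 266 independent yes/no observations already allow more combinations than the quoted number of atoms, which is far fewer than the number of vehicles on the road.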

In the case of the Theranos bust, the Feds showed up with a big Amazon Web Services truck, plugged it into Theranos's data center, and downloaded every single exabyte of all the stuff that the company tested, if it actually tested anything. The semi truck weighed the same when it showed up as it did when it left, but the difference was it had everything that Theranos had done for 13 years uploaded into it.

It has no weight, but it is exponentially larger in size. As a data scientist, the way we begin to grapple with this is through science. In science, we essentially give structure to unstructured things. If I have unlabeled things, I give them labels. That's basically the fundamental principle of science. I'm going to break it down into pieces that I can describe first, and then compare the described phenomena with other described phenomena. From there, I'm going to measure things and try to determine if there are trends and patterns that exist inside of it. It's just the scientific method. The great and awesome thing behind the stupid name that we have of data scientists is that we are just scientists, and technically every other scientist is doing the same thing we are. Outside of social scientists, I'm sure there are some who will say we're far more important than what we do.


Data hypothesis

The long and short of it is that we're going to do experiments within the data. We form a hypothesis, test the hypothesis, and, if the hypothesis is true, we run another series of tests to see whether or not that thing is there. If it's there, then we're going to sink a whole bunch more experiments and data and trials into it and see whether or not we make money off of it. That's unlike universities, where we would sink a whole bunch of grants into it and see whether or not we made PhDs off of it.
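The speaker doesn't name a particular test, so the following is only one possible illustration: an exact permutation test, with invented weekly door-open counts, asking whether weeks with an ad really differ from weeks without one. The function name and the data are hypothetical.

```python
from itertools import combinations

def exact_permutation_p_value(treated, control):
    """One-sided exact permutation test on the difference in means.

    Returns the fraction of all ways to relabel the pooled values whose
    mean difference is at least as large as the observed one.
    """
    pooled = treated + control
    k = len(treated)
    total = sum(pooled)
    observed = sum(treated) / k - sum(control) / len(control)

    hits = 0
    splits = 0
    for idx in combinations(range(len(pooled)), k):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / k - (total - t_sum) / (len(pooled) - k)
        splits += 1
        if diff >= observed:
            hits += 1
    return hits / splits

# Hypothetical weekly door-open counts.
weeks_with_ad = [15, 16, 17, 18]
weeks_without_ad = [11, 12, 13, 14]

p = exact_permutation_p_value(weeks_with_ad, weeks_without_ad)
print(round(p, 4))
```

A small value says the observed gap would be rare if the ad made no difference, which is the signal to run the next series of tests.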

AI exists because of the data. Now that the data and the technology are both in a form in which we can train the AI off of the data, we don't have to make guesses about the way that humans behave or the way streets are constructed. We can just use the data. We can actually teach the AI how to make these big decisions.

Alan Turing's paper essentially says we can build a baby and train it to do anything. We can build an adult and hope it already knows everything, and see how it does things. But what he proposed was to build an adolescent that has some basic skills and train it, and then test it to see how it works. Turing's approach is the single best idea that we've had in the space. The test that we have for each of these things is to build something and feed it data. After 1000 data points, is it doing better than it was after it had one data point? After a million data points, is it doing better than it was after it had 1000? If so, then we can say this is artificially intelligent.

Machine Learning

Part of the reason I prefer the term machine learning is because it's less complicated. For example, say we create a cucumber sorter, and after it has seen 1000 cucumbers, it is making fewer errors than it was after it had seen five cucumbers. What you eventually get to is a space in which the cucumber sorter is something that is a learning thing.

In a space in which the cucumber sorters are learning, the most important part of this process is that I didn't write any additional code for it. I wrote the basic piece, and it learned the rest. The human was able to step out of the way and let it finish the work. The rule that we and the machine learner agreed to was that metric. And as long as you're doing that better, then we all agree that's good.
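A toy version of that cucumber sorter might look like the sketch below. Everything in it is an assumption made for illustration: the sorter only looks at length, the hidden rule is "20 cm or longer is large", and the training stream is a fixed made-up sequence. The agreed metric is the number of errors on a fixed test set.

```python
class CucumberSorter:
    """Learns a length cutoff between 'small' and 'large' from examples."""

    def __init__(self):
        self.longest_small = None
        self.shortest_large = None

    def learn(self, length_cm, label):
        if label == "small":
            if self.longest_small is None or length_cm > self.longest_small:
                self.longest_small = length_cm
        else:
            if self.shortest_large is None or length_cm < self.shortest_large:
                self.shortest_large = length_cm

    def predict(self, length_cm):
        if self.longest_small is None or self.shortest_large is None:
            return "small"  # no basis for a cutoff yet
        cutoff = (self.longest_small + self.shortest_large) / 2
        return "large" if length_cm >= cutoff else "small"

def true_grade(length_cm):
    # Hypothetical ground truth, used only to label the examples.
    return "large" if length_cm >= 20 else "small"

def errors(sorter, test_lengths):
    # The metric both sides agreed on: misgraded cucumbers in the test set.
    return sum(sorter.predict(x) != true_grade(x) for x in test_lengths)

# Made-up stream of 1000 cucumber lengths between 5 and 35 cm.
stream = [(i * 7) % 31 + 5 for i in range(1000)]
test_lengths = list(range(5, 36))

sorter = CucumberSorter()
error_log = {}
for seen, length in enumerate(stream, start=1):
    sorter.learn(length, true_grade(length))
    if seen in (5, 1000):
        error_log[seen] = errors(sorter, test_lengths)

print(error_log)  # {5: 3, 1000: 0}
```

No code changed between cucumber 5 and cucumber 1000. The only thing that changed was the data the sorter had seen, and the error count fell from 3 to 0.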

That counts as intelligence. We don't always agree that that's intelligence. Elon Musk is afraid that we're going to tell the cucumber sorter that sorting cucumbers is the most important thing for it to do, and that if a person gets in its way, the artificial intelligence will realize that it needs to kill the person in order to continue sorting cucumbers. Now suddenly it's doing something we don't want it to do in addition to that thing that we want it to do.

And therein lies the measure of intelligence that considers how we create morality and all these other things that are associated with intelligence. This is the point where everything starts falling apart. Do I have to tell it, in its quest to sort cucumbers, don't kill people? That would be good, but that's hard to model. If it's never killed anyone before, why would I tell it not to kill people? If it's never painted one of the cucumbers orange before, how would I know to tell it not to paint it orange?