01. Machine Learning Lesson 1.01: Introduction - What is Machine Learning?
OK, so what is machine learning? There are lots of definitions out there and not everyone agrees on what it actually means. But there are a few definitions that I think are pretty solid, so we'll go through them now. One definition is: algorithms and statistical models that computer systems use to perform a specific task effectively without using explicit instructions. Another definition is: a set of methods that computers use to make and improve predictions of behaviors based on data. Another definition is: machine learning is the science of getting computers to learn and act like humans do and improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations and real-world interactions. So basically, without machine learning, we have to write the program. We have to program the computer to do exactly what we want. And with machine learning, we're basically using these special algorithms that people have come up with to just feed the computer a bunch of data and have it sort of learn the patterns, or learn the program, on its own.
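To make the contrast between writing the program and learning it from data concrete, here is a minimal sketch in Python. The temperature-conversion task, the function names and the five observations are all made up for illustration; they are not from the lesson. The first function is a rule a person wrote. The second estimates the same rule from examples using ordinary least squares.

```python
# Explicit programming: a person writes the rule by hand.
def fahrenheit_by_rule(celsius):
    return celsius * 9 / 5 + 32


# Machine learning: the parameters of the rule are estimated from examples.
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations (Celsius, Fahrenheit) that we "feed" the computer.
celsius_obs = [-10.0, 0.0, 15.0, 30.0, 100.0]
fahrenheit_obs = [14.0, 32.0, 59.0, 86.0, 212.0]

slope, intercept = fit_line(celsius_obs, fahrenheit_obs)


def fahrenheit_learned(celsius):
    # Same job as fahrenheit_by_rule, but nobody typed in 9/5 or 32.
    return slope * celsius + intercept


print(fahrenheit_by_rule(37.0), fahrenheit_learned(37.0))
```

The learned version only ever sees the example pairs, so with noisy or unrepresentative data it would recover a different, less accurate rule.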
02. What Does Someone Who Does Machine Learning Actually Do?
So what does someone who does machine learning actually do? Again, this is something that is not agreed on by everyone and there are a lot of different thoughts on what a machine learning practitioner actually does. One of the common things that comes up a lot is that it's sort of a combination of statistics and computer science. Some people say, if you use machine learning in your work, you're probably someone who is not quite as good at programming as a real computer scientist and not quite as good at statistics as a real statistician, but you can do both kind of reasonably well. And I would say in the energy industry in particular, and probably other industries too, I would add that domain knowledge is a really key part of being a good machine learning practitioner, as well as good communication skills. So you have to be able to take what you've built, take your model or whatever, and understand it in the context of the technical space you're working in. So in our case, if you're building machine learning models for reservoir engineering, if you don't know anything about reservoir engineering, your model is probably going to be lacking in some way. And then the other angle of that is you have to be able to communicate it effectively to other reservoir engineers and managers. So I would say for effective machine learning in the energy industry, domain knowledge and good communication are at least as important as those first two. And there's lots of other things. I guess, in my mind, one of the things that makes a really good machine learning practitioner or data scientist is you need to be curious.
So if you're the kind of person who gets told something and just accepts it at face value, and you're not kind of interested in understanding why it happens or what alternative interpretations might be, then you might not get as far in the machine learning space as someone who's really curious, willing to experiment, is patient with iterations, has maybe some hacking skills or at least kind of likes the idea of hacking things together. And then there's math and probability, which kind of relates to the statistics piece. And software engineering, because obviously, at the end of the day, you have to write some code to make it all work.
03. Machine Learning (ML) vs Artificial Intelligence (AI)
What is machine learning versus artificial intelligence? This is again something that has a relatively low degree of agreement out there and there is no widely accepted definition. AI, artificial intelligence, one description is it's a science or approach to developing technology that works like a human. So it learns by experience and it kind of has that element of, you know, it's like us, it's like our brain. It's kind of that conceptual label that people put on things. Machine learning, most people would consider it a subset of AI, and it focuses on algorithms that learn useful patterns or models from datasets. And then there's deep learning, which is a newer term, not that new anymore. But it's basically just machine learning with multi-layered neural networks. So neural networks are one type of machine learning algorithm or structure, and neural networks with lots of layers are usually considered deep learning. And one of the key factors with deep learning is it requires lots and lots of data. So in oil and gas, with most of the data that we're used to working with, if you take seismic and maybe image processing out of the equation, I would say deep learning is usually less applicable, certainly to the type of stuff that I'm going to focus on in this course for the most part.
A fun definition of AI versus machine learning is that if it's written in Python, it's probably machine learning. And if it's written in PowerPoint, it's probably AI. And that's just explaining how, if you're trying to sell it, if you're trying to get funding, if you're a CEO trying to tell the world that your company is on the cutting edge, you're gonna call it artificial intelligence. If you're someone who's actually working on this stuff, usually you'll call it machine learning.
04. What is a Data Scientist?
So what is a data scientist? Lately, I would say a data scientist is the name for anyone in a technical profession who's currently job hunting. And that's because I know a lot of people like that right now, and if you look at the number of data scientists out there on LinkedIn or wherever else, it seems to be just absolutely exploding. So, a lot of people who used to be technical professionals in some other field are now simply calling themselves data scientists because they think it'll be an effective way to get a job. And I can't blame them. And honestly, maybe you could argue that I'm someone like that because I have no formal data science training. Really, what it is, I guess again there's no one definition, but I would say it's someone whose role combines statistics, computer science, math and ideally domain knowledge. It's also, I would say, someone who can kind of wrangle data and extract meaning from it and communicate that meaning to other people using sound, defensible methods and in a language that they can understand. So right now, data science and data scientist will mean a lot of different things to a lot of different people. Again, there's no one commonly accepted definition. There is even substantial disagreement among well-known, recognized data scientists as to what a data scientist is. There are also lots of articles online complaining about how everyone is calling themselves a data scientist when they aren't really a data scientist, and that only a small minority of these people actually meet what the particular author considers to be a legit data scientist. The problem again is that different authors have different ideas of what is legit, and sometimes those don't even overlap. There are also other related roles out there in addition to data scientist, like data engineer, data analyst, probably lots of others. And there's no one governing body that hands out data scientist licenses.
Like here in Alberta, we have APEGA, which basically says you are a professional engineer and you're licensed or you're not. And there's nothing like that right now for data scientists. So there's nothing stopping anyone from calling themselves a data scientist. And again, for me, I think I probably would meet most definitions of a data scientist, but I actually have no idea.
05. What Are Some Qualities of a Good Data Scientist?
So what are some qualities of a good data scientist? In my mind, domain knowledge is absolutely key for a data scientist to be useful. So in the oil and gas industry, if you want to be useful as a data scientist, I think it's important to know something about the industry. So again, whether you're using data science and machine learning techniques for geological understanding, reservoir engineering understanding, production performance prediction or optimization, it's important to know something about the terminology and the physical realities of what we're talking about.
Good communication skills are an absolute must. If you build the best model in the world, but you can't communicate the way that it works and the results that it suggests to anyone, it's never going to be used. And maybe it was a fun academic exercise, but that's probably all it's ever going to be.
I think someone who's naturally curious is going to do a good job sort of uncovering maybe the more hidden or interesting truths to be found out there. Someone who's constantly learning, self-improving, keeping up with the tools and techniques. More than anything else, I think right now this field is one where it's important to understand the new techniques being developed. The new libraries that are out there, if you're an actual programmer. Or even just the new types of visualizations and communication tools out there, if you're someone who's working with a data scientist or even just consuming what comes out of a machine learning or data science workflow.
I think a good understanding of math and statistics is important. If you're a really good programmer but you're not sure what the results mean, whether they're significant or not, or what some of the spurious correlations and other pitfalls could look like, you can get into trouble. I think having a good understanding of math and statistics is a good way to kind of avoid those and keep yourself as safe as you can be from those kinds of things.
Strategic thinking is good, both in model development, watching out for pitfalls, and in how to communicate the results. Sometimes the most technically accurate description of a result is not the best way to describe it to your CEO.
Someone who understands the theory behind the tools and techniques: the assumptions involved and the appropriate applications of all the different algorithms, tools, techniques and even visualizations. I think that is important. A lot of that comes just with experience. So one quality, I guess, of being good at anything is that you have some practice. So I think that's important there.
And one thing that I like to highlight, in terms of an approach that I think works well, is a hypothesis-driven approach. So this is where the domain expertise starts to blend with the data science. As a reservoir engineer, let's say, you might have an idea that if porosity is higher in the rock, then the rock will be more productive in this way or that way or whatever. That might be a hypothesis, and then we go into the data science workflow and we say, does that bear out in the data or not? And I think that's a good way to kind of look at the world, because if we don't understand what we might be looking for ahead of time, we might again fall victim more easily to spurious correlations and things like that. And so that hypothesis-driven approach, with the willingness to experiment and test and iterate, I think is something that can help data scientists, certainly in the oil and gas space, get far.
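As a rough illustration of asking whether a hypothesis bears out in the data, here is a small Python sketch. The hypothesis (higher porosity goes with higher production) is from the lesson, but the data is synthetic and generated right in the code, and the choice of a one-sided permutation test on the Pearson correlation is an assumption made for this example, not something the lesson prescribes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_test(xs, ys, n_shuffles=2000, seed=0):
    """One-sided test: how often does shuffled data look at least this correlated?"""
    rng = random.Random(seed)
    observed = pearson(xs, ys)
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if pearson(xs, shuffled) >= observed:
            at_least_as_strong += 1
    p_value = (at_least_as_strong + 1) / (n_shuffles + 1)
    return observed, p_value


# Synthetic wells: production is built to depend on porosity, plus noise.
data_rng = random.Random(42)
porosity = [data_rng.uniform(0.05, 0.25) for _ in range(40)]
production = [1000 * p + data_rng.gauss(0, 20) for p in porosity]

r, p_value = permutation_test(porosity, production)
print(f"correlation = {r:.2f}, permutation p-value = {p_value:.4f}")
```

The hypothesis is stated before looking at the data, and the test only checks that one relationship, which is the point of the approach.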
06. Why is Machine Learning so Hyped? Is it Legit?
So why is machine learning so hyped? Is it a legitimate technology that we can expect to be around for a long time? Is it something that's just the latest fad and that's why we're making a video about it? Or is it something that's here to stay? So back in 2015, this is the Gartner Hype Cycle for Emerging Technologies and machine learning is pretty close there on the "peak of inflated expectations" is what they call it. So it's arguably just starting its descent down into the trough of disillusionment. And for the technologies that kind of make it, the idea is that they will then turn the corner at the bottom of the trough of disillusionment down here and start to make their way up the slope of enlightenment and they will reach a plateau of productivity. So this is obviously a hilariously simplified way of looking at the world but the thing to note, I guess, is that back in 2015 machine learning was more or less at the peak of inflated expectations.
2016, machine learning actually went backwards up the slope back to the very, very tip top of the peak of inflated expectations. So whether you think that's just because it got even more hyped or because maybe it has a little bit of staying power, I guess that's still open to interpretation.
2017, we see that it's still right up there and now it's joined by a neighbor called deep learning, which is more or less just a subset of machine learning. And they are still right at the top of the peak of inflated expectations.
Now we have deep learning and deep neural nets, I guess in particular, so the terminology is changing a little bit. We see machine learning not even on the chart anymore, it's been replaced by deep learning, but that's still at the peak of inflated expectations. So I guess my question from all of this is: is machine learning at a permanent peak on the hype cycle? And if it is, and it's been 5 or so years now, I would argue that maybe there's something to it. Maybe there's a reason that it's being so hyped and that it's spent 5 years as one of the biggest focuses, certainly in the technical media, but even in the mainstream media out there. So if you look at surveys across industries, an overwhelming survey response, this was from 2018 I believe, indicates an 80 to 90% plus majority agree with the statement "machine learning or artificial intelligence is making a business impact in our industry today". So if it was just hype, I would say that there probably would not be that level of survey response from management in a wide variety of industries.
Machine learning has demonstrated an ability in many applications across industries to either improve performance relative to traditional methods or match human performance while substantially increasing efficiency through automation and consistency. So if you take something that a person can do and now you can do it automatically and repeatedly, maybe 24 hours a day, 7 days a week, and with ultimate consistency because now a computer is doing it, that can have a lot of advantages. Even if you can't actually say that the machine is doing a better job than a person, it can still be a useful thing. And machine learning has actually demonstrated an ability to exceed human performance in many tasks. So things like image and speech recognition, translation between multiple languages, interpreting radiology scans, playing chess, playing go. There are currently deep neural network programs that have essentially been able to beat any and all human challengers at chess and go. And even StarCraft II, which is a real-time game where things like judgment matter, reflexes matter, and most importantly, it's a game of imperfect information. So there are 2 players who are trying to basically wipe out each other's armies and bases, and you can't see their base until you go and explore it. And so this is a whole new challenge for a machine learning and artificial intelligence system: to actually figure out how to play a game where it doesn't have all the inputs readily at hand like it does in chess or go, where you can see the whole board at once.
07. Different Types of Machine Learning
So there are a few different types of machine learning. The three categories that machine learning is usually kind of lumped into are supervised learning, unsupervised learning and reinforcement learning.
So supervised learning is where the machine or the algorithm learns explicitly. So we feed the computer some inputs and we tell it what the outputs should look like for those inputs. So we kind of give it the answer, we give it the solution. And then we say, go and learn the relationships that map these inputs onto these outputs in the best way you can. And that's supervised learning. So supervised learning requires actual data from us both in an input and an output sense.
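Here is a minimal sketch of that idea in Python: we hand the algorithm inputs together with the outputs we want, then ask it about an input it has not seen. The k-nearest-neighbours method, the feature choice (porosity, net pay thickness) and the six labelled wells are all assumptions made up for this example; the lesson does not name a specific algorithm here.

```python
import math
from collections import Counter


def knn_predict(train_inputs, train_outputs, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest examples."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_inputs, train_outputs)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Inputs: hypothetical (porosity fraction, net pay thickness in metres) per well.
# Note the features are left unscaled, so thickness dominates the distance.
train_inputs = [
    (0.08, 5.0), (0.10, 6.0), (0.09, 4.0),
    (0.20, 15.0), (0.22, 18.0), (0.19, 14.0),
]
# Outputs: the "answers" we supply for each input.
train_outputs = ["poor", "poor", "poor", "good", "good", "good"]

print(knn_predict(train_inputs, train_outputs, (0.21, 16.0)))
print(knn_predict(train_inputs, train_outputs, (0.09, 5.0)))
```

Without the `train_outputs` list there would be nothing to learn the mapping from, which is what makes this supervised.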
Unsupervised learning is where we might just be interested in finding patterns in the data or structures. So we can say, I have all of this data on 10,000 wells and I just want to know what patterns might exist in the dataset. I might not even know what kind of patterns I might be looking for, but I just know that there's probably some order to that data that maybe I want a computer to help me find automatically. So that's where unsupervised learning comes in. Probably the most widely used form of unsupervised learning is clustering. So in this case, usually the evaluation of the data is more qualitative than on the supervised learning side. With supervised learning, we're usually trying to make predictions. With unsupervised learning, we're kind of just trying to understand things.
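Since clustering comes up as the most widely used form, here is a small k-means sketch in plain Python. No labels are supplied; the algorithm only gets the points and the number of groups to look for. The six two-dimensional points are invented for illustration, and the farthest-point starting rule is just a simple deterministic choice for this sketch rather than a standard any library is known to use.

```python
import math


def init_centroids(points, k):
    """Start from the first point, then repeatedly add the point farthest away."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(farthest)
    return centroids


def kmeans(points, k, n_iterations=20):
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has settled
        centroids = new_centroids
    return labels, centroids


# Hypothetical well measurements with two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The output is just group membership. Deciding what the groups mean is the qualitative part that still falls to the person looking at the result.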
Then there's reinforcement learning, which is a little bit more of the behavioral side of machine learning rather than the predictive side. And reinforcement learning is where we have an algorithm that learns based on sort of a reward or a penalty system, or some kind of positive or negative reinforcement system. Ideally, we have sort of an environment that the machine or the program exists or kind of lives in. And what we're trying to do is get it to learn a certain type of behavior that maximizes its reward. So whether that's maximizing the score in some way, whether that's minimizing the amount of errors it makes, those are various ways that we can apply reinforcement learning. Reinforcement learning is also usually, at least in today's world, restricted to neural networks. So if we're talking about reinforcement learning, there's probably a neural network involved. And in the next lesson, we're going to focus on supervised learning, so I'll see you there.
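To show the reward-driven loop in the smallest possible form, here is a sketch of tabular Q-learning in Python. The five-cell corridor environment, the reward of 1 for reaching the last cell and all the parameter values are made up for this example. It deliberately uses a plain lookup table instead of a neural network to keep the code short, so it illustrates the reward mechanism rather than the kind of system described above.

```python
import random

N_STATES = 5          # corridor cells 0..4
GOAL = N_STATES - 1   # reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # index 0 = step left, index 1 = step right


def step(state, action):
    """Environment: move one cell, stop at the left wall, reward 1 at the goal."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
            best = max(q[state])
            greedy = [i for i, value in enumerate(q[state]) if value == best]
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = rng.choice(greedy)  # random tie-break among equally good actions
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: nudge the estimate toward reward + discounted future.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:GOAL]]
print(policy)
```

Nobody tells the program that walking right is correct. It only ever sees the reward at the end, and the preference for stepping right emerges from the updates.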
08. What Are The Consequences of Not Integrating Domain Knowledge Into The Machine Learning Process?
I think if you don't integrate domain knowledge, it's easier to fall victim to a few different kinds of traps. First of all, spurious correlations are easy to come by, and you might not recognize that they're spurious. And so something like the phase of the moon affecting your well productivity results is something that might show up as a strong feature in the model when obviously we know that there's no causal link there. The other thing is it's difficult to know what features we might be able to come up with out of the data. So some engineered features, where we might take the raw data and say, if we look at it like this, that's probably physically relevant. Or that's maybe a good way to look at things that would be intuitive and interpretable to someone who's looking at the results of the model. If we don't have domain expertise, it's really hard to come up with those features. There are automated feature engineering methods and algorithms out there, but they usually sort of just try to throw everything at the wall and see what sticks. And sometimes what sticks is, again, very unintuitive or just not particularly meaningful to someone who's interpreting the results. And usually we can find better links to causality with domain knowledge. So if we're trying to understand what actually drives performance for a well, or what is actually predictive, let's say, of core porosity or something like that, with domain knowledge we can sort of piece together what we know from the physical relationships and see if that's showing through in the model. Or if we think that some things are shining through in the model that we strongly suspect are not causal, we might want to immediately go investigate those or filter those out to make sure we're not falling into a trap there.
And really, at the end of the day, it's also important for the ability to convince stakeholders of the value of the analysis or the predictions generated from the models. So if our features and the whole process are rooted in domain knowledge, we can communicate the modeling process in those terms and we can link it back to some of the first principles that are already widely accepted, and make it much easier to get buy-in from management, let's say, in actually using this model to make decisions.
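The phase-of-the-moon kind of trap is easy to reproduce. The Python sketch below is a made-up experiment, not data from the course: it generates a random "productivity" value for 20 hypothetical wells and 1,000 candidate features that are pure noise, then reports how strongly the best-looking feature correlates with productivity. All sizes and names are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
n_wells = 20
n_features = 1000

# The target and every feature are independent random noise: no causal link anywhere.
productivity = [rng.gauss(0, 1) for _ in range(n_wells)]
noise_features = [
    [rng.gauss(0, 1) for _ in range(n_wells)] for _ in range(n_features)
]

correlations = [abs(pearson(feature, productivity)) for feature in noise_features]
strongest = max(correlations)
looks_meaningful = sum(1 for r in correlations if r > 0.444)

print(f"strongest |correlation| among noise features: {strongest:.2f}")
print(f"noise features with |r| > 0.444: {looks_meaningful} of {n_features}")
```

The 0.444 cutoff is roughly the correlation that would pass a conventional 5% significance test with 20 samples, so a search over many features with few wells will reliably turn up "strong" features that mean nothing. Domain knowledge is what lets you throw them out.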