If I asked you what is shown in the picture below, I'm sure you'd say it is a Cisco device, and you may even recognize it as a router.
This could be a new router model, and the key point here is that you don't need to know this exact model to decide whether or not it is a router.
You may have years of experience working in IT, or in networking specifically, and having seen many routers before means you've already built the logic in your head to recognize what is and isn't a router. That logic might resemble a program: a device with a few Ethernet or fiber interfaces, possibly a Cisco logo, and so on. You may never have seen a router with a 3G/LTE antenna, but with the experience you have, you can still tell whether it is a router. Also, the key point here is that every new router model you see adds to your experience.
Logically, in this simple scenario, you have a large set of input data: the routers you have seen in person or in pictures. Every new router you see passes, along with the entire data set of routers, through a certain computation in your brain, and you build a model of what a router should look like. Then, when you're shown a picture of a router model you've never seen before, you can still recognize it. Again, the important point here is that you're able to feed each new router you see into the input data and improve your model through computation.
Based on that, AI in simple words can be thought of as a step beyond programming every single input and output (if/then/else logic, loops, etc.): machines or software can improve over time, based on the new inputs they receive, in order to take action. In other words, "computers can learn without explicitly telling the program to do something." This is what is referred to as machine learning: learning from the data you provide, and optionally from inputs a human specifies, without being explicitly programmed to do so.

For example, a recommendation system on the Amazon website or on Netflix is a typical AI-enabled system that relies on a certain ML algorithm, such as decision tree classification, to provide recommendations to the website visitors. Another simple and common example of AI is when you take a photo using your smartphone: the software has an algorithm that can recognize a person's face, distinguish it from a non-face (highlighting it in a square), and then automatically adjust the lighting and focus around the faces. Today there is a huge number of software applications that use ML-capable software platforms in the back end to provide AI functions to end users.
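To make the decision-tree idea concrete, here is a minimal, purely illustrative sketch: a tiny hand-written decision tree that classifies a visitor profile into a recommendation. All the feature names and thresholds here are hypothetical, invented for illustration, and not taken from any real recommendation product.

```python
# Illustrative sketch only: a hand-written decision tree that mimics how a
# recommendation system might classify a visitor. A real system would learn
# these splits from data instead of hard-coding them.

def recommend(visitor):
    """Walk a small decision tree over a visitor profile (a dict)."""
    if visitor["watched_sci_fi"] > 5:            # root split
        if visitor["avg_session_minutes"] > 60:  # second-level split
            return "sci-fi series"
        return "sci-fi movie"
    if visitor["watched_documentaries"] > 3:     # other branch
        return "documentary"
    return "trending titles"                     # default leaf

print(recommend({"watched_sci_fi": 8, "avg_session_minutes": 90,
                 "watched_documentaries": 0}))   # -> sci-fi series
```

In practice an ML library would learn the branch conditions automatically from historical viewing data; the learned tree is then just a nested set of if/else tests like the one above.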
Please note that "software platform" refers to a technology built to provide a generic set of capabilities that others can use to build specific software applications for certain use-case scenarios, such as a cloud ML engine that can be utilized by software applications. The software application, in turn, is the element built to serve a specific use case, such as a CRM.
Today, in the era of digital transformation, IoT, smart cities, and the cloud, it's all about data and how to extract value from it. With the massively growing number of connected devices, the Internet of Things, and the emergence of artificial intelligence, we are encountering what is called a "data tsunami." The common representation of this data tsunami is the term big data, which is commonly characterized by its Volume (more data), Variety (more types of data), and Velocity (the speed at which data arrives or is ingested).
In addition, in big data projects you may see more types of data, such as file data, pictures, and videos.
Such variety and scale of data have changed the way data is stored by software platforms. Traditionally, relational databases were designed and used for structured data, or data that fits into tables. Today we are dealing with semi-structured and unstructured data at scale. This increased variety of data is a key driver of the evolution and selection of NoSQL database solutions.
In addition, there’s graph data type, where there’s a number of nodes and the relationships between them. An example of this would be social media, something like Facebook, that required another type of software platforms (database) to store it.
As a result, the new trends in data, "big data," are allowing us to ask new types of questions. These questions are more predictive, rather than business-intelligence focused. What does this mean? Business intelligence provides deterministic analytics: the same result over and over, based on existing historical data. For example: how many black shoes did company X sell in a given city in a certain period?
Predictive analytics, on the other hand, gives probabilistic results: what is the probability that company X will sell a certain number of shoes in the same place at the same time in the future?
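The contrast can be sketched in a few lines of Python. The sales figures below are invented for illustration: the BI question is a deterministic aggregation over history, while the predictive question extrapolates a trend, producing an estimate rather than a fact.

```python
# Deterministic BI query vs. a simple predictive estimate.
# Historical monthly sales of black shoes in one city (invented numbers).
sales = [100, 110, 120, 130]

# BI question (deterministic): how many were sold in total?
total = sum(sales)  # same answer every time for the same data

# Predictive question (probabilistic): roughly how many next month?
# Fit a least-squares line y = a + b*x and extrapolate one step ahead.
n = len(sales)
x_mean = (n - 1) / 2
y_mean = sum(sales) / n
b = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(sales)) / \
    sum((x - x_mean) ** 2 for x in range(n))
a = y_mean - b * x_mean
forecast = a + b * n  # an estimate, not a guaranteed outcome

print(total, forecast)  # 460 and a trend-based forecast of 140.0
```

A real predictive model would also attach a confidence interval to the forecast; the point here is only that the answer is a probability-weighted estimate, not a lookup.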
Such predictive models often involve another kind of data, commonly referred to as behavioral data, as an additional dimension.
For instance: where did the customer walk while in the shopping center? What type of shoes did he or she search for in the store?
In other words, this type of data refers to things that were not transactions, but are nonetheless interesting and can help in predicting future behavior. So the idea here is to model and to predict, and machine learning algorithms are the key enabling tool for obtaining predictive analytics.
However, before starting with data analysis and processing, there must first be a pre-defined purpose that aims to produce measurable outcome(s) from the input data, which should ultimately help make a business decision (a data-driven decision). The process of defining the purpose, then analyzing the data and presenting the outcomes, is called data science.
It is important to note that the data science tools and techniques used will not, by themselves, make someone a data scientist. It is the scientific techniques, not the tools, that make someone a data scientist.
However, without these tools and techniques it is difficult today to gain insight from data and become a successful data scientist. The tools used in data science can be categorized as: data storing (spreadsheets, databases, and key-value stores), data scrubbing (text editors, scripting tools, and programming languages like Python), and data analysis (statistical packages and ML libraries such as R and Python's data libraries).
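A small data-scrubbing sketch in Python, of the kind the scripting category above refers to: normalizing messy raw records before any analysis. The input records are invented for illustration.

```python
# Data scrubbing: turn inconsistent raw text records into clean,
# typed tuples that analysis tools can consume.

raw = ["  Black Shoes , 42 ", "black shoes,17", "RED shoes ,  5"]

def scrub(record):
    """Split 'product,quantity', trim whitespace, and normalize case."""
    product, quantity = record.split(",")
    return product.strip().lower(), int(quantity.strip())

clean = [scrub(r) for r in raw]
print(clean)
# [('black shoes', 42), ('black shoes', 17), ('red shoes', 5)]
```

Note that after scrubbing, "Black Shoes" and "black shoes" become the same product, which is exactly the kind of silent inconsistency that would otherwise distort a later analysis.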
If we look at the above data science process, acquisition refers to anything that pertains to obtaining the data, which can take different forms of accessing or moving data. In data acquisition, databases come into play as data sources; a variety of such databases need to be integrated with in order to acquire data in both real-time and batch modes.
Then there are the mechanisms to transport the data (batch or streaming ingestion) among databases, or from different databases and sources, such as an IoT edge or a regional database, to a central database.
Once the data is obtained, it should next be prepared. This stage includes exploring samples of the data to understand its nature, finding meaning in it, and evaluating its quality and format. Following this exploratory analysis sub-task, pre-processing is sometimes required: cleaning, filtering, and re-modeling the raw data, and possibly integrating multiple data sources.
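The preparation stage can be sketched in a few lines: explore a sample, check its quality (here, count missing values), then filter them out before analysis. The sensor-style readings below are invented for illustration.

```python
# Data preparation sketch: explore a sample, assess quality, then clean.

sample = [21.5, None, 19.0, 22.3, None, 20.1]   # raw readings, some missing

missing = sum(1 for v in sample if v is None)   # quality check
values = [v for v in sample if v is not None]   # cleaning / filtering
mean = sum(values) / len(values)                # simple exploration

print(f"{missing} missing of {len(sample)} readings; mean of the rest: {mean}")
```

Real pipelines make the same decisions at scale, for example choosing between dropping incomplete records (as above) and imputing replacement values.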
Next comes the data analysis step, where different analytical techniques can be used, specifically machine learning models and algorithms.
Data analysis involves building a model from the obtained data. In ML terms, it is the application of statistics and of supervised or unsupervised data analysis and mining.
Supervised means you have a known, labeled set of data to learn from; unsupervised means you take the unlabeled data you have and apply a data analysis or mining algorithm to find structure in it.
Generally, there’s three main or common types of ML algorithms > regression, which predicts a future value, classification, which predicts the membership in a group. Are you in group A or B, or C or D? And there’s clustering, which is unsupervised > what items appear to be together at what frequency.
In fact, at this stage, machine learning also has its own process.
According to Google Cloud: To develop and manage a production-ready model, you must work through the following stages:
These stages are iterative. You may need to reevaluate and go back to a previous step at any point in the process.
Last but not least, in this blog we looked at machine learning through the lens of data science, in which the outputs need to be evaluated and presented in a visualized manner. This is where the data scientist needs to look back at the driver of the entire data science activity: reporting insights from the analysis and determining actions from those insights, based on the purpose that was initially defined. This is also what is commonly meant by a "data-driven decision."
Subsequent blogs will focus on data science and ML on Google Cloud, AWS, and Microsoft Azure, highlighting the different available options and services that can be used as part of the data science and ML processes.