The previous blog, “Data Science, Machine learning, Business Intelligence – Demystified”, discussed the basic conceptual foundation of machine learning in the context of data science. This blog focuses on the tools/services/software platforms available on Google Cloud Platform (GCP) to perform data analysis using machine learning.
Today, there are various machine learning models, algorithms, and services available to consume on GCP. Although such variety is great, it may seem confusing if you cannot categorize and map your actual use case and business requirements to what GCP has made available for the different use cases.
Let’s analyze the above statement by considering a couple of examples.
In all use cases, we have data that is referred to as input data.
Let’s assume we need labels for the vehicles (car, bus, etc.) in a set of photos, and we may or may not already have labeled photos.
With GCP, life can be easy here, as we can utilize a number of APIs (application programming interfaces/endpoints), ingesting our input data from a GCP storage bucket (Cloud Storage) into the API endpoint (for photos we would typically use the GCP Vision API). The output will be the labels, generated for us by the GCP machine learning algorithm. The nice thing here is that we don’t need to be ML experienced: we don’t really need to select an algorithm, train a model, etc. It is simply a black box that does the job for us in this use case. What we really need to do here is understand the output data/photos, along with the likelihood of correctness for that specific use case scenario.
Based on that, if a photo is labeled as 60% likely to be a car, is this acceptable for the business use case or expectations? For some use cases, such as social media, an accuracy above ~75% may be acceptable. In contrast, for medical applications, the percentage of likelihood must be much higher, e.g. 95% or above. So the question here is: how could we obtain a higher percentage? Simply put, we may need to create our own model. In other words, each use case needs different ML services and capabilities. This blog summarizes this in a simplified way, as illustrated in the figure below.
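To make the acceptance decision concrete, here is a minimal sketch of how a returned label confidence might be checked against a per-use-case minimum. The use-case names and thresholds are illustrative, taken from the examples just discussed:

```python
# Illustrative sketch: deciding whether an API-returned label confidence
# meets the minimum acceptable likelihood for a given use case.
# The use cases and thresholds below are assumptions based on this post.
MIN_CONFIDENCE = {
    "social_media": 0.75,  # ~75% may be acceptable
    "medical": 0.95,       # must be much higher
}

def is_acceptable(use_case, label, score):
    """Return True if the label's confidence score meets the use case's bar."""
    return score >= MIN_CONFIDENCE[use_case]

# A photo labeled "car" with 60% likelihood:
print(is_acceptable("social_media", "car", 0.60))  # not enough for social media
print(is_acceptable("medical", "scan_artifact", 0.97))  # clears the 95% bar
```

If the scores coming back from the API consistently fall below the bar for your use case, that is the signal to move from the plug-and-play APIs toward a custom model.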
At the top of the figure above, we have the ML software platforms offered by GCP, which we can consider SaaS ML.
ML APIs can be thought of as plug-and-play ML services, in which you provide (ingest) your data and receive the results as output.
GCP ML APIs allow developers to extract actionable insights from video, photos, text, etc., without requiring any machine learning knowledge or skills, by taking advantage of a massive library of labels and pre-trained models. For example, with the Cloud Vision API: “The API quickly classifies images into thousands of categories (such as “sailboat” or “Eiffel Tower”), detects individual objects and faces within images, and finds and reads printed words contained within images”
As highlighted earlier, in some cases the required accuracy level, or some special requirement, calls for a more customized model or labels. This is where AutoML (such as AutoML Vision) helps, by building and training custom ML models with minimal ML expertise to meet domain-specific business needs.
According to GCP “Cloud AutoML is a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs, by leveraging Google’s state-of-the-art transfer learning, and Neural Architecture Search technology”
In addition, with BigQuery ML, GCP democratizes machine learning by enabling data analysts to use machine learning through existing SQL-based business intelligence tools and skills. At the time of writing, it supports the following types of ML models: linear regression, binary logistic regression, and multiclass logistic regression for classification. Multiclass models can be used to predict more than two classes, such as whether an input is “low-value”, “medium-value”, or “high-value”.
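As a sketch of what this looks like in practice, the small helper below assembles a BigQuery ML CREATE MODEL statement. The dataset, table, and column names are hypothetical; the resulting SQL would be run through BigQuery’s usual SQL interface:

```python
def create_model_sql(dataset, model_name, model_type, training_query):
    """Build a BigQuery ML CREATE MODEL statement (illustrative helper;
    all names are hypothetical). model_type can be e.g. 'linear_reg' or
    'logistic_reg', matching the model types listed above."""
    return (
        f"CREATE OR REPLACE MODEL `{dataset}.{model_name}`\n"
        f"OPTIONS(model_type='logistic_reg')\n".replace(
            "logistic_reg", model_type
        )
        + f"AS\n{training_query}"
    )

sql = create_model_sql(
    "mydataset",
    "value_classifier",
    "logistic_reg",
    "SELECT value_class AS label, feature1, feature2 "
    "FROM `mydataset.training_data`",
)
print(sql)
```

The point is that the entire model definition lives in SQL: a data analyst never leaves the BigQuery toolchain to train and use one of the supported model types.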
The other ML service category offered by GCP is the Cloud ML Engine. First of all, Cloud ML Engine is not a SaaS platform that you just upload your data to and then start using, like the Google ML APIs or AutoML. Instead, Cloud ML Engine can be thought of as platform as a service (PaaS). How?
As we start considering custom ML models, we are no longer concerned only with accessing an endpoint/API; we need to start thinking about the entire ML model workflow as a process. Technically, we need to consider at least the following steps:
It’s obvious that taking this path is more complex, as it involves more steps and requires ML expertise, compared to the ML APIs/AutoML. This is simply because there is a big difference in the level of complexity here (accessing an ML API endpoint vs. building, tuning, deploying, and maintaining a complete custom model).
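As a purely conceptual sketch of that workflow (not GCP-specific, and with a deliberately trivial stand-in “model”), the custom path means owning every step yourself: splitting data, training, evaluating, and predicting.

```python
import random

# Conceptual sketch of a custom ML workflow: split data, "train" a trivial
# threshold model, evaluate it, then use it for prediction. This is NOT a
# real TensorFlow / ML Engine model; it only illustrates the workflow steps
# you take on when you move past the plug-and-play APIs.

def split(data, train_fraction=0.8, seed=42):
    """Split (feature, label) examples into training and evaluation sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train(examples):
    """'Train': use the midpoint between the two class means as a threshold."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

def evaluate(threshold, examples):
    """Fraction of evaluation examples predicted correctly."""
    correct = sum(predict(threshold, x) == y for x, y in examples)
    return correct / len(examples)

# Toy dataset: (feature, label) pairs with a clear class separation.
data = [(x / 10, 0) for x in range(10)] + [(1 + x / 10, 1) for x in range(10)]
train_set, eval_set = split(data)
model = train(train_set)
print("accuracy:", evaluate(model, eval_set))
```

Every one of these steps (plus deployment and maintenance) is your responsibility with a custom model, which is exactly the complexity the business case needs to justify.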
Therefore, the business situation should warrant that you need this level of complexity.
The role of GCP ML Engine here is like a PaaS: GCP spins up the underlying environment required to run the training and production models across its cloud. However, the ML Engine by itself does not do ML; you, as the ML specialist or data scientist, need to code that.
Practically, creating a graph in TensorFlow to train on the ML Engine is key to developing an application. “But what’s the point of a powerful prediction model that only a data scientist can use?” In IoT/big data use cases with predictive analytics, the goal is to obtain real-time predictions that can feed into a dashboard or other application layers to perform other functions. To do so, the model needs to be accessible from other cloud services or applications, such as GCP Cloud Functions written in Node.js, while the model itself might be built and written in Python.
In this case, GCP ML Engine offers the ability to deploy the model as a RESTful API, providing prediction at scale and making the model available to all sorts of clients, whether we are dealing with a single user or millions of users. This applies to both online and offline (batch) predictions.
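As a sketch, any HTTP-capable client (a dashboard, a Cloud Function, etc.) sends a JSON request to the deployed model’s predict endpoint. The project and model names below are hypothetical, the instance payload depends entirely on what the deployed model expects, and a real call would also carry an OAuth2 access token in the Authorization header:

```python
import json

# Sketch: building an online prediction request for a model deployed on
# ML Engine. Project/model names and the instance fields are hypothetical.
PROJECT = "my-project"
MODEL = "my_model"

url = f"https://ml.googleapis.com/v1/projects/{PROJECT}/models/{MODEL}:predict"
body = json.dumps({"instances": [{"feature1": 0.5, "feature2": 1.2}]})

print(url)
print(body)  # the JSON document a client would POST to the endpoint
```

Because the interface is plain HTTPS + JSON, the Python model and the Node.js consumer never need to share a runtime, which is the decoupling the paragraph above describes.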
The figure below, from GCP, illustrates where Cloud ML Engine provides managed services and APIs as part of the ML workflow (indicated by the blue-filled boxes).
According to GCP, ML Engine offers key advantages when running TensorFlow:
As highlighted earlier in this blog, as part of a custom model build, data exploration and preparation (including preprocessing) is not as simple as uploading data to an ML API. Also, during the design, build, and evaluation of the custom model, there is always a need for some hyperparameter tuning, and feature engineering tasks are required to enhance the model. Jupyter notebooks are a great, proven tool for data preparation, because they are easy to share with subject matter experts: they include text annotations and visualizations in addition to the actual runnable code. In GCP, Cloud Datalab can run gcloud commands directly from its UI and run Jupyter notebooks in a managed environment. Cloud Datalab comes with ML Workbench, a library that simplifies interactions with TensorFlow.
The following are some of the tasks you might want to perform (according to GCP):
When it comes to data exploration and preparation at a higher scale, with complex Extract, Transform and Load (ETL) functions, you can consider GCP Cloud Dataprep by Trifacta, an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. It can automatically detect schemas, datatypes, possible joins, and anomalies such as missing values, outliers, and duplicates, so you get to skip the time-consuming work of profiling your data and go right to the data analysis.
Dataprep uses Apache Beam behind the scenes, but it saves a lot of boilerplate code with its simple GUI. The Apache Beam tasks can run on Cloud Dataflow, which can help you develop and execute a wide range of data processing patterns, including ETL, batch computation, and streaming computation.
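To illustrate the kind of per-element work such a pipeline does, here is a plain-Python sketch of a tiny ETL step (extract rows, filter invalid ones, reformat fields). In a real Cloud Dataflow job these steps would be Apache Beam transforms (e.g. `beam.Map`/`beam.Filter`) rather than list comprehensions; the sample data is made up:

```python
import csv
import io

# Plain-Python sketch of an ETL step: extract rows from CSV, transform
# (drop rows with unparseable scores, normalize fields), and produce the
# records a "load" step would write to a sink such as BigQuery.
raw = "name,score\nalice,0.91\nbob,not_a_number\ncarol,0.42\n"

rows = list(csv.DictReader(io.StringIO(raw)))  # Extract

def valid(row):
    """Keep only rows whose score parses as a number."""
    try:
        float(row["score"])
        return True
    except ValueError:
        return False

clean = [
    {"name": r["name"].title(), "score": float(r["score"])}  # Transform
    for r in rows
    if valid(r)
]

print(clean)  # Load: these records would be written to the output sink
```

The GUI in Dataprep generates this kind of logic for you, which is exactly the boilerplate saving mentioned above.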
When using Cloud Dataprep for exploring, cleaning, and preparing data, we still need to use Cloud Datalab for the model build, splitting data for training, evaluation, and testing, and running the model. Although we can use it for manual hyperparameter tuning, according to GCP, “The preferred approach is to tune hyperparameters using ML Engine. ML Engine tunes hyperparameters automatically based on a declarative YAML setup. The system jumps quickly to the best parameter combinations and stops before going through all the training steps, saving time, compute power, and money”.
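For reference, a declarative YAML setup of this kind looks roughly like the illustrative fragment below. The metric tag and parameter names are assumptions for a training job that exposes a `learning_rate` flag and reports an `accuracy` metric; in a real job they must match what your training code actually reports:

```yaml
# Illustrative hyperparameter tuning config for Cloud ML Engine.
# Metric tag and parameter names must match the training code.
trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 20
    maxParallelTrials: 2
    params:
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
```

ML Engine then runs the trials for you, searching the declared ranges and stopping early when further trials are unlikely to improve the metric.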