Enhancing Machine Learning with BigQuery ML and XGBoost
Written on
Introduction to BigQuery ML
Data scientists and engineers frequently encounter obstacles when handling extensive datasets, particularly in the context of training machine learning models. BigQuery ML, part of Google Cloud's robust toolkit, enables users to train and deploy machine learning models directly within BigQuery.
Benefits of Using BigQuery ML
BigQuery ML presents numerous benefits, including:
- The ability to work with large datasets without the need to load them into local memory, facilitating model training.
- The use of SQL, which promotes effortless collaboration among team members.
- Instant availability of trained models for making predictions, eliminating the necessity for further deployment actions.
Overview of the Tutorial
In this guide, we will navigate through the following steps:
- Establishing and creating a dataset in BigQuery.
- Developing a binary logistic regression model using the CREATE OR MODEL statement.
- Assessing the model's performance through the ML.EVALUATE function.
- Generating predictions with the ML.PREDICT function.
Finally, we will explore enhancements by adjusting the model and parameters to incorporate XGBoost.
Setting Up Your Dataset
This article illustrates a practical example of how BigQuery ML can assist you in constructing a binary logistic regression model utilizing the Google Analytics sample dataset to forecast whether a website visitor will complete a transaction.
To begin, you need to set up a BigQuery dataset for your machine learning model. Follow the step-by-step guide to create a new dataset in the Google Cloud console.
Preview of the Dataset: "ga_sessions_20170801"
Creating the Model
Next, we will create a logistic regression model for classification. In a new query tab, we utilize the CREATE MODEL statement to define the model, select features, and specify the training table range.
In this instance, the model will predict transactions based on device operating system (os), whether the device is mobile (is_mobile), the country of the user, and the number of pageviews.
CREATE OR REPLACE MODEL bqml_tutorial.sample_model
OPTIONS(model_type='logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
bigquery-public-data.google_analytics_sample.ga_sessions_*
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
This query generally takes around three minutes to execute. Once completed, you can find a new section labeled Models (1) under your bqml_tutorial table in the BigQuery console, where your model is stored. Some graphs will automatically generate to show training details.
Evaluating the Model
In machine learning, a classifier consists of a set of target values assigned to labels. In this case, we employ a binary classification model to identify transactions, where the label column includes two classes:
- 0: indicating no transaction
- 1: indicating a transaction occurred
To evaluate your model's effectiveness, utilize the ML.EVALUATE function.
SELECT
FROM
ML.EVALUATE(MODEL bqml_tutorial.sample_model, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
bigquery-public-data.google_analytics_sample.ga_sessions_*
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
The evaluation query executes almost instantaneously, resulting in various metrics such as:
- Precision: Measures the frequency of accurate positive predictions.
- Recall: Indicates the proportion of correctly identified positive labels from all possible positives.
- Accuracy: The ratio of correct predictions made by the classification model.
- F1 Score: A harmonic mean of precision and recall, with 1 being perfect and 0 the worst.
- Log Loss: A metric representing the difference between the model's predictions and the actual labels.
- ROC AUC: Reflects the likelihood that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative one.
Making Predictions with Your Model
With the model evaluated, you can now utilize it for predictions through the ML.PREDICT function.
Predicting Transactions by Country
Here, we will predict the number of transactions by country and purchases per user.
SELECT
country,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL bqml_tutorial.sample_model, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country
FROM
bigquery-public-data.google_analytics_sample.ga_sessions_*
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10
The prediction query runs in about one second, yielding a table of the top ten countries with the most predicted purchases.
Purchases Per User
Next, we will forecast the number of transactions each website visitor is likely to make.
SELECT
fullVisitorId,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL bqml_tutorial.sample_model, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country,
fullVisitorId
FROM
bigquery-public-data.google_analytics_sample.ga_sessions_*
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10
This will provide a table of the top ten users with the most predicted purchases.
Improving Results with XGBoost
BigQuery ML has certain limitations regarding the types of models it supports. Here’s a list of some built-in models:
- Linear regression (LINEAR_REG) for predicting continuous variables.
- Logistic regression (LOGISTIC_REG) for classification tasks.
- K-means (KMEANS) for category identification.
- XGBoost (BOOSTED_TREE_CLASSIFIER) for implementing machine learning algorithms based on the Gradient Boosting framework.
To enhance our previous outcomes, we will try using XGBoost. I recommend the following options for BOOSTED_TREE_CLASSIFIER:
- L2_reg = 0.1
- num_parallel_tree = 8
- max_tree_depth = 10
CREATE OR REPLACE MODEL bqml_tutorial.sample_model_2
OPTIONS(
model_type='BOOSTED_TREE_CLASSIFIER',
l2_reg = 0.1,
num_parallel_tree = 8,
max_tree_depth = 10) AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
bigquery-public-data.google_analytics_sample.ga_sessions_*
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
Creating an XGBoost model generally takes longer than a simple linear regression model, around nine minutes. Afterward, we will evaluate our model to gauge our success.
SELECT
FROM
ML.EVALUATE(MODEL bqml_tutorial.sample_model_2, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
bigquery-public-data.google_analytics_sample.ga_sessions_*
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
The results indicate slight improvements, as evidenced by the updated metrics for precision, accuracy, and ROC AUC, while log loss has also decreased.
Conclusion
BigQuery ML serves as a remarkable tool for expediting your machine learning processes, especially with large datasets. By following this tutorial, you will acquire practical experience in creating, assessing, and deploying ML models using BigQuery ML.
Try it out and discover how it can optimize your machine learning projects and foster collaboration within your team.
If you found this format helpful, please consider clapping and following my profile to encourage more content creation! Also, subscribe to my newsletter for timely updates on new materials.