filzfreunde.com

Enhancing Machine Learning with BigQuery ML and XGBoost

Written on

Introduction to BigQuery ML

Data scientists and engineers frequently encounter obstacles when handling extensive datasets, particularly in the context of training machine learning models. BigQuery ML, part of Google Cloud's robust toolkit, enables users to train and deploy machine learning models directly within BigQuery.

Benefits of Using BigQuery ML

BigQuery ML presents numerous benefits, including:

  • The ability to work with large datasets without the need to load them into local memory, facilitating model training.
  • The use of SQL, which promotes effortless collaboration among team members.
  • Instant availability of trained models for making predictions, eliminating the necessity for further deployment actions.

Overview of the Tutorial

In this guide, we will navigate through the following steps:

  1. Establishing and creating a dataset in BigQuery.
  2. Developing a binary logistic regression model using the CREATE OR MODEL statement.
  3. Assessing the model's performance through the ML.EVALUATE function.
  4. Generating predictions with the ML.PREDICT function.

Finally, we will explore enhancements by adjusting the model and parameters to incorporate XGBoost.

Setting Up Your Dataset

This article illustrates a practical example of how BigQuery ML can assist you in constructing a binary logistic regression model utilizing the Google Analytics sample dataset to forecast whether a website visitor will complete a transaction.

To begin, you need to set up a BigQuery dataset for your machine learning model. Follow the step-by-step guide to create a new dataset in the Google Cloud console.

Preview of the Dataset: "ga_sessions_20170801"

Creating the Model

Next, we will create a logistic regression model for classification. In a new query tab, we utilize the CREATE MODEL statement to define the model, select features, and specify the training table range.

In this instance, the model will predict transactions based on device operating system (os), whether the device is mobile (is_mobile), the country of the user, and the number of pageviews.

CREATE OR REPLACE MODEL bqml_tutorial.sample_model

OPTIONS(model_type='logistic_reg') AS

SELECT

IF(totals.transactions IS NULL, 0, 1) AS label,

IFNULL(device.operatingSystem, "") AS os,

device.isMobile AS is_mobile,

IFNULL(geoNetwork.country, "") AS country,

IFNULL(totals.pageviews, 0) AS pageviews

FROM

bigquery-public-data.google_analytics_sample.ga_sessions_*

WHERE

_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

This query generally takes around three minutes to execute. Once completed, you can find a new section labeled Models (1) under your bqml_tutorial table in the BigQuery console, where your model is stored. Some graphs will automatically generate to show training details.

Evaluating the Model

In machine learning, a classifier consists of a set of target values assigned to labels. In this case, we employ a binary classification model to identify transactions, where the label column includes two classes:

  • 0: indicating no transaction
  • 1: indicating a transaction occurred

To evaluate your model's effectiveness, utilize the ML.EVALUATE function.

SELECT

FROM

ML.EVALUATE(MODEL bqml_tutorial.sample_model, (

SELECT

IF(totals.transactions IS NULL, 0, 1) AS label,

IFNULL(device.operatingSystem, "") AS os,

device.isMobile AS is_mobile,

IFNULL(geoNetwork.country, "") AS country,

IFNULL(totals.pageviews, 0) AS pageviews

FROM

bigquery-public-data.google_analytics_sample.ga_sessions_*

WHERE

_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

The evaluation query executes almost instantaneously, resulting in various metrics such as:

  • Precision: Measures the frequency of accurate positive predictions.
  • Recall: Indicates the proportion of correctly identified positive labels from all possible positives.
  • Accuracy: The ratio of correct predictions made by the classification model.
  • F1 Score: A harmonic mean of precision and recall, with 1 being perfect and 0 the worst.
  • Log Loss: A metric representing the difference between the model's predictions and the actual labels.
  • ROC AUC: Reflects the likelihood that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative one.

Making Predictions with Your Model

With the model evaluated, you can now utilize it for predictions through the ML.PREDICT function.

Predicting Transactions by Country

Here, we will predict the number of transactions by country and purchases per user.

SELECT

country,

SUM(predicted_label) as total_predicted_purchases

FROM

ML.PREDICT(MODEL bqml_tutorial.sample_model, (

SELECT

IFNULL(device.operatingSystem, "") AS os,

device.isMobile AS is_mobile,

IFNULL(totals.pageviews, 0) AS pageviews,

IFNULL(geoNetwork.country, "") AS country

FROM

bigquery-public-data.google_analytics_sample.ga_sessions_*

WHERE

_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

GROUP BY country

ORDER BY total_predicted_purchases DESC

LIMIT 10

The prediction query runs in about one second, yielding a table of the top ten countries with the most predicted purchases.

Purchases Per User

Next, we will forecast the number of transactions each website visitor is likely to make.

SELECT

fullVisitorId,

SUM(predicted_label) as total_predicted_purchases

FROM

ML.PREDICT(MODEL bqml_tutorial.sample_model, (

SELECT

IFNULL(device.operatingSystem, "") AS os,

device.isMobile AS is_mobile,

IFNULL(totals.pageviews, 0) AS pageviews,

IFNULL(geoNetwork.country, "") AS country,

fullVisitorId

FROM

bigquery-public-data.google_analytics_sample.ga_sessions_*

WHERE

_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

GROUP BY fullVisitorId

ORDER BY total_predicted_purchases DESC

LIMIT 10

This will provide a table of the top ten users with the most predicted purchases.

Improving Results with XGBoost

BigQuery ML has certain limitations regarding the types of models it supports. Here’s a list of some built-in models:

  • Linear regression (LINEAR_REG) for predicting continuous variables.
  • Logistic regression (LOGISTIC_REG) for classification tasks.
  • K-means (KMEANS) for category identification.
  • XGBoost (BOOSTED_TREE_CLASSIFIER) for implementing machine learning algorithms based on the Gradient Boosting framework.

To enhance our previous outcomes, we will try using XGBoost. I recommend the following options for BOOSTED_TREE_CLASSIFIER:

  • L2_reg = 0.1
  • num_parallel_tree = 8
  • max_tree_depth = 10

CREATE OR REPLACE MODEL bqml_tutorial.sample_model_2

OPTIONS(

model_type='BOOSTED_TREE_CLASSIFIER',

l2_reg = 0.1,

num_parallel_tree = 8,

max_tree_depth = 10) AS

SELECT

IF(totals.transactions IS NULL, 0, 1) AS label,

IFNULL(device.operatingSystem, "") AS os,

device.isMobile AS is_mobile,

IFNULL(geoNetwork.country, "") AS country,

IFNULL(totals.pageviews, 0) AS pageviews

FROM

bigquery-public-data.google_analytics_sample.ga_sessions_*

WHERE

_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

Creating an XGBoost model generally takes longer than a simple linear regression model, around nine minutes. Afterward, we will evaluate our model to gauge our success.

SELECT

FROM

ML.EVALUATE(MODEL bqml_tutorial.sample_model_2, (

SELECT

IF(totals.transactions IS NULL, 0, 1) AS label,

IFNULL(device.operatingSystem, "") AS os,

device.isMobile AS is_mobile,

IFNULL(geoNetwork.country, "") AS country,

IFNULL(totals.pageviews, 0) AS pageviews

FROM

bigquery-public-data.google_analytics_sample.ga_sessions_*

WHERE

_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

The results indicate slight improvements, as evidenced by the updated metrics for precision, accuracy, and ROC AUC, while log loss has also decreased.

Conclusion

BigQuery ML serves as a remarkable tool for expediting your machine learning processes, especially with large datasets. By following this tutorial, you will acquire practical experience in creating, assessing, and deploying ML models using BigQuery ML.

Try it out and discover how it can optimize your machine learning projects and foster collaboration within your team.

If you found this format helpful, please consider clapping and following my profile to encourage more content creation! Also, subscribe to my newsletter for timely updates on new materials.

Share the page:

Twitter Facebook Reddit LinkIn

-----------------------

Recent Post:

Using Google Finance for Market Insights

Discover how to effectively utilize Google Finance for real-time market insights, empowering your investment decisions.

Understanding Your Emotions for Better Decision-Making

Explore how recognizing your emotions enhances decision-making for business success.

Building a Universal ChatGPT Access App: My Progress So Far

Discover my journey in creating an app for seamless ChatGPT access across platforms and share your suggestions for features!