How to Combine Data from Two BigQuery Datasets

BigQuery is a data warehouse, which implies a great degree of centralization. You can run BigQuery SQL against a single BigQuery dataset or across several. The benefits of BigQuery are most apparent when you combine datasets from completely different domains.

This article will teach you how to unify data from two of BigQuery’s publicly available datasets using SQL.

What Is Google BigQuery?

Google BigQuery is a cloud data warehouse managed by Google that can analyze huge amounts of data within seconds. If you have a working knowledge of SQL queries, you are already halfway to using it. Numerous public datasets are available for hands-on experience.

To access and work with a BigQuery dataset, you can use the Google Cloud console, the bq command-line tool, or the BigQuery REST API through client libraries for languages such as .NET, Java, or Python.

Various third-party tools can also interact with BigQuery datasets, for example to visualize or load data.

What are the Publicly Available BigQuery Datasets?

A public dataset is a dataset that is stored in BigQuery and made available to the general public through the Google Cloud Public Dataset Program. BigQuery hosts these datasets so that you can access them and integrate them into your applications. A few examples are as follows:

  • GSOD
  • Github_Nested
  • Github_timeline
  • Natality
  • Wikipedia
  • Trigrams
  • Shakespeare

What is the ‘JOIN’ BigQuery SQL command?

To combine data from three or more BigQuery datasets, set up a join between two tables, then join the result with a third table, and so on until all of them are joined. In BigQuery’s legacy SQL dialect the JOIN syntax depended on the size of the tables being joined; the standard SQL used in this article makes no such distinction.
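The chaining pattern can be tried locally. The following is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for BigQuery; the three tables (growth, population, region) and every value in them are invented for illustration, and only the shape of the chained JOIN is meant to carry over.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")

# Three hypothetical tables with made-up rows, all keyed by country name.
conn.executescript("""
    CREATE TABLE growth (country TEXT, gdp_growth REAL);
    CREATE TABLE population (country TEXT, population INTEGER);
    CREATE TABLE region (country TEXT, region TEXT);
    INSERT INTO growth VALUES ('Country A', 1.5), ('Country B', 2.0);
    INSERT INTO population VALUES ('Country A', 1000), ('Country B', 2000);
    INSERT INTO region VALUES ('Country A', 'North'), ('Country B', 'South');
""")

# Join the first two tables, then join a third table onto the same key.
chained_rows = conn.execute("""
    SELECT g.country, g.gdp_growth, p.population, r.region
    FROM growth AS g
    JOIN population AS p ON p.country = g.country
    JOIN region AS r ON r.country = g.country
    ORDER BY g.country
""").fetchall()

for row in chained_rows:
    print(row)
```

Each additional table adds one more JOIN ... ON clause, so the query grows by one clause per table.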

The JOIN operation merges two items so that the SELECT clause can query them as one source. The join condition specifies how rows from the two items are combined, and which rows are discarded, to form that single source.

For more detail, see the JOIN section of the BigQuery SQL query syntax documentation.
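To make the combine-and-discard behavior concrete, here is a small sketch in Python, again using an in-memory SQLite database in place of BigQuery. The tables and rows are hypothetical; the point is that a plain JOIN keeps only the rows that satisfy the join condition.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical tables: 'Country C' appears only in growth,
# 'Country D' appears only in population.
conn.executescript("""
    CREATE TABLE growth (country TEXT, gdp_growth REAL);
    CREATE TABLE population (country TEXT, population INTEGER);
    INSERT INTO growth VALUES
        ('Country A', 1.5), ('Country B', 2.0), ('Country C', -0.5);
    INSERT INTO population VALUES
        ('Country A', 1000), ('Country B', 2000), ('Country D', 4000);
""")

# A plain JOIN is an inner join: rows without a match on either side are dropped.
joined_rows = conn.execute("""
    SELECT g.country, g.gdp_growth, p.population
    FROM growth AS g
    JOIN population AS p ON p.country = g.country
    ORDER BY g.country
""").fetchall()

print(joined_rows)
# [('Country A', 1.5, 1000), ('Country B', 2.0, 2000)]
```

Country C and Country D are discarded because neither has a partner row in the other table.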

Step-By-Step Guide to Combine BigQuery Datasets

Let’s join two publicly available BigQuery datasets using BigQuery SQL (Structured Query Language).

Prerequisites

  • Working knowledge of BigQuery SQL.
  • Familiarity with BigQuery as a platform.
  • Good knowledge of the BigQuery datasets that you will be working with.

Step 1: Have a clear goal of what data you want to fetch from the tables.

For example, we are considering the following public BigQuery datasets.

  • UN SDG = Growth rate of GDP per capita (%)/Annum
  • World Bank WDI = Overall Population

Let’s write a BigQuery SQL query to select the following data from the first dataset.

-- GDP per capita growth for 2016 from the UN SDG dataset
SELECT geoareaname, timeperiod, value
FROM `bigquery-public-data.un_sdg.indicators` AS un_sdg
WHERE seriesdescription = 'Growth rate of real GDP per capita (%)/Annum'
  AND timeperiod = '2016'

Let’s write a BigQuery SQL query to select the desired data from the second dataset.

-- Total population for 2016 from the World Bank WDI dataset
SELECT year, value, country_name
FROM `bigquery-public-data.world_bank_wdi.indicators_data` AS wb_wdi
WHERE indicator_name = 'Population, total'
  AND year = 2016

Step 2: Now, write a SQL query that combines the desired data.

-- Join the two datasets on country name, keeping only the 2016 rows
SELECT
  un_sdg.geoareaname,
  un_sdg.timeperiod,
  un_sdg.value AS GDP_per_Capita_growth,
  wb_wdi.country_name,
  wb_wdi.year,
  wb_wdi.value AS WB_Population
FROM `bigquery-public-data.un_sdg.indicators` AS un_sdg
JOIN `bigquery-public-data.world_bank_wdi.indicators_data` AS wb_wdi
  ON wb_wdi.country_name = un_sdg.geoareaname
WHERE un_sdg.seriesdescription = 'Growth rate of real GDP per capita (%)/Annum'
  AND un_sdg.timeperiod = '2016'
  AND wb_wdi.indicator_name = 'Population, total'
  AND wb_wdi.year = 2016
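To experiment with the shape of the Step 2 query without a BigQuery project, the sketch below reproduces it in Python against an in-memory SQLite database. The table names un_sdg_indicators and wdi_indicators_data and all rows are invented stand-ins (SQLite cannot reference `project.dataset.table` names), so only the query structure carries over. As a variation, it filters each source in a WITH clause before joining, which keeps the two Step 1 queries visible inside the combined query.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical tables that mimic only the columns used in the article.
# All rows are made up for illustration.
conn.executescript("""
    CREATE TABLE un_sdg_indicators (
        geoareaname TEXT, timeperiod TEXT, seriesdescription TEXT, value REAL);
    CREATE TABLE wdi_indicators_data (
        country_name TEXT, year INTEGER, indicator_name TEXT, value REAL);
    INSERT INTO un_sdg_indicators VALUES
        ('Country A', '2016', 'Growth rate of real GDP per capita (%)/Annum', 1.5),
        ('Country A', '2015', 'Growth rate of real GDP per capita (%)/Annum', 1.1),
        ('Country B', '2016', 'Growth rate of real GDP per capita (%)/Annum', 2.0),
        ('Country B', '2016', 'Some other series', 9.9);
    INSERT INTO wdi_indicators_data VALUES
        ('Country A', 2016, 'Population, total', 1000.0),
        ('Country A', 2015, 'Population, total', 990.0),
        ('Country B', 2016, 'Population, total', 2000.0),
        ('Country B', 2016, 'Some other indicator', 7.0);
""")

# Filter each source first (the two Step 1 queries), then join on country name.
combined_rows = conn.execute("""
    WITH gdp AS (
        SELECT geoareaname, timeperiod, value
        FROM un_sdg_indicators
        WHERE seriesdescription = 'Growth rate of real GDP per capita (%)/Annum'
          AND timeperiod = '2016'
    ),
    pop AS (
        SELECT country_name, year, value
        FROM wdi_indicators_data
        WHERE indicator_name = 'Population, total'
          AND year = 2016
    )
    SELECT gdp.geoareaname, gdp.timeperiod,
           gdp.value AS GDP_per_Capita_growth,
           pop.value AS WB_Population
    FROM gdp
    JOIN pop ON pop.country_name = gdp.geoareaname
    ORDER BY gdp.geoareaname
""").fetchall()

for row in combined_rows:
    print(row)
```

Only the 2016 rows for the two chosen series survive, one per country, with the growth rate and the population side by side.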

Benefits of Combining BigQuery Datasets

  • BigQuery supports interactive querying and gives you a consolidated view of the datasets you can access across projects.
  • Combining datasets lets you analyze scopes that are difficult to analyze together in the interface.
  • It provides an overall view of the datasets, with as many dimensions as your query requires.
  • It allows you to mix any dimension and scope while working with BigQuery datasets.

Limitations of Combining BigQuery Datasets

  • The combined data may contain a few outliers.
  • A few entries might be conspicuously absent from the results.
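One plausible cause of missing entries, offered here as an assumption, is that the two sources spell the join key differently, so an inner join on country name silently drops those rows. The sketch below uses Python with an in-memory SQLite database and invented rows to show how a LEFT JOIN can surface the unmatched names.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical rows: the second country is spelled differently in each table.
conn.executescript("""
    CREATE TABLE growth (country TEXT, gdp_growth REAL);
    CREATE TABLE population (country TEXT, population INTEGER);
    INSERT INTO growth VALUES ('Country A', 1.5), ('Republic of B', 2.0);
    INSERT INTO population VALUES ('Country A', 1000), ('B, Rep.', 2000);
""")

# LEFT JOIN keeps every growth row; rows with no partner get NULLs on the
# population side, so filtering on NULL lists the names that failed to match.
unmatched = conn.execute("""
    SELECT g.country
    FROM growth AS g
    LEFT JOIN population AS p ON p.country = g.country
    WHERE p.country IS NULL
    ORDER BY g.country
""").fetchall()

print(unmatched)
# [('Republic of B',)]
```

Running a check like this before trusting the joined result shows which names need a mapping between the two sources.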

Conclusion

BigQuery is a sophisticated and mature service that is feature-rich, economical, and fast. It also offers integration with Google Drive and the free Data Studio visualization toolset, which is very helpful for comprehension and analysis. It can process a huge amount of data within a few seconds. In this article, you have learned about public BigQuery datasets and how to combine them in two easy steps using the BigQuery SQL JOIN clause.