BigQuery is a data warehouse, which implies a great degree of centralization. You can run BigQuery SQL against a single BigQuery Dataset or against several at once. The benefits that BigQuery offers are most apparent when you combine BigQuery Datasets from completely different domains.
This article will teach you how to unify data from two of BigQuery’s publicly available datasets using SQL.
Google BigQuery is a cloud data warehouse that is managed by Google and is capable of analyzing huge amounts of data within seconds. If you have a working knowledge of SQL queries, you are already halfway to using it. There are numerous public datasets available that you can use to get hands-on experience.
To access and work on a BigQuery Dataset, you can use the GCP console or the classic web UI, the command-line tool, or calls to the BigQuery REST API through client libraries for languages such as .NET, Java, or Python.
Various third-party tools can also help you interact with BigQuery Datasets, for example to visualize or load data.
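As an illustration of the client-library route, here is a minimal Python sketch. It assumes the `google-cloud-bigquery` package is installed and that Google Cloud credentials are already configured. The helper names `make_client` and `fetch_rows` are hypothetical and exist only for this example.

```python
from typing import Any, Dict, List


def make_client() -> Any:
    """Create a BigQuery client (assumes google-cloud-bigquery and credentials)."""
    # Imported lazily so the rest of this sketch loads without the package.
    from google.cloud import bigquery

    return bigquery.Client()


def fetch_rows(client: Any, sql: str) -> List[Dict[str, Any]]:
    """Run a query and return each result row as a plain dict."""
    job = client.query(sql)  # submits the query job
    rows = job.result()      # waits for completion and yields the rows
    return [dict(row.items()) for row in rows]
```

A call such as `fetch_rows(make_client(), sql)` would run the SQL text held in `sql`. Keeping the client as a parameter is a design choice for this sketch: it lets you swap in a stub when testing without network access.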
What are the Publicly Available BigQuery Datasets?
A public dataset is a dataset that is stored in BigQuery and made available to the general public through Google’s Cloud Public Dataset Program. BigQuery hosts these datasets so that users can access them and integrate them into their applications. Two examples, both used later in this article, are the UN SDG indicators and the World Bank WDI datasets.
What is the ‘JOIN’ BigQuery SQL command?
To combine data in three or more BigQuery Datasets, you can set up a join between two tables, then build a join between either of those two tables and a third one, and so on until all of them are joined. The syntax of the JOIN clause that you will write depends on the size of the tables you plan on joining.
The JOIN operation merges two items so that the SELECT clause can query them as one source. The join condition specifies how rows from the two items are combined or discarded to form that single source.
For more detailed information about the JOIN command, see the BigQuery SQL documentation.
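To make the join behavior described above concrete, here is a small sketch that runs locally. It uses Python's built-in `sqlite3` module as a stand-in for BigQuery, so the dialect differs, but the inner JOIN semantics shown are the same. The tables `gdp_growth`, `population`, and `region` and all of their values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gdp_growth (country TEXT, growth REAL);
    CREATE TABLE population (country TEXT, people INTEGER);
    CREATE TABLE region     (country TEXT, region TEXT);

    INSERT INTO gdp_growth VALUES ('A', 1.5), ('B', 2.0), ('C', -0.5);
    INSERT INTO population VALUES ('A', 100), ('B', 200);
    INSERT INTO region     VALUES ('A', 'North'), ('B', 'South'), ('C', 'South');
""")

# Two tables: country 'C' has no population row, so the join discards it.
two_way = conn.execute("""
    SELECT g.country, g.growth, p.people
    FROM gdp_growth AS g
    JOIN population AS p ON p.country = g.country
    ORDER BY g.country
""").fetchall()

# Three tables: join the first two, then join a third onto one of them.
three_way = conn.execute("""
    SELECT g.country, g.growth, p.people, r.region
    FROM gdp_growth AS g
    JOIN population AS p ON p.country = g.country
    JOIN region     AS r ON r.country = g.country
    ORDER BY g.country
""").fetchall()

print(two_way)    # [('A', 1.5, 100), ('B', 2.0, 200)]
print(three_way)  # [('A', 1.5, 100, 'North'), ('B', 2.0, 200, 'South')]
```

Country 'C' is missing from both results because the join condition finds no matching row for it in `population`.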
Step-By-Step Guide to Combine BigQuery Datasets
Let’s learn to join two publicly available BigQuery Datasets using BigQuery SQL (Structured Query Language).
Prerequisites
- Working knowledge of BigQuery SQL.
- Familiarity with BigQuery as a platform.
- Good knowledge of the BigQuery Dataset that you will be working with.
Step 1: Have a clear goal of what data you want to fetch from the tables.
For example, the following are the public BigQuery Datasets that we are considering, along with the data we want from each:
- UN SDG = Growth rate of GDP per capita (%)/Annum
- World Bank WDI = Overall Population
Let’s write a BigQuery SQL command to select this data from the first BigQuery Dataset.
SELECT geoareaname, timeperiod, value
FROM `bigquery-public-data.un_sdg.indicators` AS un_sdg
WHERE seriesdescription = 'Growth rate of real GDP per capita (%)/Annum'
AND timeperiod = '2016'
Let’s write a BigQuery SQL command to select the desired data from the second BigQuery Dataset.
SELECT year, value, country_name
FROM `bigquery-public-data.world_bank_wdi.indicators_data` AS wb_wdi
WHERE indicator_name = 'Population, total'
AND year = 2016
Step 2: Now, write a SQL query that combines the desired data.
SELECT un_sdg.geoareaname, un_sdg.timeperiod, un_sdg.value AS GDP_per_Capita_growth,
  wb_wdi.country_name, wb_wdi.year, wb_wdi.value AS WB_Population
FROM `bigquery-public-data.un_sdg.indicators` AS un_sdg
JOIN `bigquery-public-data.world_bank_wdi.indicators_data` AS wb_wdi
  ON wb_wdi.country_name = un_sdg.geoareaname
WHERE un_sdg.seriesdescription = 'Growth rate of real GDP per capita (%)/Annum'
AND un_sdg.timeperiod = '2016'
AND wb_wdi.indicator_name = 'Population, total'
AND wb_wdi.year = 2016
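To see why the combined query needs both the join condition and the four filters, the following sketch reproduces its shape locally. It uses Python's built-in `sqlite3` module rather than BigQuery. The table names `un_sdg_indicators` and `wdi_indicators_data`, the countries "Xland" and "Yland", and every value are invented stand-ins, and only the columns used in the query are modeled.

```python
import sqlite3

SERIES = "Growth rate of real GDP per capita (%)/Annum"
POPULATION = "Population, total"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE un_sdg_indicators (
        geoareaname TEXT, timeperiod TEXT, seriesdescription TEXT, value REAL);
    CREATE TABLE wdi_indicators_data (
        country_name TEXT, year INTEGER, indicator_name TEXT, value REAL);
""")
conn.executemany(
    "INSERT INTO un_sdg_indicators VALUES (?, ?, ?, ?)",
    [
        ("Xland", "2016", SERIES, 1.2),
        ("Xland", "2015", SERIES, 0.9),
        ("Yland", "2016", SERIES, 3.4),
        ("Yland", "2016", "Some other series", 9.9),
    ],
)
conn.executemany(
    "INSERT INTO wdi_indicators_data VALUES (?, ?, ?, ?)",
    [
        ("Xland", 2016, POPULATION, 1000.0),
        ("Xland", 2015, POPULATION, 990.0),
        ("Yland", 2016, POPULATION, 2000.0),
        ("Yland", 2016, "Some other indicator", 5.0),
    ],
)

# Shared FROM/JOIN part: match rows on the country name only.
JOIN_SQL = """
    FROM un_sdg_indicators AS un_sdg
    JOIN wdi_indicators_data AS wb_wdi
      ON wb_wdi.country_name = un_sdg.geoareaname
"""

# Without filters, every row for a country pairs with every other row for it.
unfiltered_count = conn.execute("SELECT COUNT(*) " + JOIN_SQL).fetchone()[0]

# With the filters, one row per country remains for the chosen series and year.
rows = conn.execute(
    "SELECT un_sdg.geoareaname, un_sdg.value, wb_wdi.value "
    + JOIN_SQL
    + """
    WHERE un_sdg.seriesdescription = ?
      AND un_sdg.timeperiod = ?
      AND wb_wdi.indicator_name = ?
      AND wb_wdi.year = ?
    ORDER BY un_sdg.geoareaname
    """,
    (SERIES, "2016", POPULATION, 2016),
).fetchall()

print(unfiltered_count)  # 8
print(rows)              # [('Xland', 1.2, 1000.0), ('Yland', 3.4, 2000.0)]
```

Note that the time period is compared as a string on one side and the year as an integer on the other, mirroring the query above.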
Benefits of Combining BigQuery Datasets
- Google BigQuery’s architecture supports interactive dataset querying and gives you a consolidated view of the datasets you can access across projects.
- It brings together data whose scope is difficult to analyze in the interface alone.
- It provides an overall view of the Datasets, with as many dimensions as your query requires.
- It allows you to mix any dimension and scope while working with BigQuery Datasets.
Limitations of Combining BigQuery Datasets
- There might be a few outliers present in the BigQuery Dataset.
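- A few entries might be conspicuously absent from the results, for example when a country’s name is spelled differently in the two datasets and the join condition therefore finds no match.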
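One way to spot absent entries is to list the rows on one side that have no match on the other. The sketch below shows the idea with a LEFT JOIN, again using Python's built-in `sqlite3` module as a local stand-in for BigQuery. The tables `un_names` and `wb_names` and the country names in them are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE un_names (geoareaname TEXT);
    CREATE TABLE wb_names (country_name TEXT);

    INSERT INTO un_names VALUES ('Xland'), ('Republic of Yland'), ('Zland');
    INSERT INTO wb_names VALUES ('Xland'), ('Yland'), ('Zland');
""")

# LEFT JOIN keeps every left-hand row; unmatched rows get NULL on the right,
# so filtering on NULL lists exactly the names an inner join would drop.
unmatched = conn.execute("""
    SELECT u.geoareaname
    FROM un_names AS u
    LEFT JOIN wb_names AS w ON w.country_name = u.geoareaname
    WHERE w.country_name IS NULL
    ORDER BY u.geoareaname
""").fetchall()

print(unmatched)  # [('Republic of Yland',)]
```

Running a check like this before trusting the joined results shows which names would need to be reconciled by hand.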
BigQuery is a sophisticated and mature service that is feature-rich, economical, and fast. It also offers integration with Google Drive and a free Data Studio visualization toolset, which is very helpful for comprehension and analysis. It can process huge amounts of data within a few seconds. In this article, you have learned about public BigQuery Datasets and how to combine them in two easy steps using the BigQuery SQL JOIN command.