In AWS QuickSight, you can access various data by creating data sets. Author users in our organisation have been granted access to these data sets.
Occasionally, you might need to combine data from different data sets into a single report. For instance, if you want to view all assignments for a list of workers, but the assignment and people data are stored in separate data sets, this guide will show you how to join these data sets. By doing so, you'll be able to create new data sets that incorporate fields from multiple sources, enabling comprehensive reporting
Note: Only QuickSight Author users can access data sets
Data Set Examples:
Below are the data set examples I am going to use in this article
People: It includes basic information about individuals, such as People ID (a key to identify a person in the table), first name, surname, date of birth, email address, mobile number, gender, age, address, and more.
Assignment: It includes assignment information of the people, such as Assignment ID (a key to identify an assignment in the table), Assignment Number, Start Date, End Date, People ID (as reference), Client, Annual Leave and more.
Agencies: It contains customer information such as Agency ID (a key to identify an agency in the table), Agency Name, Assignment ID, Contact Email Address, Contact Person Name, Address and more.
Data Set Compatibility:
It is important to note that you can't join any two data sets randomly; they must be compatible on a one-to-many or many-to-many basis. Here are some examples to explain one-to-many compatibility using MDA data sets:
People and Assignments: This is a one-to-many relationship, as one person can have many assignments, but each assignment can only have one person.
Agencies and Assignments: This is also a one-to-many relationship, as one agency can have many assignments, but each assignment can only be associated with one agency.
People and Agencies: This is a many-to-many relationship, as an agency can have many people working for it, and a person can work for multiple agencies.
To join data sets, you need a common primary field present in both data sets. Based on the examples above, here are potential common primary fields:
People and Assignments: The Company People ID will be in both the assignment data set and the main people data set, serving as the key.
Agencies and Assignments: The Agency ID will be in both the assignment data set and the main agency data set, serving as the key.
People and Agencies: Since this is a many-to-many relationship, you would link People to Assignments and then Assignments to Agencies.
By ensuring compatibility and using common primary fields, you can effectively join data sets and utilise fields across multiple data sets in your reports.
Data Set Screen:
After logging into QuickSight, please navigate to the data set menu on the left-hand side. Select this menu to go to the data set screen. On this screen, you will see a list of data sets you have access to, along with an option to create a new data set.
Joining Data Set:
To join data sets, please select one data set (referred to as the first parent data set) from the list. Click on the dropdown next to "Use in Analysis" and then select the "Use In Dataset" option.
In this example, I will demonstrate a one-to-many relationship.
This action will take you to the data set creation screen. To add data, please click on the "Add data" button in the top right corner.
A pop-up window will appear to add data. Please select the "Data set" option, then choose the data set you want to connect (referred to as the second parent data set) and click the "Select" button in the pop-up.
Join Clause:
After selecting the data set, you will see the following screen.
Next, click on the two pink circles to configure the join between the data sets. This will bring up the “Join Configuration” menu:
As previously explained, in a one-to-many relationship, there will be a shared ID. In the given example, the People dataset represents the "one" side, while the Assignments dataset represents the "many" side. Therefore, there will be a shared ID that corresponds to the People dataset, specifically the Company People ID in this case. To ensure a proper join, please locate and select the corresponding shared ID in both dropdowns.
Note: In some cases, identifying the IDs may not be straightforward. This could occur due to compatibility issues between datasets, differing ID names across datasets, or even missing IDs altogether. If you encounter any of these situations, we recommend contacting our help desk for assistance and guidance.
Join Type:
This step determines which records the datasets will retrieve. I will now outline the different types below
Tip: The join types are based on Database query language joins
Inner: It will return records with matching values in both data sets. For the above example, it would only pull through people who have an assignment
Left: It will return all records from the left data set and show matching records from the right data set. For the above example, it will pull through all people, but will only show assignment data if they have any, if a person doesn't have an assignment, it will show a blank
Right: It is the same as the “Left” but swaps around the data sets.
Full: It will return records from both data sets, display values for matched rows and null for unmatched rows
As a general guideline, it is recommended to choose a "Left" join when the primary dataset ('one' dataset) is positioned on the left side, and opt for a "Right" join when the primary dataset is on the right side. Alternatively, employing a "Full" join ensures comprehensive inclusion of data from both datasets.
Use "Query Mode" to be SPICE and add the new data set title at the top corner. Once you have done everything, click on the "Apply" button to save the child data set.
Adding more data sets:
To expand your dataset, follow these steps for each additional dataset:
Add a New Dataset: Begin by adding the new dataset using the same process outlined previously.
Establish Temporary Joins: Initially, connect the new dataset to another dataset using a temporary join. This allows you to define the correct relationships later.
Adjust Connections: Utilise the "2x7 dots" icon to drag and move datasets as needed. This step is crucial for managing linear connections between multiple datasets.
For example, people may have multiple assignments and assignments may contain information about agencies.
While dragging above the parent (assignment), if it turns green, then you can let go and the two data sets will join together.
Then you need to redefine the join again. The result will look like below.
The above example will provide people with their assignment and agency information.
Saving the data set:
To rename the data set, simply click on the name located in the top left corner. By default, the new data set inherits the name of the original parent data set. Therefore, it is advisable to rename the new data set (referred to as the child data set) to ensure clarity and differentiation between the two data sets.
Once you have renamed the data set, click “SAVE & PUBLISH” in the top right corner. "Publish & Visualize" will publish and create an analysis using this data set.
Setting refresh schedules:
To ensure your data remains up-to-date, set up refresh schedules to automatically update your data set at specified intervals.
After saving the child data set, navigate to the data set listing screen and select the newly created data set.
On the new page, click on the refresh tab. Here, you can also see wherever the data set has been used and from which data sets it has been created
You can add a schedule by clicking the “ADD NEW SCHEDULE” button in the top right, otherwise, the data won’t refresh automatically
Timezone: Europe/London
Start time: Set the time to be the next interval at whatever time you want each hour for the data set to refresh (this must be a DateTime in the future).
Frequency: Hourly
Highlight: Your parent datasets refresh according to the schedule you've specified. If you schedule the child dataset to refresh simultaneously with the parent dataset, it might result in refresh failures. It's advisable to schedule the refresh of the child dataset after the parent dataset has completed its refresh cycle. If you are unsure about the refresh schedules of your parent datasets, please contact our helpdesk for assistance.
To modify or remove the current refresh schedule for a dataset, click the three dots located under the “Actions” column in the “Schedules” table.
Congratulations! You have successfully connected two datasets. If you need further help, feel free to jump in with any questions or comments; would love to hear them and have a discussion! 😊