DocumentDB
Overview
Amazon DocumentDB is a fully managed document database service that is compatible with the MongoDB 3.6, 4.0, and 5.0 APIs. The compatibility is partial: existing MongoDB applications and tools generally work with it, but it does not support every MongoDB feature.
See this page for more information on compatibility.
https://docs.aws.amazon.com/documentdb/latest/developerguide/compatibility.html
For detailed compatibility information and functional differences, consult the official AWS documentation at https://aws.amazon.com/documentdb/.
Refer to the MongoDB API tutorial, which uses core MongoDB API querying concepts, for sample DocumentDB interactions.
AWS Sample Movies Data
AWS has a sample set of data containing movie information. It can be found at this page.
https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/resources/sample_files
Information about loading this data set is in the Qarbine Administrator’s guide for configuring AWS DocumentDB within the “Tutorial Data” section. This tutorial assumes a data service named “docDB” has been defined and its default database is “samples”.
Defining a Data Source
Overview
A Data Source is a Qarbine component responsible for retrieving data from somewhere. At a high level it has a name, a description, and a query string which, when sent to the associated Qarbine Data Service endpoint, returns data. The overall execution flow for an analysis, including the optional prompt component, is shown below.
A single data source can be referenced by name from multiple Qarbine template components. This provides a single point of change when, for example, an index is added or some other query tweak is necessary. The alternative is to track down every template impacted by a schema or index change. This component reusability is especially beneficial when team members have varying roles and skills.
Query Language
DocumentDB (with MongoDB compatibility) uses the MongoDB Query Language (MQL) as its query language. Answer sets are lists of documents, which are arbitrary JSON objects. There is no default rule that these JSON objects must be similar in structure. This dynamic answer set behavior can be quite cumbersome for legacy tools, which require 2-dimensional homogeneous answer sets.
Open a Data Source Designer and then select the data service specified by your Qarbine Administrator, who may have given the data service a different name or may not have installed the sample data. Below are the top 2 dropdowns with example selections.
Just below them and to the left is a fully expanded sample schema for the movies collection.
The “info” field is an embedded document. It in turn has 3 embedded arrays: actors, directors, and genres. Those arrays contain simple strings.
A sample document with its embedded content fully expanded is shown below.
The embedded document and its arrays would be a real burden for legacy SQL tools and general analytics tools as well. Qarbine is fine with them and has options for manipulating the answer set shape as well.
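As a rough sketch of the shape described above, the object below uses only field names mentioned in this tutorial. Every value is a made-up placeholder, not real content from the AWS sample data, and the real documents carry additional fields.

```javascript
// Illustrative shape only: field names follow this tutorial,
// every value is a placeholder and not real sample data.
const sampleMovie = {
  _id: "placeholder-id",
  year: 2000,
  title: "Example Title",
  info: {
    rating: 8.7,
    running_time_secs: 5400,
    actors: ["Actor One", "Actor Two"],
    directors: ["Director One"],
    genres: ["Drama"]
  }
};

// Dotted access reaches into the embedded document and its arrays.
console.log(sampleMovie.info.directors[0]); // Director One
console.log(sampleMovie.info.actors.length); // 2
```

A flat, 2-dimensional row cannot hold the three arrays directly, which is why this shape is awkward for SQL-oriented tools.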
For the query specification, enter the following.
db.movies.find( { "info.rating" : { $gt: 8.5} } , { } ).sort( { "info.rating" : 1} )
Below are some of the sample results.
Below are details of a selected document.
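To make the filter and sort in the query above concrete, here is a minimal in-memory sketch of the same logic in plain JavaScript. The movies array and the highlyRated function are hypothetical stand-ins with placeholder values. The database evaluates the real query, not this code.

```javascript
// Hypothetical in-memory stand-in for the movies collection (placeholder values).
const movies = [
  { title: "A", info: { rating: 9.1 } },
  { title: "B", info: { rating: 7.9 } },
  { title: "C", info: { rating: 8.6 } },
  { title: "D", info: { rating: 8.5 } }
];

// { "info.rating": { $gt: 8.5 } } keeps documents whose rating is strictly above 8.5.
// sort({ "info.rating": 1 }) orders ascending, so the lowest matching rating comes first.
function highlyRated(docs, threshold) {
  return docs
    .filter(d => d.info && typeof d.info.rating === "number" && d.info.rating > threshold)
    .sort((a, b) => a.info.rating - b.info.rating);
}

const result = highlyRated(movies, 8.5);
console.log(result.map(d => d.title)); // [ 'C', 'A' ]
```

Document D is excluded because $gt is a strict comparison. Using -1 instead of 1 in the sort specification would list the highest rated movies first.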
Notice that much of the information is within the embedded info field document. Qarbine provides options to manipulate the answer set using “pragmas”. We can use the “pullFieldsUp” pragma to pull all of the subfields up a level to simplify the result shape. In the layout, instead of accessing the directors array via @current.info.directors or #info.directors, we can simply use @current.directors or #directors.
Here is the updated query.
#pragma pullFieldsUp info
db.movies.find( { "info.rating" : { $gt: 8.5} } , { } ).sort( { "info.rating" : 1} )
Running this, the answer set shape is now the following.
A sample element is shown below with the subfields of the “info” field pulled up a level.
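The effect of pulling fields up can be sketched in plain JavaScript. This is an illustration of the resulting shape only, not Qarbine's implementation. The function below reuses the pragma's name for readability, and the rule for handling name clashes is an assumption.

```javascript
// Illustration only: this is NOT how Qarbine implements the pragma.
// It copies the subfields of doc[fieldName] to the top level and drops the wrapper.
// Assumption: on a name clash the existing top-level value is kept.
function pullFieldsUp(doc, fieldName) {
  const { [fieldName]: embedded, ...rest } = doc;
  if (embedded === null || typeof embedded !== "object" || Array.isArray(embedded)) {
    return { ...doc }; // nothing to pull up, return an unchanged copy
  }
  return { ...embedded, ...rest };
}

// Placeholder document, not real sample data.
const before = {
  year: 2000,
  title: "Example Title",
  info: { rating: 8.7, directors: ["Director One"] }
};
const after = pullFieldsUp(before, "info");

console.log(after.directors); // [ 'Director One' ]
console.log("info" in after); // false
```

After the transformation the directors array sits at the top level, which mirrors the change from @current.info.directors to @current.directors in the layout.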
Managing Answer Set Size
The default maximum number of rows starts off at 25 for a new data source. This is useful to evolve a query from a concept to one that you have verified returns the desired answer set. As noted, any native way of limiting an answer set size is the preferred approach. This setting is in the component dialog as shown below and also accessible by clicking the ‘Gear’ icon.
Once you are done drafting you can adjust this parameter. A “0” indicates there is no maximum. A number greater than 0 indicates to limit the final answer set size to that number of rows. This answer set truncation comes after any native query limit. So, if the answer set from the data endpoint is quite large, that content has to be returned to the Qarbine host. It then may truncate the number of rows. It is best to truncate at the query level (i.e., use a limit) to reduce the content sent from the data endpoint to the Qarbine host in the first place.
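The difference between limiting at the query level and truncating afterwards can be sketched as follows. The mongosh query in the comment uses the standard limit() cursor method. The endpointQuery and truncate functions are hypothetical stand-ins for the data endpoint and the maximum rows setting, used only to count how many rows cross the wire.

```javascript
// In mongosh the native limit would be appended to the tutorial query:
//   db.movies.find({ "info.rating": { $gt: 8.5 } }, { }).sort({ "info.rating": 1 }).limit(25)

// Hypothetical stand-in for the data endpoint (0 means no native limit).
function endpointQuery(docs, nativeLimit) {
  return nativeLimit > 0 ? docs.slice(0, nativeLimit) : docs.slice();
}

// Hypothetical stand-in for the maximum rows setting (0 means no maximum).
function truncate(rows, maxRows) {
  return maxRows > 0 ? rows.slice(0, maxRows) : rows;
}

// Pretend 1000 documents match the query.
const matching = Array.from({ length: 1000 }, (_, i) => ({ n: i }));

// Truncation only: all 1000 rows are transferred, then 25 are kept.
const transferredA = endpointQuery(matching, 0);
const finalA = truncate(transferredA, 25);

// Native limit: only 25 rows are transferred in the first place.
const transferredB = endpointQuery(matching, 25);
const finalB = truncate(transferredB, 0);

console.log(transferredA.length, finalA.length); // 1000 25
console.log(transferredB.length, finalB.length); // 25 25
```

Both approaches end with the same 25 rows, but the native limit moves far less data between the endpoint and the host.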
Adjusting the Maximum Rows
Recall the default maximum rows at the component level is 25. When you are satisfied with your query you can change that setting by clicking.
Adjust the setting to “0” indicating no Qarbine answer set truncation.
Click
Saving Your Component
This can be saved in the catalog as a data source named “Movies with ranking higher than 8.5”.
Click
Navigate to the target catalog folder within the dialog.
Fill in the name, description and other properties as desired.
Click
Defining an Analysis Template
Overview
A template defines how to process the data being retrieved from Data Source queries and other data expressions. It also defines formulas, formatting options, and other analysis and presentation options. The overall execution flow for an analysis, including the optional prompt component, is shown below.
In this example we will discuss how the output below was obtained through a Qarbine template.
Generating an Initial Template
In the Data Source Designer click on the toolbar icon highlighted below.
This option generates an initial layout based on the structure of the answer set. In the presented dialog accept the template name as provided.
In the bottom half blank out the Data Source name so that the data source we just defined is referenced by the soon to be generated layout.
Then click
A Catalog Dialog appears to obtain the target folder. Navigate to the intended folder. You may also create new folders in this dialog if the appropriate permissions are in place.
Click to proceed.
A template will be generated that references the data source. A prompt appears for which tools to open after the generation.
Leave the Template Designer checked and then click
A Template Designer is opened with the generated template. Below is an example template.
Reviewing the Data Retrievals
On the right-hand side of the Template Designer, choose the dropdown noted below to review the data retrieval strategy of the template.
There is the main retrieval from the data source. Each of the embedded arrays has a group level formula.
Running the Template
Run this initial template by clicking
Below are sample results as a starting point.
The generated template is a good starting point. It saves a lot of typing; from here we adjust the cells a bit and add some formatting.
Adjusting the Initial Template
Here are a few tweaks to the initial template cells:
- For the first body line remove the “_id” label and its value and place the bolded year and title values on that line.
- Remove the info.release_date value as it is redundant with year.
- Move the embedded array output to the right.
- Change the running_time_secs label to “Duration” and show the value in hours via
=concat(format(@main.running_time_secs/3600, 'number', '#.0') , ' hours')
- Shrink some of the numeric value cells, such as the year value, which is only 4 digits wide.
- The image URLs in the AWS sample data are invalid. Remove the line with that information. See the Template Designer tutorials section for how to add images to templates.
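The arithmetic behind the Duration formula in the list above can be sketched in plain JavaScript. This is a rough equivalent, not Qarbine's formula engine. It assumes the '#.0' pattern means one digit after the decimal point, and durationLabel is a hypothetical helper name.

```javascript
// Rough JavaScript equivalent of the Duration formula; not Qarbine's formula engine.
// Assumption: the '#.0' pattern means one digit after the decimal point.
function durationLabel(runningTimeSecs) {
  const hours = runningTimeSecs / 3600; // 3600 seconds per hour
  return hours.toFixed(1) + " hours";
}

console.log(durationLabel(5400)); // 1.5 hours
console.log(durationLabel(7200)); // 2.0 hours
```

Rounding of borderline values may differ between toFixed and the Qarbine format function, so treat the output as an approximation of what the template shows.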
The general template layout is shown below.
Running this we now get the following for one of the documents.
Saving Your Changes
To save your updates click
Since the component was already in the catalog no prompt was presented for the catalog path.
Next Steps
Querying Your Database
For database-specific interaction guides, navigate to
http://doc.qarbine.com/docs/category/data-source-designer