Skip to main content

DataStax

Overview

Qarbine uses the vector enabled API rather than the legacy ones which have been deprecated per https://docs.datastax.com/en/astra-db-serverless/api-reference/legacy-overview.html

DataStax’s native query interface uses a programmatic interface to retrieve data. The general use case is to identify a collection with data of interest. Next a filter and/or a vector of values is used to indicate a point in n-space from which to locate similar data. It is recommended that the query set at most how many items to retrieve using the “limit:” field.

Qarbine provides 2 query options for DataStax:

  1. JSON query structure and
  1. SQL-like request.

The latter is an interface that resides above the JSON query structure API. It can be more convenient to specify criteria using SQL than the more complex JSON structure approach. Most of the DataStax features are still accessible when using the SQL syntax so there is no loss of DataStax features or complex middleware in play.

JSON Based Query Structure

One way to specify a DataStax query in Qarbine is to use a JSON-like structure which looks like the following.

{
collection: string,
$vector: [...],
$vectorize: string,
nearText: string,
useAssistant: string,
nearVector: [...],
filter: { },
returnProperties: [ ],
projection: { },
includeSimilarity: boolean,
includeSortVector: boolean,
skip: number,
limit: number,
sort: { },
maxTimeMS: number,
namespace: string
}

For vector searches either the nearText, nearVector, $vector, or $vectorize field must be defined. If a nearText value is set, then the named Qarbine AI Assistant is used to obtain the embedding. If the $vectorize value is set, then the DataStax integration defined for the collection is used to obtain the embedding.

Remember the number of values in the vector must correspond to the same number within the index’s underlying data. Preferably the exact same model used to persist embedding sis used to obtain an embedding for querying.

Primary Options

The primary specification options are described below.

Field Description
collectionThe collection to perform the query upon.
$vectorThe list of embeddings from which to determine similarity.
nearVectorThe equivalent of $vector.
$vectorizeThe phrase to obtain an embedding for to use in the query. Preferably this is actually set in the sort object.
nearTextThe value is a string with the similarity phrase. For example, “dracula movies”. An embedding value for the nearText argument will be obtained by Qarbine using a configured Qarbine AI Assistant. When using this option, the model used to insert the data must correspond to the one used by the Qarbine AI Assistant.
useAssistantFor nearText, this indicates a specific AI Assistant to use to obtain the embedding.
returnPropertiesThe list of field names to return for each matching element. An example is [“color”]. By default any vector field is not returned in order to reduce the answer set size. If you want the vector values, then explicitly list it here or use the includeSortVector flag. To get all the fields and the vector field use something like the following,
returnProperties: [ '*', 'vector']
projectionThe object indicating what fields to include or exclude See the discussion below for details.
includeSimilarityA boolean indicating whether to include the similarity score in the results
includeSortVectorA boolean indicating whether to include the vector list in the results.
filterThe filtering object. See the section below for more information.
skipIt indicates how many objects of the answer set to skip over as part of the returned answer set.
limitThe maximum number of matches to return.
sortThe object indicating how to order the results.
maxTimeMsThe maximum time to run the query before timing out.

Filtering Data

The filter argument is a JSON object representing the criteria. In brief, the filtering offers a subset of operators similar to MongoDB.

Operator Description
$eqEqual to (number, string, boolean)
$neNot equal to (number, string, boolean)
$gtGreater than (number)
$gteGreater than or equal to (number)
$ltLess than (number)
$lteLess than or equal to (number)
$inIn array (string or number)
$ninNot in array (string or number)

These operators can be combined with $and and $or. The details on using operators are at
https://docs.datastax.com/en/astra-db-serverless/api-reference/documents.html#operators

There are several items to note on the following liinks
https://docs.datastax.com/en/astra-db-serverless/api-reference/document-methods/find.html

https://docs.datastax.com/en/astra-db-serverless/api-reference/document-methods/find.html#find-documents-using-filtering-options

Some things to keep in mind are:

  • Sort and filter operations can use only indexed fields.

Finding Distinct Values

Note that distinct is a client-side operation, which effectively browses all required documents using the logic of the find command, and then collects the unique values found for key. Below is a framework for the JSON specification.

{
collection: "movies",
distinct: field,
filter: { },
maxTimeMS: number,
limit: number,
sortValues: 'asc', ← asc, ascIns, desc, or descIns (“Ins” is for case insensitive)
}

A sample JSON specification is shown below.

{
collection:"movies",
distinct: 'criticname',
filter: { criticname: { $ne: 'Avi Offer' } },
sortValues: 'asc',
// This limit applies after all the above has been done.
limit: 10
}

An equivalent SQL syntax version is shown below

SELECT DISTINCT criticname
FROM movies
WHERE criticname <> 'Avi Offer'
order by criticname asc
limit 10

Managing What Data is Returned

The native DataStax field used to control this is “projection”. This key\value pair

projection: { _id: 1, foo: 1, bar: 1 }

is the same as the “returnProperties” field shorthand option of

returnProperties: [ ‘_id’, ‘foo’, ‘bar’]

You can mix the two with the returnProperties taking precedence of projection ones.

For complete details on DataStax projection option see https://docs.datastax.com/en/astra-db-serverless/api-reference/documents.html#projection-operations

For regular fields, a projection can’t mix include and exclude projections. It can contain only true or only false values for regular fields. The special fields _id, $vector, and $vectorize have individual default inclusion and exclusion rules. You can set the projection values for these independently of regular fields.

The _id field is included by default. The $vector and $vectorize fields are excluded by default. The $similarity key isn’t a document field, and you can’t use this key in a projection. The $similarity value is the result of a vector ANN search operation with $vector or $vectorize. Use the includeSimilarity parameter to control the presence of $similarity in the response.

Qarbine SQL Interface

Overview

DataStax supports semantic (i.e. vector) search and a lexical (i.e. scalar/matching) search. The use of the specification structure described above can be a bit verbose and cumbersome though. To improve readability and productivity when authoring DataStax retrievals, Qarbine provides a SQL oriented option. The Qarbine SQL interface below translates the SQL clauses into the equivalent native DataStax filter structure. Use the explain feature described below to confirm this. If things are not what you expect, then use the JSON structure interface.

SQL Oriented Filtering

Here is an example of a specification based search retrieval.

{
collection:"movies",
filter: { criticname: { $ne: 'Avi Offer' } },
limit: 10
}

The equivalent Qarbine SQL is simply

SELECT *
FROM movies
WHERE criticname <> 'Avi Offer'
limit 10

For many retrievals the SQL approach is a much more approachable way to specify the criteria. The mapping of the standard SQL clauses to their DataStax equivalents is described below.

Clause Description
SELECTThe names of the fields to return. Specifying “*” indicates all object fields . This does not set the returnProperties field in which case DataStax returns all of the properties. Here are some examples.
SELECT * …
SELECT _id, title, rating …
SELECT _id, title, rating, similarity

Note there is no leading $ in the “similarity” column reference to set includeSimilarity to true. Likewise, specifying “vector” rather than “$vector” sets the includeSortVector flag to true.
FROMThe name of the collection. This value sets the “collection” field in the query specification.
WHEREThe effect is to set the “filter” field of the query specification.
ORDER BYThe sorting rules in “column asc|desc” format.
OFFSETIndicates where in the the return objects start return objects. This sets the “skip” field of the query specification.
LIMITIndicates at most how many elements to return. This sets the “limit” field of the query specification.

Bear in mind that some combinations of query fields may not make sense in the DataStax world.
The WHERE clause criteria can be in a variety of traditional SQL forms and may include Qarbine specific functions described below. For example,

select * from Movies 
where nearText("dracula")

results in a query specification with these fields,

collection: "Movies",
nearText: "dracula"

Some additional Qarbine defined SQL functions are listed below.

nearVector(number1, number n …)
nearText(aPhrase)
vectorize(aPhrase)
vector = (number 1, number n ...)A different way of expressing nearVector()

Note that a SQL list is enclosed in parentheses while one in the JSON specification is enclosed in brackets. That is a subtle nuance across the SQL and JSON syntax standards.

The table below describes their use and a few others.

Function Description
nearVectorThis clause is removed from the WHERE criteria and its list of numbers argument set into the “nearVector” field of the query specification.
nearTextThis clause is removed from the WHERE criteria and its argument set into the “nearText” field of the query specification.
vectorizePass in a string argument that ends up in the query specification as “$vectorize: theString”.
vector = (...)Provide the embedding enclosed in parenthesis. This ends up in the query specification as “$vector: [...]”.
withOptionPass in the specification field name and the value to set. This clause is removed from the WHERE clause. This is a way to set a query specification field that is not readily translated from the SQL world.
withOptionsSet several specification fields at once. The format is withOptions(key1, value1, keyN, valueN).The key argument may use dot notation when setting the inner value of a component object.

There are several SQL functions which are used to easily access the DataStax filter features. They are listed below. For reference see the DataStax page at https://docs.datastax.com/en/astra-db-serverless/api-reference/documents.html#operators

Function Description
fieldExists(field)This maps to the DataStax $exist clause. It matches documents that have the specified property.
{ "title" : {"$exists" : true} }
An example use is shown below.
where fieldExists('title')
This can be combined with the NOT operator as shown below.
where not fieldExists('title')
This results in the DataStax clause
{ "title" : {"$exists" : false} }
size(field, howMany)This maps to the DataStax $size clause. It selects documents where the array has the specified number of elements.
{ "someList" : {"$size" : 2 } }
An example use is shown below.
where size('someList', 2)
listIs(field, values)This maps to the DataStax $all clause. It matches arrays that contain all elements in the specified array.
{ "food" : {"$all" : ['apple', 'orange'] } }
An example use is shown below.
where listIs(‘food’, ('apple', 'orange') )
An alternative is to use the following syntax.
where food = ('apple', 'orange')
DataStax may perform optimizations when $all is used so it is the recommended approach.

Blending JSON and SQL

There are techniques to blend the ease of using SQL along with the powerful features of DataStax within a Qarbine JSON specification object. The table below lists the fields that drive this definition.

JSON Field Description
sqlThe SQL statement can affect most all of the primary options listed above.
sqlWhereThe string can affect mainly the filter option above.

Here is a simple example of combining the SQL and query specification approaches. The effective result is the same as the example query specification above.

{
collection:"movies",
sqlWhere: "criticname = 'Avi Offer'",
limit: 10
}

Vector Queries

Prerequisites

Prior to using Qarbine’s embeddings(...) macro function or the SQL-like query function nearText(...), the Qarbine Administrator must first configure “AI Assistant(s)”. The AI Assistants provide access to various popular Generative AI services and are referenced using an alias. Check with your Qarbine administrator for which ones are available and their proper use. For example, when using dynamic query vector embeddings, the model used by the AI Assistant must be compatible with the one used to generate the original embedding values in the database.

Searches

For details on DataStax vector searches see the information at
https://docs.datastax.com/en/astra-db-serverless/databases/vector-search.html

There are several ways to use vector values in a query:

  1. Explicitly embedding values.
  1. Use DataStax $vectorize for the associated DataStax AI integration to obtain the embeddings.
  1. Use Qarbine’s preprocessing to obtain the embeddings from a Qarbine AI Assistant configured by the Qarbine administrator.

The table below maps between the JSON structure and the SQL equivalent.

JSON Structure SQL Clause
sort: { $vector: [0.15, 0.1, 0.1, 0.35, 0.55] }nearVector(0.15, 0.1, 0.1, 0.35, 0.55)
sort: { $vectorize: "foo" }vectorize(“foo”)
sort: { $vector: [! embeddings("foo", "myOpenAi") !] }nextText(“foo”)

Reviewing the Generated Specification

You can enter criteria of the form “EXPLAIN SELECT ….” to have the SQL statement processed and have the returned answer set be the underlying query specification. For example enter and run

explain
select * from movies
where title <= 'Paa'
order by title limit 20

Select the single result element and its details are shown to the right.

  

If necessary, click the “+” to expand all of the JSON object fields.

A convenient way of specifying this is to have “explain” on the first line and the rest of your SQL on the next lines.

explain
select * from movies
where title <= 'Paa'
order by title limit 20

Then simply “comment out” the first line when not in use

// explain
select * from movies
where title <= 'Paa'
order by title limit 20

You can also use “explain: true” in the JSON query specification for similar information.

Another way to get the specification is to press ALT and click   . Below is a sample result.

  

Any “explain SELECT” or “explain: true” takes precedence over the ALT-click interaction.

Qarbine Virtual Queries

There are a few convenience queries which are mainly DBA oriented. These queries are recognized by the Qarbine driver and provide common database information.

Query Description
list databasesReturn a list of databases.
describe databaseReturn the details of the default database.
describe databasesReturn the details of all the visible databases.
list collectionsReturn a list of collections.
describe collectionsProvide details on all of the collections. This may take a while depending on your database structure.
describe collection COLLECTIONProvide details on the given collection.
list keyspacesReturn a list of keyspace names.
tally collectionsReturn the estimated document count for all collections.
tally collection COLLECTIONReturn estimated document count for the given collection.

See the “DBA Productivity” section of the online documentation for more details.

Troubleshooting

Node.js

Below is a sample node.js program framework to evaluate your queries.

var { DataAPIClient, VectorDoc, UUID } = require('@datastax/astra-db-ts');
const client = new DataAPIClient(YOUR_APPLICATION_TOKEN);
const db = client.db(YOUR_ENDPOINT);
(async function () {
const collection = await db.collection('vector_test');
const cursor = await collection.find(
{ },
{
sort: { $vector: [0.15, 0.1, 0.1, 0.35, 0.55] },
limit: 10,
includeSimilarity: true,
});
console.log('* Search results:');
for await (const doc of cursor) {
console.log(JSON.stringify(doc, null, 2) );
console.log();
}
})();

CQL Console

Some queries can be reviewed using the CQL Console. Sign on to Datastax and navigate to your database.

Open the CQL console by clicking

  

Type in, paste or otherwise edit the query.

Run the query.

Review the output and adjust as necessary.