When working with JSON data stored as strings in a Databricks table, you might need to extract specific fields and perform computations on them. In this guide, we’ll explore how to extract the total_cost
field from a JSON string column named metadata
and calculate its average using SQL in Databricks.
The Challenge
Suppose you have a table my_table
with a column metadata
containing JSON strings like:
{"total_cost": "150.75", "other_data": "..."} |
Your goal is to:
- Access the
total_cost
key within the JSON string. - Convert the extracted value to a numeric type.
- Calculate the average of all
total_cost
values.
Solution Overview
We’ll use the following functions:
get_json_object
: Extracts a JSON value as a string.TRY_CAST
: Safely converts a string to a numeric type, returningNULL
if the conversion fails.AVG
: Calculates the average of numeric values, ignoringNULL
s.
SQL Query
Here is the SQL query to achieve this:
SELECT |
Explanation
get_json_object(metadata, '$.total_cost')
:- Extracts the value associated with the
total_cost
key from themetadata
JSON string. - The
'$.total_cost'
argument specifies the path to thetotal_cost
field in the JSON object.
- Extracts the value associated with the
TRY_CAST(... AS DOUBLE)
:- Attempts to convert the extracted
total_cost
string to aDOUBLE
(a floating-point number). - If the conversion fails (e.g., if the value is not a valid number), it returns
NULL
instead of throwing an error.
- Attempts to convert the extracted
AVG(...)
:- Computes the average of all the
DOUBLE
values. - Automatically ignores
NULL
values, so any failed conversions won’t affect the result.
- Computes the average of all the
Steps Breakdown
Extract the
total_cost
Value:- Use
get_json_object
to parse the JSON string and retrieve thetotal_cost
value. - Example:
get_json_object('{"total_cost": "150.75"}', '$.total_cost') -- Returns "150.75"
- Use
Convert to a Numeric Type:
- Use
TRY_CAST
to safely convert the extracted string to aDOUBLE
. - Example:
TRY_CAST("150.75" AS DOUBLE) -- Returns 150.75 (as a DOUBLE)
- If the string is not a valid number,
TRY_CAST
returnsNULL
.
- Use
Calculate the Average:
- Use
AVG
to compute the average of allDOUBLE
values. AVG
ignoresNULL
values, so any non-numeric or missingtotal_cost
entries won’t affect the calculation.
- Use
Example
Assume my_table
contains the following metadata
values:
metadata |
---|
{"total_cost": "100.50", "other_data": "..."} |
{"total_cost": "200.25", "other_data": "..."} |
{"total_cost": "invalid_number", "other_data": "..."} |
{"other_data": "..."} |
Applying the query:
Extract and Convert:
Extracted total_cost
Converted to DOUBLE
"100.50"
100.50
"200.25"
200.25
"invalid_number"
NULL
NULL
(missing key)NULL
Compute Average:
- Average of
100.50
and200.25
(ignoringNULL
s):(100.50 + 200.25) / 2 = 150.375
- Average of
Conclusion
By combining get_json_object
, TRY_CAST
, and AVG
, you can efficiently extract numeric values from JSON strings and perform aggregate calculations in Databricks SQL. This method handles invalid or missing data gracefully, ensuring accurate and reliable results.