How to evaluate json string column in databricks sql


When working with JSON data stored as strings in a Databricks table, you might need to extract specific fields and perform computations on them. In this guide, we’ll explore how to extract the total_cost field from a JSON string column named metadata and calculate its average using SQL in Databricks.

The Challenge

Suppose you have a table my_table with a column metadata containing JSON strings like:

{"total_cost": "150.75", "other_data": "..."}

Your goal is to:

  • Access the total_cost key within the JSON string.
  • Convert the extracted value to a numeric type.
  • Calculate the average of all total_cost values.

Solution Overview

We’ll use the following functions:

  • get_json_object: Extracts a JSON value as a string.
  • TRY_CAST: Safely converts a string to a numeric type, returning NULL if the conversion fails.
  • AVG: Calculates the average of numeric values, ignoring NULLs.

SQL Query

Here is the SQL query to achieve this:

SELECT 
AVG(TRY_CAST(
get_json_object(metadata, '$.total_cost') AS DOUBLE
)) AS average_total_cost
FROM
my_table;

Explanation

  • get_json_object(metadata, '$.total_cost'):

    • Extracts the value associated with the total_cost key from the metadata JSON string.
    • The '$.total_cost' argument specifies the path to the total_cost field in the JSON object.
  • TRY_CAST(... AS DOUBLE):

    • Attempts to convert the extracted total_cost string to a DOUBLE (a floating-point number).
    • If the conversion fails (e.g., if the value is not a valid number), it returns NULL instead of throwing an error.
  • AVG(...):

    • Computes the average of all the DOUBLE values.
    • Automatically ignores NULL values, so any failed conversions won’t affect the result.

Steps Breakdown

  1. Extract the total_cost Value:

    • Use get_json_object to parse the JSON string and retrieve the total_cost value.
    • Example:
      get_json_object('{"total_cost": "150.75"}', '$.total_cost')  -- Returns "150.75"
  2. Convert to a Numeric Type:

    • Use TRY_CAST to safely convert the extracted string to a DOUBLE.
    • Example:
      TRY_CAST("150.75" AS DOUBLE)  -- Returns 150.75 (as a DOUBLE)
    • If the string is not a valid number, TRY_CAST returns NULL.
  3. Calculate the Average:

    • Use AVG to compute the average of all DOUBLE values.
    • AVG ignores NULL values, so any non-numeric or missing total_cost entries won’t affect the calculation.

Example

Assume my_table contains the following metadata values:

metadata
{"total_cost": "100.50", "other_data": "..."}
{"total_cost": "200.25", "other_data": "..."}
{"total_cost": "invalid_number", "other_data": "..."}
{"other_data": "..."}

Applying the query:

  1. Extract and Convert:

    Extracted total_cost Converted to DOUBLE
    "100.50" 100.50
    "200.25" 200.25
    "invalid_number" NULL
    NULL (missing key) NULL
  2. Compute Average:

    • Average of 100.50 and 200.25 (ignoring NULLs):
      (100.50 + 200.25) / 2 = 150.375

Conclusion

By combining get_json_object, TRY_CAST, and AVG, you can efficiently extract numeric values from JSON strings and perform aggregate calculations in Databricks SQL. This method handles invalid or missing data gracefully, ensuring accurate and reliable results.


Author: robot learner
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source robot learner !
  TOC