Databricks destination is not storing JSON values as strings correctly
Which destination?: Databricks (both managed & unity catalog)
Additional details:
Currently, if the input is a postgres JSON column, the JSON string is not quoted, causing schema confusion with other JSON types like JSON object.
For example, if a JSON column in postgres table has these following rows, with one containing a JSON string and the other containing a JSON array:
col
---
"[1, 2, 3]"
---
[1, 2, 3]
Both will be stored as [1, 2, 3] as STRING column in databricks. This make it impossible to tell them apart.
The correct behavior should be that the string is quoted. This is the standard behavior in otter ETL tool (like Estuary) or in the programming world, like in Python
>>> json.dumps("123")
'"123"'
>>> print(json.dumps("123"))
"123"
(notice that it is quoted and not just 123)
Why this is impacting us:
- We store text AI chat messages as JSON strings (like "how are you"), and more complicated messages as objects (like { "type": "image", "url": "http://..." }). We can't work on the data if we can't differentiate if the source value is a JSON string or array. Even worse, the user can send a string representing json object, and we'll parse that as JSON object, which is dangerous
Please sign in to leave a comment.
Comments
0 comments