hive与presto解析json数组并拆分为多行
json
适用于现代 C++ 的 JSON。
项目地址:https://gitcode.com/gh_mirrors/js/json

·
案例需求
现有如下json字符串数组,需要把其拆分多行
[{
"tabid": "新人专享",
"id": 100015129300,
"channelid": "APP",
"ass_rule": {
"lv1": "ass",
"lv2": "new1"
}
}, {
"tabid": "新人专享",
"id": 100006377942,
"channelid": "APP",
"uas_rule": {
"lv1": "ass",
"lv2": "new2"
}
}, {
"tabid": "新人专享",
"id": 100014352353,
"channelid": "APP",
"ass_rule": {
"lv1": "ass",
"lv2": "new3"
}
}]
一、hive用法
1、操作分析
分析:没有别的好办法,肯定提取“[]”之间的json字符串之后、后进行分割,逗号分割不行,逗号太多,而我们只想两个大括号之间的逗号做分割“ },{”,所以要把逗号替换成json数组中不存在的其它字符,之后再进行分割,如下,五步走:
1、regexp_extract提取[]内的json字符串
2、regexp_replace替换为逗号为自定义分隔符
3、split根据自定义分隔符分割
4、explode行转列函数
5、get_json_object解析json取字段值
2、完整SQL
SELECT
get_json_object(str_json, '$.id') AS sid
FROM
(
SELECT
event_param_json,
split(regexp_replace(regexp_extract(event_param_json, '(\\[)(.*?)(\\])', 2), '\\},\\{', '\\}#\\{'), '\\#') AS json_list
FROM
abm.abm_wireless_exposure_log
WHERE
dt = '2020-09-06'
AND event_id = 'ProExpo'
AND event_param_json LIKE '%ass%'
)
a lateral VIEW explode(json_list) list_tab AS str_json
或者如下SQL都可以(两种SQL只是提取中括号[]间json字符串的正则不一样)
SELECT
GET_JSON_OBJECT(json_str, '$.id') AS sid
FROM
(
SELECT
json_str
FROM
(
SELECT
split(regexp_replace(regexp_extract(event_param_json -- 获取data数组,格式[{json},{json}]
, '^\\[(.+)\\]$', 1) -- 删除字符串前后的[],格式{json},{json}
, '\\}\\,\\{', '\\}\\|\\|\\{') -- 将josn字符串中的分隔符代换成||,格式{json}||{json}
, '\\|\\|') AS json_list
FROM
abm.abm_wireless_exposure_log
WHERE
dt = '2020-09-06'
AND page_id = 'ManChannel'
-- and event_id = 'RecProExpo'
AND event_param_json LIKE '%ass%'
)
a lateral VIEW explode(json_list) list_tab AS json_str
)
t
如图,运行中间sql,已经分割为多行单个json:
二、presto用法
1、操作分析
是(replace(replace(replace(event_param_json, '[', ''), ']', ''), '},{', '}#{'), '#')
分析:上面hive的正则在presto提取不到,所以json提起就换种方法。两次替换掉json中括号,后进行分割,逗号分割不行,逗号太多,而我们只想两个大括号之间的逗号做分割“ },{”,所以要把逗号替换成json数组中不存在的其它字符,之后再进行分割,如下,五步走:
1、 两次replace替换掉左右中括号[]
2、replace替换为逗号为自定义分隔符
3、split根据自定义分隔符分割
4、unnest行转列函数
5、json_extract_scalar解析json取字段值
2、完整SQL
SELECT
str_json,
json_extract_scalar(str_json, '$.id') AS sid
FROM
(
SELECT
event_param_json
FROM
abm.abm_wireless_exposure_log
WHERE
dt = '2020-09-06'
AND event_id = 'NecProExpo'
AND event_param_json LIKE '%ass%'
)
CROSS JOIN unnest(SPLIT(REPLACE(REPLACE(REPLACE(event_param_json, '[', ''), ']', ''), '},{', '}#{'), '#')) AS t(str_json)
运行结果如下图,json数组就分割为单个json,提取其中某个值了




适用于现代 C++ 的 JSON。
最近提交(Master分支:1 天前 )
dff2b475
Adds pre-multiplication overflow detection to catch cases where dimension
products would exceed size_t max. The previous check only detected when
overflow resulted in exactly 0 or SIZE_MAX, missing other cases.
Retains the original post-multiplication check for backward compatibility.
Adds tests verifying overflow detection with dimensions (2^32+1)×(2^32),
which previously overflowed silently to 2^32.
This prevents custom SAX handlers from receiving incorrect array sizes
that could lead to buffer overflows.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> 16 小时前
eef76c20
* :white_check_mark: add test for C++20 modules
Signed-off-by: Niels Lohmann <mail@nlohmann.me>
* :rotating_light: fix warning
Signed-off-by: Niels Lohmann <mail@nlohmann.me>
* Add missing header (#4763)
* :bug: add missing header
Signed-off-by: Niels Lohmann <mail@nlohmann.me>
* :rotating_light: fix warning
Signed-off-by: Niels Lohmann <mail@nlohmann.me>
* :rotating_light: fix warning
Signed-off-by: Niels Lohmann <mail@nlohmann.me>
---------
Signed-off-by: Niels Lohmann <mail@nlohmann.me>
---------
Signed-off-by: Niels Lohmann <mail@nlohmann.me> 1 天前
更多推荐
所有评论(0)