Fetching JSON data with curl and importing it into a doubly partitioned Hive external table: solving scientific-notation and special-character problems
Shell script:
hadoop dfs -mkdir /jzytest/hivePublicOpinionList/
# Create the Hive external table with two partition columns
hive -e 'create external table if not exists jzyPublicOpinionListjson
(
channel string,
publishTime string,
source string,
title string,
url string
)
partitioned by (topic_id string,dt string)
row format serde "org.apache.hive.hcatalog.data.JsonSerDe"
LOCATION "hdfs:///jzytest/hivePublicOpinionList/";'
channel_array=("21" "104" "303")
dt=$(date "+%Y-%m-%d")
for topic_id in $(cat jzyhive.out)
do
hadoop dfs -mkdir -p /jzytest/hivePublicOpinionList/topic_id=${topic_id}/dt=${dt}
for channel in ${channel_array[*]}
do
curl -X POST -d '{
"version":"1.0",
"caller":"pxxx",
"componentName":"pxxx",
"password":"pxxx",
"callee":"tencent_poa",
"eventId":293194931,
"seqId":"1481860426.8432320581878515392",
"spanId":"pxxx",
"timestamp":1481860426,
"interface":{
"interfaceName":"qcloud.supervision.getPublicOpinionList",
"para":{
"topic_id": "'$topic_id'",
"channel": "'$channel'",
"page": 1,
"page_size": 30
}
}
}' http://2xx.1xx.xx2.xx4:9090/interface | awk -v head=":[" -v tail="]}" '{print substr($0, index($0,head)+length(head),index($0,tail)-index($0,head)-length(head))}' > jzytopic_id${topic_id}channel${channel}.out
sed -i "s/},/}\n/g" jzytopic_id${topic_id}channel${channel}.out
sed -i "s/《/<</g" jzytopic_id${topic_id}channel${channel}.out
sed -i "s/》/>>/g" jzytopic_id${topic_id}channel${channel}.out
hadoop dfs -put jzytopic_id${topic_id}channel${channel}.out /jzytest/hivePublicOpinionList/topic_id=${topic_id}/dt=${dt}/
rm -rf jzytopic_id${topic_id}channel${channel}.out
done
# Quote the partition value: without the single quotes a purely numeric
# topic_id is rendered in scientific notation.
hive -e "alter table jzyPublicOpinionListjson add partition (topic_id='$topic_id',dt='$dt');"
done
The awk step (highlighted in red in the original post) handles the response format: the body looks like {..."data":[{...},{...}]}, and awk slices out the content between ":[" and "]}", i.e. the bare array of objects.
The first sed turns the "," separating the objects into "\n": commas as record separators cause problems in Hive, while Hive reads one record per line by default, so after this step no custom delimiter needs to be declared when creating the table.
The second pair of sed commands replaces the Chinese book-title marks 《》; without this special handling, loading fails with the following error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
The third fix is the pair of single quotes around the partition value in the alter table statement, which prevents the value from being turned into scientific notation when the external partition is loaded.
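Taken in isolation, the awk and sed steps above can be exercised on a sample response. The JSON below is made up for illustration; the delimiters ":[" and "]}" are the same ones the script keys on, and the sed behavior assumes GNU sed (which expands \n in the replacement).

```shell
# Sample response body (made-up values, same shape as the real API reply).
response='{"version":"1.0","data":[{"channel":"21","title":"《test》"},{"channel":"104","title":"t2"}]}'

# Step 1: awk slices out everything between ":[" and "]}", i.e. the object array.
extracted=$(echo "$response" | awk -v head=":[" -v tail="]}" \
  '{print substr($0, index($0,head)+length(head), index($0,tail)-index($0,head)-length(head))}')

# Step 2: sed puts each comma-separated object on its own line and
# replaces the book-title marks that break the JsonSerDe.
rows=$(echo "$extracted" | sed -e 's/},/}\n/g' -e 's/《/<</g' -e 's/》/>>/g')
echo "$rows"
```

The result is one JSON object per line, which is exactly what the line-oriented JsonSerDe expects.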
Summary:
1. Filter the fetched JSON with awk down to just the data you need.
2. Special characters in the data, such as the book-title marks 《》, must be handled.
3. The fields in the data must correspond to the Hive table columns, and special fields (such as local) need extra handling; columns that cannot be matched show up as NULL.
4. If a partition value is a pure number, wrap it in single quotes ('') in the alter ... add partition statement; otherwise it comes out in scientific notation, e.g. topic_id=14538744803109468122.
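One extra safeguard worth sketching (not part of the original script; the helper name is made up, and python3 on the PATH is assumed): validate that every line of the generated file parses as JSON before the hadoop dfs -put, so a bad row fails fast locally instead of surfacing later as a SerDeException.

```shell
# Hypothetical helper, not in the original script: check that every
# non-blank line of a file is valid JSON; return non-zero on the first
# bad line so the upload can be skipped.
validate_json_lines() {
  local file="$1"
  while IFS= read -r line; do
    [ -z "$line" ] && continue   # skip blank lines
    echo "$line" | python3 -c 'import json,sys; json.loads(sys.stdin.read())' \
      || { echo "bad JSON line in $file: $line" >&2; return 1; }
  done < "$file"
}
```

It would be called right before the hadoop dfs -put, e.g. validate_json_lines jzytopic_id${topic_id}channel${channel}.out || continue.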