My 300-Round Battle with YAML Data Conversion

Webmaster

July 30, 2023, 00:30 · 71 views

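Background

The requirement: we need to provide the ability to append further key: value pairs after a specific key: value pair in YAML data, and users can choose manually whether to add them.

For example, when feature_pack_key = xxx, add the supplementary data feature_pack_version: 1 and feature_set_version: 2.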

# Case:

retrievers:
- feature_core:
    cluster: live
    feature_pack_key: xxx

# After a normal insertion:

retrievers:
- feature_core:
    cluster: live
    feature_pack_key: xxx
    feature_pack_version: 1
    feature_set_version: 2

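At first glance this looks simple, but ... 😅

Challenges

JS loses precision on large numbers

The business data contains many large int values (value: 100010000011111111111).

Converting YAML to an object with a library such as js-yaml loses precision (the value becomes 100010000011111100000), because a JS number is a 64-bit float and cannot represent integers above 2^53 - 1 exactly.

The business side definitely can't accept that: the data is simply gone.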
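To make the failure mode concrete, here is a minimal sketch in plain JavaScript with no YAML library involved. It only assumes that the value arrives as text, the way it does inside a YAML file; the variable names are illustrative.

```javascript
// The large integer from the example above, as it appears in the YAML text.
const raw = '100010000011111111111';

// Converting to a JS Number forces the value into a 64-bit float.
const asNumber = Number(raw);
// Anything beyond Number.MAX_SAFE_INTEGER (2^53 - 1) may be rounded.
const isSafe = Number.isSafeInteger(asNumber);

// BigInt keeps every digit, so the text survives a round trip.
const asBigInt = BigInt(raw);
const roundTripped = asBigInt.toString();

console.log(isSafe);               // false
console.log(roundTripped === raw); // true
```

Checking `Number.isSafeInteger` before trusting a parsed value is a cheap way to notice that a document contains integers a plain Number cannot hold.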

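Our previous approach was to go all in on string functions and regex matching! But for complex data structures this is very unstable and dangerous, and the code is extremely costly to understand.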

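// Adds, updates, or removes the `version: N` line that follows a given
// `handler: <name>` line inside a `- template_handler:` block.
function handle_conf(str, handler, value, isSelect, deleteAll) {
    // Inside a template literal, \d must be written as \\d, and the handler
    // name must be interpolated with ${handler} rather than escaped.
    const reg = new RegExp(`((\x20*)(-\x20template_handler:\x20*)(\n\x20+))((handler:\x20${handler}\x20*)(\n?\x20*))((version:\x20\\d+\x20*)(\n?\x20*))?`, "g");
    // Replacement text: empty when no value is given (0 still counts as a value).
    const rep = !value && value !== 0 ? '' : `version: ${value}`;
    const res = str.replace(reg, (match, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10) => {
        let temp = '';
        if (deleteAll) {
            // Drop the whole block and keep only the leading indentation.
            return `${p2}`;
        }
        if (!p8) {
            // No existing version line: append one after the handler line.
            temp = rep ? `\n${p2}` : '';
            const _p5 = rep ? p6 + p4 : p5;
            return `${p1}${_p5}${rep}${temp}`
        } else {
            // Existing version line: replace it, or remove it when rep is empty.
            const _p5 = rep ? p6 + p4 : `${p6}\n${p2}`;
            const _p10 = rep ? p10 : '';
            return `${p1}${_p5}${rep}${_p10}`
        }
    })
    return res;
}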

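Key order lost when converting through a server-side API

If JS is what loses the precision, just bypass JS, right?

But the conversion uses a map along the way, so once the server-side API has processed the data, the map keys come back out of order!

The business side can't accept this either, because it makes the data diff dashboard very messy.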
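A rough illustration of why reordering hurts even though the data is unchanged: a line-based diff compares text positions, not meaning. The sketch below uses a hypothetical `countChangedLines` helper and a made-up `timeout` key; it is not the real diff dashboard.

```javascript
// The same three key/value pairs, serialized in two different key orders.
const before = ['cluster: live', 'feature_pack_key: xxx', 'timeout: 30'];
const after = ['timeout: 30', 'cluster: live', 'feature_pack_key: xxx'];

// Hypothetical helper: count positions where the two versions differ,
// which is roughly what a naive line-by-line diff would highlight.
function countChangedLines(a, b) {
  const length = Math.max(a.length, b.length);
  let changed = 0;
  for (let i = 0; i < length; i++) {
    if (a[i] !== b[i]) changed++;
  }
  return changed;
}

// Order-insensitive comparison: sort copies of both sides before comparing.
const sameData = [...before].sort().join('\n') === [...after].sort().join('\n');
const changedLines = countChangedLines(before, after);

console.log(sameData, changedLines); // true 3
```

Every line is flagged as changed although nothing was edited, which is the kind of noise that hides the one real change a reviewer cares about.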

Finding a YAML library that works

Persistence paid off: I found this JS library -> official site: eemeli.org/yaml/#yaml（…

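Through a configuration option it can automatically convert integers to BigInt for us, so we no longer have to worry about losing precision, and it doesn't lose the original key order of the YAML either.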

import { parse, stringify } from 'yaml'

parse('number: 999')
// { number: 999 }
parse('number: 999', { intAsBigInt: true })
// { number: 999n }
parse('number: 999', { schema: 'failsafe' })
// { number: '999' }

Comments lost when using the conversion functions

But the good times didn't last: after conversion, all of the users' comments were gone!

# output_adapters is need xxx
output_adapters:
# risk_predict is need xxx
- risk_predict:
    cluster: live
    # version：10
    config_key: live_auxiliary

# After conversion:

output_adapters:
- risk_predict:
    cluster: live
    config_key: live_auxiliary

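So simply calling the functions the open-source library has packaged up can't solve our problem 💦

Picking the data structure apart bit by bit with the CST

CST: concrete syntax tree (it turns the details of the YAML data into a clear tree structure that is easy to inspect and modify)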

An example:

[Figure: example CST for a YAML snippet]
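One way to picture what "concrete" means here: every character of the input, comments and whitespace included, is owned by some token, so joining the tokens back together reproduces the text exactly. The token list below is hand-written for illustration and is much simpler than what the yaml library actually produces.

```javascript
const source = '# note\nkey: 1\n';

// Hand-written, simplified tokens (illustrative only, not the library's real output).
const tokens = [
  { type: 'comment', source: '# note' },
  { type: 'newline', source: '\n' },
  { type: 'scalar', source: 'key' },
  { type: 'map-value-ind', source: ':' },
  { type: 'space', source: ' ' },
  { type: 'scalar', source: '1' },
  { type: 'newline', source: '\n' },
];

// Because nothing is discarded, concatenating the sources is lossless.
const rebuilt = tokens.map((token) => token.source).join('');

// An edit changes one token and leaves the comment and spacing untouched.
const edited = tokens
  .map((token) => (token.source === '1' ? { ...token, source: '2' } : token))
  .map((token) => token.source)
  .join('');

console.log(rebuilt === source);            // true
console.log(edited === '# note\nkey: 2\n'); // true
```

This is the property that an object-based parse and stringify cycle gives up: once the text becomes a plain object, the comment has nowhere to live.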

So I handled it as follows:

import { CST, Parser } from 'yaml'

// `yaml` is the source text; `changeList` and `indexInChangeList` come from the surrounding code.
const [doc] = new Parser().parse(yaml) // parse the YAML into CST form

CST.visit(doc, (item, path) => { // walk the CST
    if (!CST.isScalar(item.value)) return
    if (item?.key?.source === 'feature_pack_key') { // found the target key
        const currentList = CST.visit.parentCollection(doc, path) // collection holding the target key and its siblings
        const idx = path[path.length - 1][1] // index of the target key's item within that collection
        const { indent } = item.value // indent of the target key's node
        const change = changeList[indexInChangeList]

        // Insert the new key: value pairs (createScalarToken takes string values)
        currentList.items.splice(idx + 1, 0, {
            start: item.start.slice(),
            value: CST.createScalarToken(String(change?.feature_pack_version), { indent }),
            sep: item.sep.slice(),
            key: CST.createScalarToken('feature_pack_version', { end: [], indent })
        }, {
            start: item.start.slice(),
            value: CST.createScalarToken(String(change?.feature_set_version), { indent }),
            sep: item.sep.slice(),
            key: CST.createScalarToken('feature_set_version', { end: [], indent })
        })
        return idx + 2 // index at which the traversal continues, since the collection just grew
    }
})

Lost indentation

After running the logic above, I found that it worked in some scenarios, but in others the newly added fields lost their indentation!

# Normal case:

retrievers:
- feature_core:
    cluster: live
    feature_pack_key: xxx

# After a normal insertion:

retrievers:
- feature_core:
    cluster: live
    feature_pack_key: xxx
    feature_pack_version: 1
    feature_set_version: 2

# Special case:

retrievers:
- feature_core:
    feature_pack_key: xxx
    cluster: live

# Special case after insertion:

retrievers:
- feature_core:
    feature_pack_key: xxx
feature_pack_version: 1
feature_set_version: 2
    cluster: live

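We already set the required indent value above, so why is the indentation still lost?

After some digging I found that a line's indentation depends not only on the indent in the key but also on the start field. In a standard CST the first line's start is [], while every later line has its own fixed value.

So the bug plays out like this: when the target key is on the first line and we give the inserted elements the same start, they also get [], which causes the lost indentation. When the target key is not on the first line, its start has a normal value, and giving that same value to the inserted elements causes no problem.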

// start of the last sibling at the same level as the target key
let { start } = currentList.items[currentList.items.length - 1]

// If it is [] (only one element at this level), build start by hand; otherwise reuse it
if (start.length === 0) {
    start = [{ indent: 0, source: ' '.repeat(indent), type: 'space' }]
}

// ...
// Insert the new key: value pairs, using the new start value
currentList.items.splice(idx + 1, 0, {
    start: start.slice(),
    value: CST.createScalarToken(String(changeList[indexInChangeList]?.feature_pack_version), { indent }),
    sep: item.sep.slice(),
    key: CST.createScalarToken('feature_pack_version', { end: [], indent })
}, {
    start: start.slice(),
    value: CST.createScalarToken(String(changeList[indexInChangeList]?.feature_set_version), { indent }),
    sep: item.sep.slice(),
    key: CST.createScalarToken('feature_set_version', { end: [], indent })
})
// ...

So I added logic here to compute start manually and keep the indentation.
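The fallback can be isolated into a small helper, which makes it easy to check on its own. This sketch works on plain objects shaped like the start arrays described above; `resolveStart` is a hypothetical name, and the token fields mirror the ones used in the snippet rather than a documented contract.

```javascript
// Hypothetical helper: pick a `start` array for a newly inserted pair.
function resolveStart(siblingStart, indent) {
  if (siblingStart.length > 0) {
    // Copy so the new item does not share an array with its sibling.
    return siblingStart.slice();
  }
  // A first-line item carries no leading whitespace token, so build one.
  return [{ type: 'space', indent: 0, source: ' '.repeat(indent) }];
}

// Target key on the first line: its start is [], so whitespace is generated.
const fromFirstLine = resolveStart([], 4);

// Target key on a later line: the existing whitespace token is reused.
const existing = [{ type: 'space', indent: 0, source: '    ' }];
const fromLaterLine = resolveStart(existing, 4);

console.log(fromFirstLine[0].source.length); // 4
console.log(fromLaterLine !== existing);     // true
```

Returning a copy rather than the sibling's own array keeps later edits to one item from leaking into another.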

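A special bug caused by comment position

During development I found a special bug: when comments appear before the first element of an array (feature_core), the conversion drops every comment except the first line (# todo this is draft is lost) and also drops the newline at the end of the first comment line (# juno), so the first array key and the comment end up on one line! That directly breaks the YAML structure.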

retrievers:
# juno
# todo this is draft
- feature_core:
    # todo
    # cluster: video_cluster
    cluster: live

# After conversion:

retrievers:
# juno- feature_core:
    # todo
    # cluster: video_cluster
    cluster: live

# After processing with specialDealWithArrayAndCommentInYAML:

retrievers:
# juno
- feature_core:
    # todo
    # cluster: video_cluster
    cluster: live

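While troubleshooting I looked around and concluded that this is probably a bug in the yaml library. My approach is to recursively traverse the whole CST and, when the last element of a line is a comment with no trailing newline, add the newline by hand. This keeps the YAML structure from being broken, but it can't bring back the lost comments, so it's a stopgap. A complete fix still has to wait for the open-source library to fix the bug itself!

Simple workaround: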

// The yaml library has a bug with comments: if comments sit above the first item
// of an array, the last comment loses its trailing '\n'.
function specialDealWithArrayAndCommentInYAML(doc) {
    const items = doc?.value?.items ?? []
    for (const item of items) {
        const sep = item?.sep
        const last = sep?.[sep.length - 1]
        // If the separator tokens end with a comment, append the missing newline token.
        // Spread the comment token itself (not its type string) to carry over its other fields.
        if (last?.type === 'comment') {
            sep.push({ ...last, source: '\n', type: 'newline' })
        }
        // Recurse into nested collections
        specialDealWithArrayAndCommentInYAML(item)
    }
}
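To see the effect of the workaround on a single separator list, here is a small sketch on hand-built tokens. `ensureTrailingNewline` and `render` are hypothetical helpers, and the token shapes are simplified stand-ins for the library's real CST tokens.

```javascript
// Hypothetical helper: if a separator list ends with a comment token,
// append a newline token so the next item starts on its own line.
function ensureTrailingNewline(sep) {
  const last = sep[sep.length - 1];
  if (last && last.type === 'comment') {
    sep.push({ ...last, type: 'newline', source: '\n' });
  }
  return sep;
}

// Join token sources back into text, the way a CST is turned into a string.
const render = (sep) => sep.map((token) => token.source).join('');

// Simplified tokens after `retrievers`, ending in a comment with no newline.
const sep = [
  { type: 'map-value-ind', source: ':' },
  { type: 'newline', source: '\n' },
  { type: 'comment', source: '# juno' },
];

const beforeFix = render(sep) + '- feature_core:';
const afterFix = render(ensureTrailingNewline(sep)) + '- feature_core:';

console.log(JSON.stringify(beforeFix)); // ":\n# juno- feature_core:"
console.log(JSON.stringify(afterFix));  // ":\n# juno\n- feature_core:"
```

The helper only restores the line break. A comment that was already dropped before this step is not present in the tokens and cannot be recovered this way.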

Afterword

Wrestling with YAML cost me at least 1000 hairs

Reposted from: https://juejin.cn/post/7260389344669335608