Regular Expression Extraction

Last updated: 2024-11-07 11:40:11
    The data processing feature of CKafka Connector can extract message content with regular expressions. Extraction is based on the open-source regular expression package RE2.
    Java's standard regular expression package java.util.regex and other widely used regular expression engines, such as PCRE, Perl (perlre), and Python (re), use a backtracking strategy. That is, when a pattern offers two alternatives a|b, the engine first tries to match a. If that fails, it rewinds the input and tries to match b.
    If the pattern is deeply nested, this strategy can require an exponential number of passes over the input. For a very long input string, the matching time can become impractically long.
    In contrast, RE2J (the Java port of RE2) uses a nondeterministic finite automaton (NFA) to track all possible matches in a single pass over the input, so matching completes in time linear in the input length.
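The difference between the two strategies can be illustrated with a toy sketch. The code below is not how RE2 or java.util.regex is implemented. It hard-codes a single pattern, ^(a|aa)*c$ (chosen here only for illustration), and counts the work done by a naive backtracking matcher and by a matcher that advances the whole set of reachable NFA states at once. The function names and state labels are hypothetical.

```python
def backtracking_match(text):
    """Match ^(a|aa)*c$ by trying one alternative at a time and rewinding."""
    steps = 0

    def match_from(pos):
        nonlocal steps
        steps += 1
        if text[pos:] == "c":  # the repetition ends and the final 'c' closes the match
            return True
        # Alternative 1: consume "a". If the rest fails, rewind to pos.
        if text.startswith("a", pos) and match_from(pos + 1):
            return True
        # Alternative 2: consume "aa" from the same position.
        if text.startswith("aa", pos) and match_from(pos + 2):
            return True
        return False

    return match_from(0), steps


def state_set_match(text):
    """Match ^(a|aa)*c$ by advancing every possible NFA state together."""
    steps = 0
    states = {"S"}  # S: between repetitions, M: halfway through "aa", F: after 'c'
    for ch in text:
        next_states = set()
        for state in states:
            steps += 1
            if state == "S" and ch == "a":
                next_states |= {"S", "M"}  # "a" alone, or the first half of "aa"
            elif state == "M" and ch == "a":
                next_states.add("S")       # the second half of "aa"
            elif state == "S" and ch == "c":
                next_states.add("F")       # the final 'c'
        states = next_states
    return "F" in states, steps


for n in (10, 20):
    text = "a" * n  # no trailing 'c', so every alternative has to be ruled out
    print(n, backtracking_match(text)[1], state_set_match(text)[1])
```

In this sketch the backtracking step count grows roughly like the Fibonacci sequence as the input gets longer, while the state-set matcher never examines more than three states per input character.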
    Regular expression extraction is well suited to extracting specific fields from messages that contain long arrays. Some common extraction patterns are described below.

    Example 1: Extracting the Phone Number Field

    Input message:
    {"message":
    [
    {"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
    {"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
    {"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
    ]
    }
    Output message:
    {
    "0": "\"phoneNumber\":\"13890000000\"",
    "1": "\"phoneNumber\":\"15920000000\"",
    "2": "\"phoneNumber\":\"18830000000\""
    }
    The regular expression used is:
    "phoneNumber":"(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}"
    
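As a rough local check of Example 1, the extraction can be reproduced with Python's standard re module. This is only a sketch: Python's re is a backtracking engine rather than RE2, the helper name extract_matches is hypothetical, and the connector itself is configured in the console rather than through code. The digit classes are written as ranges such as [0-35-9], since a | inside square brackets would match a literal pipe character.

```python
import re

# Pattern from Example 1, with the digit lists written as character ranges.
PHONE_FIELD = re.compile(
    r'"phoneNumber":"(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}"'
)


def extract_matches(pattern, text):
    """Return every full match keyed by its index, mirroring the output message."""
    return {str(i): m.group(0) for i, m in enumerate(pattern.finditer(text))}


# The input message as raw text; the regular expression runs on the text itself.
message = (
    '{"message":['
    '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
    '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
    '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ']}'
)

result = extract_matches(PHONE_FIELD, message)
for key, value in result.items():
    print(key, value)
```

Using m.group(0) returns the whole matched text, including the "phoneNumber": prefix, rather than only the parenthesized prefix digits.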

    Example 2: Extracting the Email Field

    Input message:
    {"message":
    [
    {"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
    {"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
    {"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
    ]
    }
    Output message:
    {
    "0": "\"email\":\"123456@qq.com\"",
    "1": "\"email\":\"123456789@163.com\"",
    "2": "\"email\":\"usr333@gmail.com\""
    }
    The regular expression used is:
    "email":"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
    
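The escaped quotation marks in the output of Example 2 come from JSON serialization: each match contains literal double quotes, which must be escaped when the match is stored as a JSON string value. The sketch below shows this with Python's standard re and json modules. It is an approximation for local experimentation, not the connector's implementation; Python's re is a backtracking engine rather than RE2, and the helper name extract_matches is hypothetical.

```python
import json
import re

# Pattern from Example 2: the "email" key followed by a quoted address.
EMAIL_FIELD = re.compile(r'"email":"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"')


def extract_matches(pattern, text):
    """Return every full match keyed by its index, mirroring the output message."""
    return {str(i): m.group(0) for i, m in enumerate(pattern.finditer(text))}


message = (
    '{"message":['
    '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
    '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
    '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ']}'
)

result = extract_matches(EMAIL_FIELD, message)

# Serializing to JSON escapes the inner quotes, which yields the \" sequences
# seen in the output message.
print(json.dumps(result, indent=2))
```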

    Example 3: Extracting the ID Number Field

    Input message:
    {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message": "{\"email\":\"123456@qq.com\",\"phoneNumber\":\"13890000000\",\"IDNumber\":\"130423199301067425\"},{\"email\":\"123456789@163.com\",\"phoneNumber\":\"15920000000\",\"IDNumber\":\"610630199109235723\"},{\"email\":\"usr333@gmail.com\",\"phoneNumber\":\"18830000000\",\"IDNumber\":\"42060219880213301X\"}"
    }
    Output message (the other fields are retained, and the N IDNumber values in message are extracted as separate fields):
    {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message.0": "130423199301067425",
    "message.1": "610630199109235723",
    "message.2": "42060219880213301X"
    }
    The regular expression used is:
    [1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]
    Multiple processing chains are used. The result of the first processing chain is as follows:
    [Figure: result of the first processing chain]
    The message field needs to be processed further. The result of the second processing chain is as follows:
    [Figure: result of the second processing chain]
    Processing result:
    {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message.0": "130423199301067425",
    "message.1": "610630199109235723",
    "message.2": "42060219880213301X"
    }
    The required N IDNumber fields are extracted, the original message field is deleted, and other fields such as operation are retained.
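The overall effect of Example 3 can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the connector's processing chain: the function name extract_into_indexed_fields is hypothetical, the two chains are collapsed into a single step, and Python's re module stands in for RE2.

```python
import re

# Pattern from Example 3: an 18-character ID number with a birth date inside.
ID_NUMBER = re.compile(
    r"[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))"
    r"(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]"
)


def extract_into_indexed_fields(record, field, pattern):
    """Replace record[field] with one '<field>.<i>' entry per match of pattern."""
    # Keep every field except the one being processed.
    output = {key: value for key, value in record.items() if key != field}
    # Add each match as a separate indexed field.
    for i, match in enumerate(pattern.finditer(record[field])):
        output[f"{field}.{i}"] = match.group(0)
    return output


record = {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message": (
        '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
        '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
        '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ),
}

result = extract_into_indexed_fields(record, "message", ID_NUMBER)
for key, value in result.items():
    print(key, value)
```

The 11-digit phone numbers in the same message are not picked up, because the pattern requires a century prefix (18, 19, or 20) after the first six digits and 18 characters in total.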
    