
Regular Expression Extraction

Last updated: 2024-11-07 11:40:11
    The data processing feature of CKafka Connector provides the capability to extract message content based on regular expressions. Regular expression extraction uses the open-source regular expression package re2.
    Java's standard regular expression package java.util.regex and other widely used regular expression engines, such as PCRE, Perl, and Python (re), use a backtracking strategy. That is, when a pattern offers two alternatives a|b, the engine first tries to match a. If that match fails, it resets the input position and tries to match b.
    If the pattern is deeply nested, this strategy can require an exponential number of passes over the input data. If the input string is very long, the matching time can become impractically long.
    In contrast, the RE2J algorithm uses a nondeterministic finite automaton (NFA) to check all possible matches in a single pass over the input data, so regular expression matching completes in time linear in the input length.
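    The single-pass idea can be illustrated with a minimal sketch. It is not the RE2J implementation: the NFA below is hand-built for the illustrative pattern (a|b)*abb, and the names TRANSITIONS and nfa_fullmatch are hypothetical. The simulation keeps the set of all states the NFA could be in and advances that whole set once per input character, so it never rewinds the input.

```python
# Hand-built NFA for the illustrative pattern (a|b)*abb.
# This is an assumption made for the sketch, not RE2J's internal representation.
TRANSITIONS = {
    (0, "a"): {0, 1},  # stay in the (a|b)* loop, or guess that "abb" starts here
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START_STATE = 0
ACCEPT_STATE = 3


def nfa_fullmatch(text):
    """Return True if the whole of `text` matches (a|b)*abb.

    Every character is read exactly once. The work per character is bounded
    by the number of NFA states, so the total time is linear in len(text).
    """
    current = {START_STATE}
    for ch in text:
        following = set()
        for state in current:
            following |= TRANSITIONS.get((state, ch), set())
        current = following
        if not current:
            # No state survived this character, so no match is possible.
            return False
    return ACCEPT_STATE in current


print(nfa_fullmatch("aababb"))
print(nfa_fullmatch("abab"))
```

    A backtracking engine would instead commit to one alternative, and on failure return to an earlier input position to try the next one. Tracking all candidate states together is what removes the need to re-read the input.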
    In data processing, regular expression extraction is suited to extracting specific fields from messages that contain long arrays. Some common extraction patterns are described below.

    Example 1: Extracting the Phone Number Field

    Input message:
    {"message":
    [
    {"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
    {"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
    {"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
    ]
    }
    Output message:
    {
    "0": "\"phoneNumber\":\"13890000000\"",
    "1": "\"phoneNumber\":\"15920000000\"",
    "2": "\"phoneNumber\":\"18830000000\""
    }
    The regular expression used is:
    "phoneNumber":"(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}"
    
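    As a rough illustration of Example 1, the sketch below applies the phone number pattern to the sample message and numbers the matches from 0. It makes several assumptions: Python's re module stands in for RE2 (it is a backtracking engine, but it yields the same matches for this pattern and input), the function name extract_matches is hypothetical, and the character classes are written in compact form such as [57]. It is not the connector's own implementation.

```python
import json
import re

# Phone number pattern from Example 1, with compact character classes.
PHONE_PATTERN = re.compile(
    r'"phoneNumber":"(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}"'
)


def extract_matches(text, pattern):
    """Return every full match of `pattern` in `text`, keyed by its index."""
    # finditer + group(0) yields the whole match; the capture group inside
    # the pattern is only used for alternation.
    return {str(i): m.group(0) for i, m in enumerate(pattern.finditer(text))}


# Sample message from Example 1, treated as raw text.
raw_message = (
    '{"message":['
    '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
    '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
    '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ']}'
)

extracted = extract_matches(raw_message, PHONE_PATTERN)
# Serializing to JSON escapes the inner double quotes.
print(json.dumps(extracted, indent=2))
```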

    Example 2: Extracting the Email Field

    Input message:
    {"message":
    [
    {"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
    {"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
    {"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
    ]
    }
    Output message:
    {
    "0": "\"email\":\"123456@qq.com\"",
    "1": "\"email\":\"123456789@163.com\"",
    "2": "\"email\":\"usr333@gmail.com\""
    }
    The regular expression used is:
    "email":"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
    
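    To see which values the email pattern accepts, the sketch below runs it against a few fragments. The fragments stored in quoted, dotted, and missing_at are hypothetical samples and find_email_fields is an illustrative name. Python's re module is used here as a stand-in for RE2, on the assumption that both treat this pattern the same way for ASCII input.

```python
import re

# Email pattern from Example 2. It expects the value to be wrapped in
# double quotes, as in "email":"usr333@gmail.com".
EMAIL_PATTERN = re.compile(
    r'"email":"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"'
)


def find_email_fields(text):
    """Return the full text of every "email":"..." field found in `text`."""
    # re.findall would return the capture groups instead of the whole
    # match, so finditer with group(0) is used.
    return [m.group(0) for m in EMAIL_PATTERN.finditer(text)]


# Hypothetical sample fragments.
quoted = '{"email":"usr333@gmail.com","phoneNumber":"18830000000"}'
dotted = '{"email":"first.last+tag@mail.example.com"}'
missing_at = '{"email":"not-an-address.example.com"}'

for sample in (quoted, dotted, missing_at):
    print(find_email_fields(sample))
```

    The local part may contain segments joined by -, + or ., the domain may contain segments joined by - or ., and at least one dot must follow the @ sign. A value without an @ sign produces no match.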

    Example 3: Extracting the ID Number Field

    Input message:
    {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message": "{\"email\":\"123456@qq.com\",\"phoneNumber\":\"13890000000\",\"IDNumber\":\"130423199301067425\"},{\"email\":\"123456789@163.com\",\"phoneNumber\":\"15920000000\",\"IDNumber\":\"610630199109235723\"},{\"email\":\"usr333@gmail.com\",\"phoneNumber\":\"18830000000\",\"IDNumber\":\"42060219880213301X\"}"
    }
    Output message, which retains the other fields and extracts the N IDNumber values from the message field into separate fields:
    {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message.0": "130423199301067425",
    "message.1": "610630199109235723",
    "message.2": "42060219880213301X"
    }
    The regular expression used is:
    [1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]
    Multiple processing chains are used. The first processing chain produces an intermediate result, and the message field is then further processed by the second processing chain.

    Processing result:
    {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message.0": "130423199301067425",
    "message.1": "610630199109235723",
    "message.2": "42060219880213301X"
    }
    The required N IDNumber fields are extracted, the original message field is deleted, and other fields such as operation are retained.
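
    The overall effect of Example 3 can be approximated with the sketch below. It is only an illustration under stated assumptions: Python's re module stands in for RE2, the function name extract_into_indexed_fields is hypothetical, and the connector's actual processing chains are not reproduced here.

```python
import json
import re

# ID number pattern from Example 3.
ID_PATTERN = re.compile(
    r"[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))"
    r"(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]"
)


def extract_into_indexed_fields(record, field, pattern):
    """Replace record[field] with one "<field>.<i>" entry per regex match.

    All other fields are copied unchanged, and the original field is dropped.
    """
    result = {key: value for key, value in record.items() if key != field}
    for i, match in enumerate(pattern.finditer(record[field])):
        result[f"{field}.{i}"] = match.group(0)
    return result


# Sample record from Example 3; the message field holds text, not parsed JSON.
record = {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message": (
        '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
        '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
        '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ),
}

processed = extract_into_indexed_fields(record, "message", ID_PATTERN)
print(json.dumps(processed, indent=2))
```

    The phone numbers and the digits in the email addresses are too short to satisfy the 18-character ID pattern, so only the three IDNumber values are extracted.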
    