The data processing feature of CKafka Connector can extract message content based on regular expressions. Regular expression extraction uses the open-source regular expression package RE2. Java's standard regular expression package java.util.regex and other widely used regular expression packages, such as PCRE, Perl, and Python (re), use a backtracking strategy. That is, when a pattern offers two alternatives, such as a|b, the engine first tries to match a. If that match fails, it resets the input stream and tries to match b.
If the pattern is deeply nested, this strategy can require an exponential number of passes over the input data. For a very long input string, matching can take so long that it effectively never finishes.
In contrast, the RE2J algorithm uses a nondeterministic finite automaton (NFA) to check all possible matches in a single pass over the input data, so regular expression matching completes in time linear in the input length.
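The difference between the two strategies can be illustrated with a toy matcher. The sketch below is not RE2 or RE2J code. It is a minimal model, with hypothetical function names, that supports only single characters and optional characters (such as a?), and it counts the work done by a naive backtracking matcher and by a state-set (NFA-style) simulation on the same pattern.

```python
def backtracking_steps(tokens, text):
    """Naive backtracking matcher. Returns (matched, number of recursive calls).

    tokens is a list of (char, optional) pairs, e.g. ("a", True) models a?.
    """
    steps = 0

    def match(ti, si):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return si == len(text)
        char, optional = tokens[ti]
        # First alternative: consume one input character.
        if si < len(text) and text[si] == char and match(ti + 1, si + 1):
            return True
        # Second alternative: go back to the same input position and skip the token.
        return optional and match(ti + 1, si)

    return match(0, 0), steps


def nfa_steps(tokens, text):
    """State-set simulation. Returns (matched, number of state visits)."""

    def closure(states):
        # Add every position reachable by skipping optional tokens.
        seen, stack = set(), list(states)
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            if t < len(tokens) and tokens[t][1]:
                stack.append(t + 1)
        return seen

    steps = 0
    current = closure({0})
    for ch in text:  # one pass over the input
        advanced = set()
        for t in current:
            steps += 1
            if t < len(tokens) and tokens[t][0] == ch:
                advanced.add(t + 1)
        current = closure(advanced)
    return len(tokens) in current, steps


if __name__ == "__main__":
    n = 12
    # Models the pattern a?{n}a{n} matched against n copies of "a".
    tokens = [("a", True)] * n + [("a", False)] * n
    text = "a" * n
    print("backtracking:", backtracking_steps(tokens, text))
    print("state set:   ", nfa_steps(tokens, text))
```

In this model the backtracking count roughly doubles each time n grows by one, while the state-set count grows with the product of pattern length and input length.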
Regular expression extraction is suited to extracting specific fields from messages that contain long arrays. Some common extraction patterns are described below.
Example 1: Extracting the Phone Number Field
Input message:
{"message":
[
{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
]
}
Output message:
{
"0": "\"phoneNumber\":\"13890000000\"",
"1": "\"phoneNumber\":\"15920000000\"",
"2": "\"phoneNumber\":\"18830000000\""
}
The regular expression used is:
"phoneNumber":"(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}"
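To try this pattern outside the connector, the extraction can be approximated in a few lines of Python. This is only a sketch: it uses Python's built-in re module (a backtracking engine, not RE2) as a stand-in, the helper name extract_matches is hypothetical, and the indexed output shape is copied from the example above. The character classes are written as [57] and [0-35-9], because inside brackets a | matches a literal pipe character rather than acting as an alternation.

```python
import json
import re

# Pattern from Example 1, written as a Python raw string.
PHONE_PATTERN = re.compile(
    r'"phoneNumber":"(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}"'
)


def extract_matches(pattern, text):
    """Return every full match, keyed by its position ("0", "1", ...)."""
    # group(0) is the whole match; the capture group is used only for alternation.
    return {str(i): m.group(0) for i, m in enumerate(pattern.finditer(text))}


raw_message = (
    '{"message":['
    '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
    '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
    '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ']}'
)

result = extract_matches(PHONE_PATTERN, raw_message)
print(json.dumps(result, indent=2))
```

Serializing the result with json.dumps escapes the inner double quotes, which is why the output message shows them with a leading backslash.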
Example 2: Extracting the Email Field
Input message:
{"message":
[
{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},
{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},
{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}
]
}
Output message:
{
"0": "\"email\":\"123456@qq.com\"",
"1": "\"email\":\"123456789@163.com\"",
"2": "\"email\":\"usr333@gmail.com\""
}
The regular expression used is:
"email":"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
Example 3: Extracting the ID Number Field
Input message:
{
"@timestamp": "2022-02-26T22:25:33.210Z",
"input_type": "log",
"operation": "INSERT",
"operator": "admin",
"message": "{\"email\":\"123456@qq.com\",\"phoneNumber\":\"13890000000\",\"IDNumber\":\"130423199301067425\"},{\"email\":\"123456789@163.com\",\"phoneNumber\":\"15920000000\",\"IDNumber\":\"610630199109235723\"},{\"email\":\"usr333@gmail.com\",\"phoneNumber\":\"18830000000\",\"IDNumber\":\"42060219880213301X\"}"
}
Output message (the other fields are retained, and the N IDNumber values are extracted from the message field into separate fields):
{
"@timestamp": "2022-02-26T22:25:33.210Z",
"input_type": "log",
"operation": "INSERT",
"operator": "admin",
"message.0": "130423199301067425",
"message.1": "610630199109235723",
"message.2": "42060219880213301X"
}
The regular expression used is:
[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]
This example uses multiple processing chains. After the first processing chain, the message field still needs further processing, which the second processing chain performs.
Processing result:
{
"@timestamp": "2022-02-26T22:25:33.210Z",
"input_type": "log",
"operation": "INSERT",
"operator": "admin",
"message.0": "130423199301067425",
"message.1": "610630199109235723",
"message.2": "42060219880213301X"
}
The required N IDNumber fields are extracted, the original message field is deleted, and other fields such as operation are retained.
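The overall effect of Example 3 can be reproduced locally with the sketch below. It is an approximation under stated assumptions: Python's re module stands in for RE2, the helper name extract_id_numbers is hypothetical, and both processing chains are collapsed into one function, so it mirrors only the final result, not the connector's intermediate steps.

```python
import re

# Pattern from Example 3, written as a Python raw string.
ID_PATTERN = re.compile(
    r"[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))"
    r"(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]"
)


def extract_id_numbers(record, field="message"):
    """Replace `field` with one `field.N` entry per ID number found in it."""
    # Keep every other field unchanged and drop the original field.
    output = {key: value for key, value in record.items() if key != field}
    for index, match in enumerate(ID_PATTERN.finditer(record[field])):
        output[f"{field}.{index}"] = match.group(0)
    return output


record = {
    "@timestamp": "2022-02-26T22:25:33.210Z",
    "input_type": "log",
    "operation": "INSERT",
    "operator": "admin",
    "message": (
        '{"email":"123456@qq.com","phoneNumber":"13890000000","IDNumber":"130423199301067425"},'
        '{"email":"123456789@163.com","phoneNumber":"15920000000","IDNumber":"610630199109235723"},'
        '{"email":"usr333@gmail.com","phoneNumber":"18830000000","IDNumber":"42060219880213301X"}'
    ),
}

processed = extract_id_numbers(record)
print(processed)
```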