Configuration Item | Description |
Directory Prefix | Log files will be shipped to the corresponding directory in the COS bucket, which is generally the address of the table location in a data warehouse model. |
Partition Format | A shipping task can automatically partition data by creation time. We recommend that you specify the partition format according to the Hive partitioned table format. For example, to partition by day, set /dt=%Y%m%d/test, where dt= is the partition field, %Y%m%d is the year, month, and day, and test is the log file prefix. By default, the name of a shipped file starts with an underscore (_), and big data computing engines ignore such files, so queries would find no data. Adding a file prefix avoids this problem. The actual partition directory name will then be, for example, dt=20220424. |
Shipping Interval | You can select an interval of 5–15 minutes. We recommend 15 minutes with a 250 MB file size. This setting produces fewer files, which improves query performance. |
Shipping Format | The JSON format is recommended. |
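To see how a partition format expands into a directory path, the following Python sketch applies the same strftime-style placeholders to a creation time. The function name partition_path, the directory prefix log_data, and the sample date are assumptions made for illustration; the sketch only mimics the placeholder substitution and is not the shipping service's actual implementation.

```python
from datetime import datetime


def partition_path(created_at: datetime,
                   directory_prefix: str = "log_data",
                   partition_format: str = "/dt=%Y%m%d/test") -> str:
    """Expand a strftime-style partition format for a log creation time.

    Hypothetical helper for illustration only: it substitutes %Y, %m and %d
    the way Python's strftime does and prepends the directory prefix.
    """
    return directory_prefix.rstrip("/") + created_at.strftime(partition_format)


# A log created on April 24, 2022 lands under the dt=20220424 partition,
# and the file name starts with the "test" prefix instead of an underscore.
print(partition_path(datetime(2022, 4, 24, 10, 30)))
# prints log_data/dt=20220424/test
```

Because the partition directory follows the Hive key=value convention, a table partitioned by dt can map each directory to one partition value.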
For the log_data logset, specific log files are stored in partition directories.
The location field must match the directory structure.
The __TIMESTAMP__ field is of the int type, but you may need to use the bigint type instead to meet your business requirements.

CREATE EXTERNAL TABLE IF NOT EXISTS `DataLakeCatalog`.`test`.`log_data` (
  `__FILENAME__` string,
  `__SOURCE__` string,
  `__TIMESTAMP__` bigint,
  `appId` string,
  `caller` string,
  `consumeTime` string,
  `data` string,
  `datacontenttype` string,
  `deliveryStatus` string,
  `errorResponse` string,
  `eventRuleId` string,
  `eventbusId` string,
  `eventbusType` string,
  `id` string,
  `logTime` string,
  `region` string,
  `requestId` string,
  `retryNum` string,
  `source` string,
  `sourceType` string,
  `specversion` string,
  `status` string,
  `subject` string,
  `tags` string,
  `targetId` string,
  `targetSource` string,
  `time` string,
  `type` string,
  `uin` string
)
PARTITIONED BY (`dt` string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'cosn://coreywei-1253240642/log_data/'
location must point to the cosn://coreywei-1253240642/log_data/ directory instead of the cosn://coreywei-1253240642/log_data/20220423/ directory.
If you use inference, location is temporarily set to a directory that contains data files, such as cosn://coreywei-1253240642/log_data/20220423/. After inference is completed, change location in the SQL statement back to cosn://coreywei-1253240642/log_data/.

You can use a SELECT statement to get data from a partitioned table only after adding partitions in one of the following two ways:

Repair the table so that all partition directories are discovered:
msck repair table DataLakeCatalog.test.log_data;

Add a specific partition manually:
alter table DataLakeCatalog.test.log_data add partition(dt='20220424');

After the partitions are added, query the data, for example:
select dt,count(1) from `DataLakeCatalog`.`test`.`log_data` group by dt;
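The alter table approach adds one partition per statement, so backfilling several days means repeating it for each dt value. The following Python sketch builds those statements for an inclusive date range. The function name add_partition_statements and the sample dates are hypothetical, and the sketch only generates SQL text; running the statements in the console or through a client is left to you.

```python
from datetime import date, timedelta
from typing import List


def add_partition_statements(table: str, start: date, end: date) -> List[str]:
    """Build one "alter table ... add partition" statement per day, inclusive.

    Hypothetical helper: assumes the table is partitioned by a dt string
    column whose values use the %Y%m%d layout (for example, 20220424).
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    statements = []
    day = start
    while day <= end:
        dt = day.strftime("%Y%m%d")
        statements.append(f"alter table {table} add partition(dt='{dt}');")
        day += timedelta(days=1)
    return statements


for sql in add_partition_statements(
        "DataLakeCatalog.test.log_data", date(2022, 4, 23), date(2022, 4, 24)):
    print(sql)
```

When many partition directories already exist, the msck repair table statement is usually the simpler choice, since it discovers them in a single pass.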