Content | Description |
Language types | Supports Mandarin, Cantonese, English, Korean, Japanese, Thai, Indonesian, Malay, Arabic, etc. The corresponding language type can be set through the interface parameter engine_model_type |
Supported industries | General, Finance, Gaming, Education, Medical |
Audio properties | Sampling rate: 16000 Hz or 8000 Hz; sampling accuracy (bit depth): 16 bits; audio track: mono |
Audio format | pcm,wav,opus,speex,silk,mp3,m4a,aac |
Request protocol | wss Protocol |
Request address | wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters} |
Interface Authentication | Signature authentication mechanism, see Signature generation |
Response Format | Unified JSON format |
Data Transmission | It is recommended to send one 40 ms audio packet every 40 ms (that is, at a 1:1 real-time rate). For PCM, a packet is 640 bytes at an 8k sampling rate and 1280 bytes at a 16k sampling rate. Sending audio faster than the 1:1 real-time rate, or leaving more than 6 seconds between audio packets, may cause engine errors; the backend will then return an error and close the connection |
Concurrency Limitation | By default, the concurrent connection limit for a single account is set to 20. If you need to increase this limit, please submit a ticket for consultation. |
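The packet sizes in the Data Transmission row follow from the audio properties: 16-bit mono PCM uses 2 bytes per sample, so 40 ms of audio is `sample_rate * 2 * 0.04` bytes. The Python sketch below shows that arithmetic and a simple chunker; the function names are illustrative and not part of the API.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # 16-bit samples, per the audio properties row
CHANNELS = 1          # mono


def pcm_chunk_size(sample_rate_hz: int, chunk_ms: int = 40) -> int:
    """Return the number of raw PCM bytes that cover chunk_ms of audio."""
    return sample_rate_hz * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_pcm_chunks(pcm: bytes, sample_rate_hz: int, chunk_ms: int = 40) -> Iterator[bytes]:
    """Yield consecutive chunks of the buffer; the last chunk may be shorter."""
    size = pcm_chunk_size(sample_rate_hz, chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

For example, `pcm_chunk_size(8000)` gives 640 and `pcm_chunk_size(16000)` gives 1280, matching the values in the table.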
Field name | Type | Description |
code | Integer | Status Codes: 0 means normal, non-zero values indicate an error occurred |
message | String | Error Description: Displays the specific reason for the error. This text may be frequently updated or changed as the business develops or to optimize the user experience |
voice_id | String | Unique Audio Stream ID: Generated by the client during the handshake phase and assigned in the call parameters |
message_id | String | Unique message ID |
result | Result | Latest ASR Result |
final | Integer | When this field returns 1, it indicates that the audio stream recognition is complete |
Field name | Type | Description |
slice_type | Integer | Recognition result type. 0: recognition of a new paragraph starts; 1: a paragraph is being recognized and voice_text_str is a non-steady-state result (it may still change); 2: recognition of a paragraph is complete and voice_text_str is a steady-state result (it will no longer change). Depending on the transmitted audio, the possible slice_type sequences during recognition are: 0-1-2 (start, in progress with possibly multiple 1 messages, complete); 0-2 (start, complete); 2 (the complete result of a paragraph is returned directly) |
index | Integer | The current paragraph result's sequence number in the entire audio stream, incrementing from 0 for each sentence |
start_time | Integer | The start time of the current paragraph result in the entire audio stream |
end_time | Integer | The end time of the current paragraph result in the entire audio stream |
voice_text_str | String | The text result of the current paragraph, encoded in UTF-8 |
word_size | Integer | The number of words in the current paragraph result |
word_list | Word Array | The word list for the current paragraph. Each Word has the fields: word (String), the content of the word; start_time (Integer), the start time of the word in the entire audio stream; end_time (Integer), the end time of the word in the entire audio stream; stable_flag (Integer), the steady-state flag of the word, where 0 means the word may change in subsequent recognition and 1 means it will not change |
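Because a paragraph can be reported several times before it becomes steady-state, a client usually keeps the latest unstable text separate from the finished paragraphs. The Python sketch below is one way to do that with the result fields described above. The class name and the `separator` option are hypothetical, and joining paragraphs with a space is an assumption that suits English better than Chinese.

```python
import json
from typing import Dict, Optional


class TranscriptAssembler:
    """Tracks steady-state paragraphs by index plus the paragraph in progress."""

    def __init__(self, separator: str = " ") -> None:
        self.separator = separator          # assumption: "" may suit Chinese text better
        self.stable: Dict[int, str] = {}    # index -> steady-state text (slice_type 2)
        self.pending: Optional[str] = None  # latest non-steady-state text (slice_type 0 or 1)

    def feed(self, raw_message: str) -> None:
        """Consume one JSON text message; messages without a result are ignored."""
        result = json.loads(raw_message).get("result")
        if result is None:
            return
        text = result.get("voice_text_str", "")
        if result["slice_type"] == 2:
            # Steady-state: store under the paragraph index and clear the draft.
            self.stable[result["index"]] = text
            self.pending = None
        else:
            # slice_type 0 or 1: the text may still change, so keep it as a draft.
            self.pending = text

    def text(self) -> str:
        """Return finished paragraphs in index order, followed by the current draft."""
        parts = [self.stable[i] for i in sorted(self.stable)]
        if self.pending:
            parts.append(self.pending)
        return self.separator.join(parts)
```

Calling `feed` on each incoming text message and `text()` whenever the display needs refreshing keeps unstable text from being committed too early.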
wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
key1=value1&key2=value2...(both key and value need to be URL encoded)
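A minimal Python sketch of the URL-encoding requirement above, using only the standard library. The function name `build_query` is illustrative. Percent-encoding spaces as `%20` rather than `+` is an assumption, since the source does not say which form the service expects.

```python
from urllib.parse import quote, urlencode


def build_query(params: dict) -> str:
    """Percent-encode every key and value and join them as key=value pairs."""
    # quote_via=quote encodes spaces as %20; safe="" leaves no reserved
    # character unescaped.
    return urlencode(params, quote_via=quote, safe="")
```

For instance, a `hotword_list` value such as `Tencent Cloud|5` becomes `Tencent%20Cloud%7C5` in the query string.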
Parameter name | Required | Type | Description |
secretid | Yes | String | SecretId of the Tencent Cloud registered account, can be obtained through the API Key Management Page |
timestamp | Yes | Integer | Current UNIX timestamp in seconds. If the difference with the current time is too large, it will cause a signature expiration error |
expired | Yes | Integer | The UNIX timestamp at which the signature expires, in seconds. expired must be greater than timestamp, and expired - timestamp must be less than 90 days |
nonce | Yes | Integer | Random positive integer. Users need to generate it themselves, up to 10 digits |
engine_model_type | Yes | String | Engine model type. Telephone scenarios: • 8k_zh: Chinese Telephone General; • 8k_en: English Telephone General. Non-telephone scenarios: • 16k_zh_large: Chinese (Mandarin and dialects) and English large model engine [large model version]. This model supports recognition of Chinese, English, various Chinese dialects, etc.; it has a large number of parameters and stronger language model performance, and greatly improves recognition accuracy on low-quality audio such as high noise, high echo, low voice volume, and distant voices; • 16k_zh: Mandarin Chinese General; • 16k_yue: Cantonese; • 16k_zh-TW: Traditional Chinese; • 16k_ar: Arabic; • 16k_en: English; • 16k_ko: Korean; • 16k_ja: Japanese; • 16k_th: Thai; • 16k_id: Indonesian; • 16k_ms: Malay |
voice_id | Yes | String | A 16-character string that uniquely identifies each audio stream, generated by the user |
voice_format | No | Integer | Speech encoding format. Optional; the default value is 4. 1: pcm; 4: speex (sp); 6: silk; 8: mp3; 10: opus (see the opus format audio stream packaging description); 12: wav; 14: m4a (each fragment must be a complete m4a audio); 16: aac |
needvad | No | Integer | 0: disable vad; 1: enable vad. If the audio fragment length exceeds 60 seconds, users need to enable vad (voice activity detection) |
hotword_id | No | String | Hotword list id. If this parameter is not set, the default hotword list will automatically take effect. If this parameter is set, the corresponding hotword list will take effect |
reinforce_hotword | No | Integer | Hotword enhancement feature. 0: disabled; 1: enabled. Default is 0. When enabled (only supported for 8k_zh and 16k_zh), homophone replacement is activated: words that sound the same as a word in the hotword list are replaced with that hotword. For example, after setting the hotword "蜜制" and enabling the enhancement feature, recognition results with the same pronunciation (mizhi) as "蜜制", such as "秘制" and "蜜汁", will be forcibly replaced with "蜜制". It is therefore recommended to enable this feature only where it fits the actual use case. |
customization_id | No | String | Self-learning model id. If this parameter is not set, the last self-learning model to go online will automatically take effect. If this parameter is set, the corresponding self-learning model will take effect |
filter_dirty | No | Integer | Whether to filter profanity (currently supports Mandarin Chinese engine). The default is 0. 0: Do not filter profanity; 1: Filter profanity; 2: Replace profanity with " * " |
filter_modal | No | Integer | Whether to filter modal particles (currently supports Mandarin Chinese engine). The default is 0. 0: Do not filter modal particles; 1: Partially filter; 2: Strictly filter |
filter_punc | No | Integer | Whether to filter sentence-ending periods (currently supports Mandarin Chinese engine). The default is 0. 0: Do not filter sentence-ending periods; 1: Filter sentence-ending periods |
filter_empty_result | No | Integer | Whether to call back for empty results. Default is 1. 0: call back empty results; 1: do not call back empty results. Note: if you need slice_type=0 and slice_type=2 callbacks to be returned in pairs, set filter_empty_result=0. Paired returns are typically required in outbound call scenarios, where slice_type=0 is used to determine the presence of a human voice. |
convert_num_mode | No | Integer | Whether to perform intelligent conversion of Arabic numerals (currently supported by the Mandarin Chinese engine). 0: do not convert, output Chinese numerals directly, 1: intelligently convert to Arabic numerals according to the scenario, 3: enable math-related digit conversion. Default is 1 |
word_info | No | Integer | Whether to display word-level timestamps. 0: do not display; 1: display, without punctuation timestamps; 2: display, with punctuation timestamps. Default is 0. Supported engines: 8k_en, 8k_zh, 8k_zh_finance, 16k_zh, 16k_en, 16k_ca, 16k_zh-TW, 16k_ja, 16k_wuu-SH |
vad_silence_time | No | Integer | Speech segmentation detection threshold. Silence duration exceeding this threshold will be considered a break (mainly used in intelligent customer service scenarios, needs to be used with needvad = 1). The value range is 240-2000 ms. It is recommended not to adjust this parameter arbitrarily as it may affect recognition results. Currently, it is only supported by the 8k_zh, 8k_zh_finance, and 16k_zh engine models |
max_speak_time | No | Integer | Forced segmentation feature. Value range: 5000-90000 (unit: milliseconds); the default value is 0 (not enabled). When speech continues without interruption, this parameter forces segmentation (the result becomes steady-state, slice_type=2). For example, in a game commentary scenario where the commentator speaks continuously and the speech cannot otherwise be segmented, setting this parameter to 10000 produces a slice_type=2 callback every 10 seconds. |
noise_threshold | No | Float | Noise parameter threshold, default is 0, range: [-1,1]. For some audio segments, the higher the value, the more likely it is to be detected as noise. The lower the value, the more likely it is to be detected as human voice. Use with caution: may affect recognition accuracy |
signature | Yes | String | API signature parameter |
hotword_list | No | String | Temporary hotword list: This parameter is used to improve recognition accuracy. Single hotword limit: "hotword|weight", each hotword should not exceed 30 characters (maximum 10 Chinese characters), weight ranges from 1-11, e.g., "Tencent Cloud|5" or "ASR|11"; Temporary hotword list limit: Multiple hotwords are separated by English commas, with a maximum of 128 hotwords supported, e.g., "Tencent Cloud|10, ASR|5, ASR|11"; Difference between the parameters hotword_id (hotword list) and hotword_list (temporary hotword list): hotword_id: hotword list. You need to create a hotword list in the console or via the API to obtain the corresponding hotword_id to use the hotword feature; hotword_list: temporary hotword list. Directly pass in the temporary hotword list for each request to use the hotword feature. The cloud does not retain the temporary hotword list. Suitable for users with a large demand for hotwords; Note: If both hotword_id and hotword_list are passed in, hotword_list will be used first; When the hotword weight is set to 11, the current hotword will be upgraded to a super hotword. It is recommended to set only important and must-be-effective hotwords to 11. Setting too many hotwords with a weight of 11 will affect the overall word accuracy rate. |
input_sample_rate | No | Integer | Allows 8k audio in pcm format to be upsampled to 16k when its sampling rate does not match the engine, which improves recognition accuracy. Only the value 8000 is supported. For example, passing 8000 declares that the pcm audio has an 8k sampling rate; if the selected engine is 16k_zh, the 8k pcm audio can then be recognized normally by the 16k_zh engine. Note: this parameter applies only to pcm format audio. If no value is passed, the default behavior is kept, that is, the pcm audio sampling rate is assumed to equal the engine sampling rate. |
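The hotword_list limits described in the table can be checked on the client before the request is sent. The Python sketch below builds the parameter value from (hotword, weight) pairs. The function name is hypothetical. Treating the limit of 30 characters (10 Chinese characters) as 30 UTF-8 bytes is an assumption, as is rejecting hotwords that contain the `|` or `,` separators.

```python
from typing import Iterable, Tuple

MAX_HOTWORDS = 128       # "maximum of 128 hotwords"
MAX_HOTWORD_BYTES = 30   # assumption: 30 characters / 10 Chinese characters = 30 UTF-8 bytes
MIN_WEIGHT, MAX_WEIGHT = 1, 11


def build_hotword_list(hotwords: Iterable[Tuple[str, int]]) -> str:
    """Validate (hotword, weight) pairs and join them as 'word|weight,word|weight'."""
    items = list(hotwords)
    if len(items) > MAX_HOTWORDS:
        raise ValueError(f"at most {MAX_HOTWORDS} hotwords are supported")
    parts = []
    for word, weight in items:
        if not word or len(word.encode("utf-8")) > MAX_HOTWORD_BYTES:
            raise ValueError(f"hotword {word!r} is empty or too long")
        if "|" in word or "," in word:
            raise ValueError(f"hotword {word!r} contains a separator character")
        if not MIN_WEIGHT <= weight <= MAX_WEIGHT:
            raise ValueError(f"weight {weight} is outside {MIN_WEIGHT}-{MAX_WEIGHT}")
        parts.append(f"{word}|{weight}")
    return ",".join(parts)
```

The returned string still has to be URL-encoded like any other parameter value before it goes into the request address.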
asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
HepdTRX6u155qIPKNKC+3U0j1N0=
wss://asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
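The four lines above can be read as a recipe: build the signing string, compute HMAC-SHA1 with the SecretKey, Base64-encode the digest, then append the URL-encoded signature to the request address. The Python sketch below follows that reading with the standard library. The function names are illustrative. Sorting parameters alphabetically by name is inferred from the order in the example, and signing the raw (not URL-encoded) values is an assumption.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

HOST_PATH = "asr.cloud.tencent.com/asr/v2/"


def signing_string(appid: str, params: dict) -> str:
    """Host, path and parameters sorted by name, without the wss:// scheme."""
    query = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return f"{HOST_PATH}{appid}?{query}"


def sign(plain: str, secret_key: str) -> str:
    """Base64-encoded HMAC-SHA1 of the signing string, keyed with the SecretKey."""
    digest = hmac.new(secret_key.encode("utf-8"), plain.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def signed_url(appid: str, params: dict, secret_key: str) -> str:
    """Request address with URL-encoded parameters and the signature appended."""
    signature = sign(signing_string(appid, params), secret_key)
    encoded = "&".join(
        f"{quote(str(key), safe='')}={quote(str(params[key]), safe='')}"
        for key in sorted(params)
    )
    return f"wss://{HOST_PATH}{appid}?{encoded}&signature={quote(signature, safe='')}"
```

URL-encoding the signature matters because Base64 output can contain `+`, `/` and `=`, which is why the example address ends in `%2B3U0j1N0%3D`.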
OpusHead (4 bytes) | Frame data length (2 bytes) | An Opus frame of compressed data |
opus | Length len | Corresponding opus decode data of length len |
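The packaging layout above (a 4-byte `opus` head, a 2-byte frame length, then the compressed frame) can be sketched in Python as follows. The function name is illustrative. The byte order of the length field is not stated in this section, so it is left as a required argument rather than guessed.

```python
OPUS_HEAD = b"opus"  # 4-byte head shown in the packaging table


def pack_opus_frame(frame: bytes, byteorder: str) -> bytes:
    """Prefix one compressed Opus frame with the head and its 2-byte length.

    byteorder must be "little" or "big"; which one the service expects is
    not specified here and has to be confirmed separately.
    """
    if len(frame) > 0xFFFF:
        raise ValueError("frame does not fit in a 2-byte length field")
    return OPUS_HEAD + len(frame).to_bytes(2, byteorder) + frame
```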
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}
{"type": "end"}
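The `{"type": "end"}` line above appears to be the text message a client sends once all audio has been uploaded; that reading is an assumption, since the surrounding explanation is not part of this section. Under that assumption, the upload loop can be sketched in Python as below. `send_binary` and `send_text` are hypothetical callables standing in for whatever WebSocket client is used.

```python
import json
import time
from typing import Callable, Iterable

END_MESSAGE = json.dumps({"type": "end"})


def stream_audio(
    chunks: Iterable[bytes],
    send_binary: Callable[[bytes], None],
    send_text: Callable[[str], None],
    interval_s: float = 0.04,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send each chunk, pausing interval_s between sends, then send the end message.

    Returns the number of audio chunks sent.
    """
    sent = 0
    for chunk in chunks:
        send_binary(chunk)
        sent += 1
        sleep(interval_s)  # roughly 1:1 real time when each chunk holds 40 ms of audio
    send_text(END_MESSAGE)
    return sent
```

Passing `sleep` as an argument keeps the pacing testable. A fixed sleep also ignores the time spent sending, so a production client might pace against a monotonic clock instead.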
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real-time","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time ASR","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
{"code":4008,"message":"background recognition server audio shard wait timeout","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}
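The sample messages above differ only in which fields they carry, so a client can route them with a few checks. The Python sketch below is one possible dispatcher. The function name and the category labels are illustrative, and treating a successful message that has neither `result` nor `final` as an acknowledgement is an assumption based on the first sample.

```python
import json


def classify_message(raw: str) -> str:
    """Return 'error', 'final', 'result' or 'ack' for one JSON text message."""
    message = json.loads(raw)
    if message.get("code", 0) != 0:
        return "error"   # non-zero code: see the message field for the reason
    if message.get("final") == 1:
        return "final"   # recognition of the whole audio stream is complete
    if "result" in message:
        return "result"  # carries the latest recognition result
    return "ack"         # assumption: success message with no payload
```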
Value | Description |
4001 | Invalid Parameter, see message for details |
4002 | Authentication failed |
4003 | AppID Service Not Activated, please activate the service in the console |
4004 | No available free quota |
4005 | Account arrears. Service stopped, please recharge in time |
4006 | The account's concurrent calls limit is exceeded |
4007 | Audio decoding failed. Please check that the uploaded audio data format is consistent with the call parameters |
4008 | Client data upload timeout |
4009 | Client connection closed |
4010 | Client uploaded an unknown text message |
5000 | Background error, please try again |
5001 | Background recognition server recognition failure, please try again |
5002 | Background recognition server recognition failure, please try again |
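Only the 5000-range entries in the table say "please try again", which suggests a simple retry rule. The Python sketch below encodes that reading; the names are illustrative, and treating every 4000-range code as non-retryable is an assumption rather than documented behavior.

```python
# Codes whose descriptions in the table end with "please try again".
RETRYABLE_CODES = frozenset({5000, 5001, 5002})


def should_retry(code: int) -> bool:
    """Return True when the error table suggests the request can be retried."""
    return code in RETRYABLE_CODES
```

A 4000-range error usually needs a change on the client side first, such as fixing the parameters (4001) or the signature (4002).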