Content | Description |
Language type | Supports Mandarin, Cantonese, English, Korean, Japanese, Thai, Indonesian, Malay, and Arabic. The corresponding language type can be set through the API parameter engine_model_type. |
Supported industries | General, finance, gaming, education, health care |
Audio attributes | Sampling rate: 16000 Hz or 8000 Hz; bit depth: 16 bits; channels: mono |
Audio format | pcm, wav, opus, speex, silk, mp3, m4a, aac |
Request protocol | wss |
Request URL | wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters} |
API authentication | Signature authentication. For details, see Signature Generation. |
Response format | All responses use JSON. |
Data transmission | Send one 40 ms audio packet every 40 ms (a 1:1 real-time rate). For pcm, this corresponds to 640 bytes at an 8k sampling rate and 1280 bytes at a 16k sampling rate. If audio is sent faster than the 1:1 real-time rate, or if the interval between audio packets exceeds 6 seconds, the engine may fail; the backend then returns an error and closes the connection. |
Concurrency limit | By default, a single account is limited to 20 concurrent connections. To raise the limit, submit a ticket. |
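The packet sizes in the data transmission row follow from sample rate × packet duration × 2 bytes per 16-bit mono sample. The sketch below reproduces that arithmetic and paces packets at roughly the 1:1 real-time rate. The function names and the `send` callback are hypothetical and stand in for whatever WebSocket client is used.

```python
import time
from typing import Callable, Iterator


def packet_size_bytes(sample_rate_hz: int, packet_ms: int = 40,
                      bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM packet covering packet_ms of audio."""
    return sample_rate_hz * packet_ms // 1000 * bytes_per_sample * channels


def iter_packets(pcm: bytes, sample_rate_hz: int) -> Iterator[bytes]:
    """Split raw 16-bit mono PCM into 40 ms packets (the last may be shorter)."""
    size = packet_size_bytes(sample_rate_hz)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


def send_paced(pcm: bytes, sample_rate_hz: int,
               send: Callable[[bytes], None],
               packet_ms: int = 40,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Send one packet, then wait one packet duration before the next.

    `send` is a placeholder for the client's binary-frame send call.
    """
    for packet in iter_packets(pcm, sample_rate_hz):
        send(packet)
        sleep(packet_ms / 1000)


print(packet_size_bytes(8000), packet_size_bytes(16000))
```

Sleeping a fixed 40 ms after each send makes the stream slightly slower than real time, which stays on the safe side of the 1:1 limit.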
Field Name | Type | Description |
code | Integer | Status code; 0 indicates normal operation, non-zero values indicate errors. |
message | String | Error description. When an error occurs, this field gives the reason. The wording may be updated frequently as the service evolves. |
voice_id | String | Unique ID of the audio stream. The client generates it during the handshake phase and passes it as a request parameter. |
message_id | String | Unique ID of this message. |
result | Result | Latest speech recognition result. |
final | Integer | When this field returns 1, it means the audio stream recognition is completed. |
Field Name | Type | Description |
slice_type | Integer | Recognition result type. 0: sentence recognition has started. 1: sentence recognition is in progress; voice_text_str is an unstable result that may still change. 2: sentence recognition has ended; voice_text_str is a stable result that will no longer change. While audio is being sent, the following slice_type sequences may be returned: 0-1-2 (start, in progress with possibly multiple 1s, end); 0-2 (start, end); 2 (the complete recognition result of a sentence is returned directly). |
index | Integer | Sequence number of the current sentence in the entire audio stream, starting from 0 and incrementing sentence by sentence |
start_time | Integer | Start time of the current sentence in the audio stream |
end_time | Integer | End time of the current sentence in the audio stream |
voice_text_str | String | Text result of the current sentence, UTF-8 encoded |
word_size | Integer | Number of word results in the current sentence |
word_list | Word Array | Word list of the current sentence. Word structure: word (String): content of the word; start_time (Integer): start time of the word in the entire audio stream; end_time (Integer): end time of the word in the entire audio stream; stable_flag (Integer): stability of the word, where 0 means the word may change in subsequent recognition and 1 means it will not change. |
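To illustrate how the `index` and `slice_type` fields work together, here is a minimal sketch of a client-side accumulator: unstable results (slice_type 0 or 1) overwrite the pending text for a sentence, and a stable result (slice_type 2) fixes it. The `Transcript` class and the separator used when joining sentences are assumptions for illustration, not part of the API.

```python
import json
from typing import Dict


class Transcript:
    """Collects sentence text keyed by the `index` field of each result."""

    def __init__(self, separator: str = " ") -> None:
        self.separator = separator          # assumption: how sentences are joined
        self.stable: Dict[int, str] = {}    # sentences finished with slice_type == 2
        self.pending: Dict[int, str] = {}   # latest unstable text per sentence

    def update(self, raw_message: str) -> None:
        msg = json.loads(raw_message)
        if msg.get("code", 0) != 0:
            raise RuntimeError(f"ASR error {msg.get('code')}: {msg.get('message')}")
        result = msg.get("result")
        if result is None:                  # message carries no recognition result
            return
        index = result["index"]
        text = result["voice_text_str"]
        if result["slice_type"] == 2:
            self.stable[index] = text
            self.pending.pop(index, None)
        else:
            self.pending[index] = text

    def text(self) -> str:
        """Stable sentences plus the latest unstable ones, in index order."""
        merged = {**self.pending, **self.stable}
        return self.separator.join(merged[i] for i in sorted(merged))
```

A display layer would typically re-render `text()` after every update, since unstable text may still change until the sentence reaches slice_type 2.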
wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
key1=value1&key2=value2... (URL-encode both key and value)
Parameter Name | Required | Type | Description |
secretid | Yes | String | |
timestamp | Yes | Integer | Current UNIX timestamp, in seconds. If it differs too much from the current time, a signature expiration error is returned. |
expired | Yes | Integer | UNIX timestamp at which the signature expires, in seconds. expired must be greater than timestamp, and expired - timestamp must be less than 90 days. |
nonce | Yes | Integer | Random positive integer of up to 10 digits, generated by the user. |
engine_model_type | Yes | String | Engine model type. Phone call scenarios: 8k_zh (Chinese, telephone, general); 8k_en (English, telephone, general). Non-phone call scenarios: 16k_zh_large (large model engine for Mandarin, Chinese dialects, and English; it recognizes Chinese, English, and multiple Chinese dialects, has a large number of parameters, and features enhanced language model performance, which greatly improves recognition accuracy on low-quality audio such as loud noise, strong echo, low volume, and far-field speech); 16k_zh (Mandarin, general); 16k_yue (Cantonese); 16k_zh-TW (Chinese, Traditional); 16k_ar (Arabic); 16k_en (English); 16k_ko (Korean); 16k_ja (Japanese); 16k_th (Thai); 16k_id (Indonesian); 16k_ms (Malay). |
voice_id | Yes | String | A 16-character string, generated by the user, that uniquely identifies each audio stream. |
voice_format | No | Integer | Audio encoding format. Optional; default is 4. 1: pcm; 4: speex(sp); 6: silk; 8: mp3; 10: opus (see the Opus format audio stream packaging instructions); 12: wav; 14: m4a (each segment must be a complete m4a audio); 16: aac |
needvad | No | Integer | 0: disable VAD; 1: enable VAD. If a voice segment exceeds 60 seconds, enable VAD (voice activity detection and segmentation). |
hotword_id | No | String | Hotword list ID. If this parameter is not set, the default hotword list takes effect automatically; if it is set, the hotword list with the specified ID takes effect. |
reinforce_hotword | No | Integer | Enhanced hotword feature. 0: disabled (default); 1: enabled. When enabled (supported only by 8k_zh and 16k_zh), homophone replacement is applied, with the homophones configured as hotwords. For example, after the hotword "蜜制" is set and the feature is enabled, recognized words with the same pronunciation (mizhi), such as "秘制" and "蜜汁", are forcibly replaced with "蜜制". Enable this feature only if forced replacement suits your use case. |
customization_id | No | String | Self-learning model ID. If this parameter is not set, the most recently launched self-learning model takes effect automatically; if it is set, the self-learning model with the specified ID takes effect. |
filter_dirty | No | Integer | Whether to filter profanity (currently supported by the Mandarin engine). Default is 0. 0: do not filter profanity; 1: filter profanity; 2: replace profanity with "*". |
filter_modal | No | Integer | Whether to filter modal particles (currently supported by the Mandarin engine). Default is 0. 0: do not filter modal particles; 1: partial filtering; 2: strict filtering. |
filter_punc | No | Integer | Whether to filter the period at the end of a sentence (currently supported by the Mandarin engine). Default is 0. 0: do not filter; 1: filter. |
filter_empty_result | No | Integer | Whether to return empty recognition results. Default is 1. 0: return empty results; 1: do not return empty results. Note: if slice_type=0 and slice_type=2 callbacks must arrive in pairs, set filter_empty_result=0. Paired callbacks are typically needed in outbound call scenarios, where slice_type=0 is used to detect whether speech has occurred. |
convert_num_mode | No | Integer | Whether to intelligently convert numbers to Arabic numerals (currently supported by the Mandarin engine). 0: do not convert; output Chinese numerals directly. 1: intelligently convert to Arabic numerals based on the scenario. 3: enable math-related number conversion. Default is 1. |
word_info | No | Integer | Whether to return word-level timestamps. 0: do not return; 1: return, excluding punctuation timestamps; 2: return, including punctuation timestamps. Supported engines: 8k_en, 8k_zh, 8k_zh_finance, 16k_zh, 16k_en, 16k_ca, 16k_zh-TW, 16k_ja, 16k_wuu-SH. Default is 0. |
vad_silence_time | No | Integer | Silence threshold for sentence segmentation. Silence longer than the threshold is treated as a sentence break (commonly used in customer service scenarios; requires needvad = 1). Value range: 240 to 2000 ms. Do not adjust this parameter arbitrarily, as it may affect recognition performance. Currently supported only by the 8k_zh, 8k_zh_finance, and 16k_zh engine models. |
max_speak_time | No | Integer | Forced segmentation. Value range: 5000 to 90000 (unit: ms); default is 0 (disabled). When speech continues without a pause, this parameter forces segmentation, at which point the result becomes stable (slice_type=2). For example, in game commentary, where the commentator speaks continuously and sentences cannot be segmented, setting this parameter to 10000 yields a slice_type=2 callback every 10 seconds. |
noise_threshold | No | Float | Noise threshold. Default is 0; value range: -1 to 1. The larger the value, the more likely an audio segment is classified as noise; the smaller the value, the more likely it is classified as speech. Use with caution: this parameter may affect recognition accuracy. |
signature | Yes | String | API signature parameters |
hotword_list | No | String | Temporary hotword list, used to improve recognition accuracy. Format of a single hotword: "hotword|weight". Each hotword may contain at most 30 characters (at most 10 Chinese characters), and the weight ranges from 1 to 11, for example "Tencent Cloud|5" or "ASR|11". Format of the list: multiple hotwords separated by commas, up to 128 hotwords, for example "Tencent Cloud|10, speech recognition|5, ASR|11". Difference between hotword_id and hotword_list: hotword_id refers to a stored hotword list, which you must first create in the console or via the API and then pass by its hotword_id; hotword_list is a temporary hotword list that is passed directly with each request and is not retained in the cloud, which suits users who need a very large number of hotwords. Note: if both hotword_id and hotword_list are provided, hotword_list takes precedence. A hotword with weight 11 is promoted to a super hotword. Assign weight 11 only to important hotwords that must take effect; setting too many hotwords to weight 11 reduces overall accuracy. |
input_sample_rate | No | Integer | Allows 8k pcm audio to be upsampled to 16k for recognition when its sampling rate does not match the engine's, which can effectively improve recognition accuracy. Only 8000 is supported. For example, passing 8000 declares that the pcm audio is sampled at 8k; with the 16k_zh engine selected, the 8k pcm audio is then recognized normally. Note: this parameter applies only to pcm audio. If it is not set, the engine's sampling rate is assumed to equal the sampling rate of the pcm audio. |
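The hotword_list constraints above (weight 1 to 11, at most 128 entries, length limit per hotword) can be checked on the client before a request is built. The sketch below is one possible validator. It assumes the length limit is counted in UTF-8 bytes, because 30 bytes matches both 30 ASCII characters and 10 Chinese characters; it also joins entries with a bare comma, whereas the example in the table shows a space after each comma. The helper name `build_hotword_list` is hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_HOTWORDS = 128        # "supports up to 128 hotwords"
MAX_HOTWORD_BYTES = 30    # assumption: the 30-character limit counted in UTF-8 bytes


def build_hotword_list(entries: Iterable[Tuple[str, int]]) -> str:
    """Return a "word|weight,word|weight" string after validating each entry."""
    items: List[Tuple[str, int]] = list(entries)
    if len(items) > MAX_HOTWORDS:
        raise ValueError(f"at most {MAX_HOTWORDS} hotwords are supported")
    parts = []
    for word, weight in items:
        if not word or len(word.encode("utf-8")) > MAX_HOTWORD_BYTES:
            raise ValueError(f"hotword {word!r} is empty or too long")
        if "|" in word or "," in word:
            raise ValueError(f"hotword {word!r} contains a separator character")
        if not 1 <= weight <= 11:
            raise ValueError(f"weight for {word!r} must be between 1 and 11")
        parts.append(f"{word}|{weight}")
    return ",".join(parts)


print(build_hotword_list([("Tencent Cloud", 10), ("speech recognition", 5), ("ASR", 11)]))
```

Because the value contains `|` and possibly non-ASCII text, it needs URL encoding when placed in the request URL.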
asr.cloud.tencent.com/asr/v2/125922**?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/125922**?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
HepdTRX6u155qIPKNKC+3U0j1N0=
wss://asr.cloud.tencent.com/asr/v2/125922***?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
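The four lines above show the signature steps: build the source string (host, path, and parameters, without the `wss://` prefix), apply HMAC-SHA1 with the secret key, Base64-encode the digest, and append the URL-encoded result as `signature`. A minimal Python sketch of those steps follows. Sorting parameters by name is inferred from the alphabetical order in the example rather than stated in the text, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

HOST_PATH = "asr.cloud.tencent.com/asr/v2"


def signature_source(appid: str, params: dict) -> str:
    """Source string to sign: host + path + parameters, unencoded.

    Assumption: parameters are sorted by name, as in the example above.
    """
    query = "&".join(f"{name}={params[name]}" for name in sorted(params))
    return f"{HOST_PATH}/{appid}?{query}"


def sign(source: str, secret_key: str) -> str:
    """Base64(HMAC-SHA1(source, secret_key))."""
    digest = hmac.new(secret_key.encode("utf-8"), source.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def build_url(appid: str, secret_key: str, params: dict) -> str:
    """Final wss URL with URL-encoded parameters and signature appended."""
    signature = sign(signature_source(appid, params), secret_key)
    query = "&".join(
        f"{quote(str(name), safe='')}={quote(str(params[name]), safe='')}"
        for name in sorted(params)
    )
    return f"wss://{HOST_PATH}/{appid}?{query}&signature={quote(signature, safe='')}"
```

URL-encoding the signature matters because Base64 output can contain `+`, `/`, and `=`, as the `%2B` and `%3D` in the example URL show.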
OpusHead (4 bytes) | Frame data length (2 bytes) | Compressed data of one Opus frame |
opus | length | length of opus decode data |
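The layout above can be sketched as a small packing helper: a 4-byte `opus` head, a 2-byte frame length, then the compressed frame. The byte order of the length field is not stated here, so the little-endian default below is an assumption that should be checked against the packaging instructions; the function name is hypothetical.

```python
import struct

OPUS_HEAD = b"opus"   # 4-byte head, per the table above


def pack_opus_frame(frame: bytes, byteorder: str = "<") -> bytes:
    """Wrap one compressed Opus frame as head + 2-byte length + data.

    byteorder: "<" (little-endian, assumed default) or ">" (big-endian).
    """
    if byteorder not in ("<", ">"):
        raise ValueError("byteorder must be '<' or '>'")
    if len(frame) > 0xFFFF:
        raise ValueError("frame does not fit in a 2-byte length field")
    return OPUS_HEAD + struct.pack(f"{byteorder}H", len(frame)) + frame
```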
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}
{"type": "end"}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real time","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time speech recognition","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
{"code":4008,"message":"Background recognition server audio fragment waiting timeout","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}
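The sample messages above fall into four shapes that a client can tell apart from the documented fields alone: a non-zero `code` is an error, `final` equal to 1 marks the end of recognition, a `result` object carries recognition text, and a success message with none of these is taken here to be the handshake acknowledgement. The sketch below encodes that reading; the category names and the `classify` function are hypothetical.

```python
import json

# Text message shown above; presumably sent by the client once all audio is uploaded.
END_MESSAGE = json.dumps({"type": "end"})


def classify(raw_message: str) -> str:
    """Return "error", "final", "result", or "handshake" for a server message."""
    msg = json.loads(raw_message)
    if msg.get("code", 0) != 0:
        return "error"
    if msg.get("final") == 1:
        return "final"
    if "result" in msg:
        return "result"
    return "handshake"   # assumption: success message with no result and no final flag
```

A receive loop would typically stop reading on "final" or "error", since the backend closes the connection after reporting an error.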
Value | Description |
4001 | Invalid parameters, see the message for details. |
4002 | Authentication failed. |
4003 | Service not activated. Please activate the service in the console. |
4004 | No free quota available. |
4005 | Service suspended because the account is in arrears. Please top up promptly. |
4006 | The account has reached its concurrency limit. |
4007 | Audio decoding failed. Check that the format of the uploaded audio data matches the API call parameters. |
4008 | Client data upload timed out. |
4009 | Client connection was disconnected. |
4010 | The client sent an unknown text message. |
5000 | Backend error. Retry. |
5001 | Backend recognition server failed to recognize the audio. Retry. |
5002 | Backend recognition server failed to recognize the audio. Retry. |
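Only codes 5000 to 5002 are marked "retry" in the table, so a client can separate them from the 4xxx codes, which point to problems with the request, the account, or the client connection. A minimal sketch follows; the backoff schedule and function names are assumptions, not documented behavior.

```python
RETRYABLE_CODES = frozenset({5000, 5001, 5002})   # marked "retry" in the table above


def should_retry(code: int) -> bool:
    """True for backend errors that the table says to retry."""
    return code in RETRYABLE_CODES


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff delay for the given retry attempt (0-based).

    The base and cap values are illustrative assumptions.
    """
    return min(cap, base * (2 ** attempt))
```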