Tencent Cloud

Real-Time Speech Recognition (WebSocket)
Last updated: 2025-07-29 14:46:16
Note: This API is version 2.0; its parameter style, error codes, and other conventions differ from version 3.0.

API Description

This service uses the WebSocket protocol to recognize real-time audio streams and return recognition results synchronously, providing real-time speech-to-text.
Before using this API, activate the service in the Speech Recognition console, then go to the API key management page to create a key and obtain the appid, secretid, and secretkey used to generate the signature for each API call. The signature is used for API authentication.

API Requirements

When integrating the real-time speech recognition API, observe the following requirements.
Language type: Mandarin, Cantonese, English, Korean, Japanese, Thai, Indonesian, Malay, and Arabic. Set the language through the API parameter engine_model_type.
Supported industries: common, finance, gaming, education, health care.
Audio attributes: sampling rate 16000 Hz or 8000 Hz; sampling accuracy 16 bits; channels: mono.
Audio formats: pcm, wav, opus, speex, silk, mp3, m4a, aac.
Request protocol: wss.
Request URL: wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
API authentication: signature-based authentication; for details, see Signature Generation.
Response format: JSON throughout.
Data transmission: it is recommended to send one 40 ms packet every 40 ms (a 1:1 real-time rate), i.e. 640 bytes of pcm at an 8k sampling rate or 1280 bytes at a 16k sampling rate. Sending audio faster than the 1:1 real-time rate, or letting the interval between packets exceed 6 seconds, may cause an engine error; the backend will return an error and actively disconnect.
Concurrency limit: 20 concurrent connections per account by default. To raise the limit, submit a ticket.
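The recommended packet sizes follow directly from the audio attributes: 16-bit (2-byte) mono samples over a 40 ms window. The arithmetic can be sketched as follows (a purely illustrative helper, not part of the API):

```python
def pcm_chunk_bytes(sample_rate_hz, duration_ms=40, sample_bytes=2, channels=1):
    """Bytes of pcm audio per packet: rate x bytes/sample x channels x duration."""
    return sample_rate_hz * sample_bytes * channels * duration_ms // 1000

# 8k sampling rate -> 640 bytes; 16k sampling rate -> 1280 bytes
```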

API Call Process

The API call flow has two phases: a handshake phase and a recognition phase. In both phases the backend returns text messages whose content is a JSON-serialized string in the following format:
code (Integer): status code; 0 indicates success, a non-zero value indicates an error.
message (String): error description; when an error occurs, explains the cause. This text may be updated frequently as the service evolves.
voice_id (String): unique audio stream id, generated by the client during the handshake phase and passed as an API call parameter.
message_id (String): unique message id.
result (Result): latest speech recognition result.
final (Integer): when this field is 1, recognition of the audio stream is complete.
The recognition Result is a struct with the following fields:
slice_type (Integer): recognition result type.
0: start of a sentence.
1: sentence recognition in progress; voice_text_str is an unstable result (it may still change).
2: end of a sentence; voice_text_str is a steady-state result (it no longer changes).
While audio is being sent, the slice_type sequences that may be returned are:
0-1-2: sentence start, recognition in progress (1 may be returned multiple times), recognition complete.
0-2: sentence start, recognition complete.
2: the complete recognition result of a segment is returned directly.
index (Integer): sequence number of the current sentence in the audio stream, starting from 0 and incrementing sentence by sentence.
start_time (Integer): start time of the current sentence in the audio stream.
end_time (Integer): end time of the current sentence in the audio stream.
voice_text_str (String): text of the current segment, UTF-8 encoded.
word_size (Integer): number of word results in the current segment.
word_list (Word Array): word list of the current sentence. Word structure:
word (String): content of the word.
start_time (Integer): start time of the word in the audio stream.
end_time (Integer): end time of the word in the audio stream.
stable_flag (Integer): whether the word is stable; 0 means the word may change in subsequent recognition, 1 means it will not.

Handshake Phase

Request Format

In the handshake phase, the client initiates a WebSocket connection request with a URL of the form:
wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
Replace <appid> with the appid of your Tencent Cloud account, available on the API key management page. The format of {request parameters} is:
key1=value1&key2=value2... (URL-encode both keys and values)
Parameter description:
secretid (String, required): the SecretId of your Tencent Cloud account, available on the API key management page.
timestamp (Integer, required): current UNIX timestamp, in seconds. If it differs too much from the server time, a signature expiration error occurs.
expired (Integer, required): UNIX timestamp (in seconds) at which the signature expires. expired must be greater than timestamp, and expired - timestamp must be less than 90 days.
nonce (Integer, required): random positive integer generated by the user, up to 10 digits.
engine_model_type (String, required): engine model type.
Phone call scenarios:
8k_zh: Chinese telephone, general
8k_en: English telephone, general
Non-phone scenarios:
16k_zh_large: large-model engine for Mandarin, Chinese dialects, and English [large model version]. This model recognizes Chinese, English, and multiple Chinese dialects, has a large parameter count, and features enhanced language-model performance. It greatly improves recognition accuracy on low-quality audio such as loud noise, strong echo, low volume, and distant speech.
16k_zh: Mandarin, general
16k_yue: Cantonese
16k_zh-TW: Chinese (Traditional)
16k_ar: Arabic
16k_en: English
16k_ko: Korean
16k_ja: Japanese
16k_th: Thai
16k_id: Indonesian
16k_ms: Malay
voice_id (String, required): unique identifier for each audio stream; a 16-character string generated by the user.
voice_format (Integer, optional): audio encoding format; default 4. 1: pcm; 4: speex(sp); 6: silk; 8: mp3; 10: opus (see Opus Audio Stream Packaging Instructions); 12: wav; 14: m4a (each segment must be a complete m4a audio); 16: aac.
needvad (Integer, optional): 0: disable vad; 1: enable vad. If a voice segment exceeds 60 seconds, enable vad (voice activity detection and segmentation).
hotword_id (String, optional): hotword table id. If not set, the default hotword table takes effect automatically; if set, the specified hotword table takes effect.
reinforce_hotword (Integer, optional): enhanced hotword feature; default 0. 0: disabled; 1: enabled. When enabled (supported only for 8k_zh and 16k_zh), homophone replacement is applied to configured hotwords. For example: after the hotword "蜜制" is set and enhancement is enabled, recognition results with the same pronunciation (mizhi), such as "秘制" and "蜜汁", are forcibly replaced with "蜜制". Enable this feature based on your actual situation.
customization_id (String, optional): self-learning model id. If not set, the most recently launched self-learning model takes effect automatically; if set, the specified self-learning model takes effect.
filter_dirty (Integer, optional): whether to filter profanity (currently supported by the Mandarin engine); default 0. 0: do not filter; 1: filter; 2: replace profanity with "*".
filter_modal (Integer, optional): whether to filter modal particles (currently supported by the Mandarin engine); default 0. 0: do not filter; 1: partial filtering; 2: strict filtering.
filter_punc (Integer, optional): whether to filter the period at the end of a sentence (currently supported by the Mandarin engine); default 0. 0: do not filter; 1: filter.
filter_empty_result (Integer, optional): whether to call back empty recognition results; default 1. 0: call back empty results; 1: do not call back empty results. Note: if you need paired slice_type=0 and slice_type=2 callbacks, set filter_empty_result=0. Paired returns are typically needed in outbound-call scenarios, where slice_type=0 indicates that speech has started.
convert_num_mode (Integer, optional): whether to intelligently convert numbers to Arabic numerals (currently supported by the Mandarin engine); default 1. 0: do not convert, output Chinese numbers directly; 1: intelligently convert to Arabic numerals based on context; 3: enable math-related number conversion.
word_info (Integer, optional): whether to return word-level timestamps; default 0. 0: do not return; 1: return, excluding punctuation timestamps; 2: return, including punctuation timestamps. Supported engines: 8k_en, 8k_zh, 8k_zh_finance, 16k_zh, 16k_en, 16k_ca, 16k_zh-TW, 16k_ja, 16k_wuu-SH.
vad_silence_time (Integer, optional): voice segmentation detection threshold. Silence longer than the threshold is treated as a sentence break (commonly used in customer service scenarios; must be used with needvad=1). Range: 240-2000 ms. Do not adjust this parameter arbitrarily, as it may affect recognition performance. Currently supported only by the 8k_zh, 8k_zh_finance, and 16k_zh engine models.
max_speak_time (Integer, optional): forced segmentation; range 5000-90000 ms; default 0 (disabled). During continuous uninterrupted speech, this parameter forces a segment break (the result then becomes steady-state, slice_type=2). For example, in gaming commentary scenarios where the commentator talks continuously and sentence segmentation cannot occur, setting this parameter to 10000 yields a slice_type=2 callback every 10 seconds.
noise_threshold (Float, optional): noise threshold; default 0; range -1 to 1. The larger the value, the more likely an audio clip is classified as noise; the smaller the value, the more likely it is classified as speech. Use with caution: may affect recognition accuracy.
signature (String, required): API signature parameter; see Signature Generation.
hotword_list (String, optional): temporary hotword list, used to improve recognition accuracy.
Single hotword format: "hotword|weight"; each hotword is at most 30 characters (at most 10 Chinese characters), and the weight ranges from 1 to 11, for example "Tencent Cloud|5" or "ASR|11".
List format: multiple hotwords separated by commas, up to 128 hotwords, for example "Tencent Cloud|10,speech recognition|5,ASR|11".
hotword_id (hotword table) differs from hotword_list (temporary hotword list):
hotword_id: hotword table. You must first create a hotword table in the console or via API, then pass the resulting hotword_id as a parameter to use the hotword feature.
hotword_list: temporary hotword list. Pass the list directly with each request; it is not stored in the cloud. Suitable for users with a massive number of hotwords.

Note:
If both hotword_id and hotword_list are provided, hotword_list takes precedence.
When a hotword's weight is set to 11, it is upgraded to a super hotword. Only set important, must-take-effect hotwords to 11; setting too many weight-11 hotwords will reduce overall accuracy.
input_sample_rate (Integer, optional): when the pcm audio sampling rate does not match the engine, 8k audio can be upsampled to 16k for recognition, effectively improving accuracy. Only the value 8000 is supported: passing 8000 declares that the pcm audio sampling rate is 8k, so that with the 16k_zh engine selected, the 8k pcm audio can still be recognized normally.

Note: this parameter applies only to pcm audio. If omitted, the default behavior is used: the engine sampling rate is assumed to equal the pcm audio sampling rate.
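The hotword_list constraints described above (entries of the form "hotword|weight", weight 1-11, at most 128 entries, each hotword at most 30 characters) can be checked client-side before sending a request. This is an illustrative helper of our own, not part of the API:

```python
def validate_hotword_list(hotword_list):
    """Check a temporary hotword list string against the documented limits:
    entries "hotword|weight", weight in 1..11, at most 128 entries,
    each hotword at most 30 characters."""
    entries = hotword_list.split(",")
    if len(entries) > 128:
        raise ValueError("at most 128 hotwords are supported")
    for entry in entries:
        word, sep, weight = entry.strip().partition("|")
        if not sep or not weight.isdigit():
            raise ValueError(f"bad entry: {entry!r}")
        if not 1 <= int(weight) <= 11:
            raise ValueError(f"weight out of range in {entry!r}")
        if len(word) > 30:
            raise ValueError(f"hotword too long: {word!r}")
    return hotword_list
```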

Signature Generation

1. Sort all parameters except signature alphabetically and concatenate them onto the request URL to form the signature plaintext. Using appid=125922*** and secretid=*****Qq1zhZMN8dv0***** as an example, the concatenated signature plaintext is:
asr.cloud.tencent.com/asr/v2/125922**?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
2. Encrypt the signature plaintext with the secretkey using HMAC-SHA1, then Base64-encode the result. For the plaintext from the previous step and secretkey=*****SkqpeHgqmSz*****:
Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/125922**?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
The obtained signature value is:
HepdTRX6u155qIPKNKC+3U0j1N0=
3. URL-encode the signature value (this is mandatory: special characters such as "+" and "=" must be percent-encoded, otherwise occasional authentication failures may occur), then concatenate to obtain the final request URL:
wss://asr.cloud.tencent.com/asr/v2/125922***?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=*****Qq1zhZMN8dv0*****&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
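The three steps above can be sketched in Python. The appid and secretid values used below are placeholders (the document's real example values are elided), and the helper name is our own:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def build_signed_url(appid, secretkey, params):
    """Build the signed wss request URL following the three steps above.

    `params` must contain every request parameter except `signature`
    (secretid, timestamp, expired, nonce, engine_model_type, voice_id, ...).
    Values here are assumed to need no URL-encoding; in general, encode them.
    """
    # Step 1: sort parameters alphabetically and concatenate onto the
    # host + path (no "wss://" scheme) to form the signature plaintext.
    query = "&".join(f"{k}={params[k]}" for k in sorted(params))
    plaintext = f"asr.cloud.tencent.com/asr/v2/{appid}?{query}"
    # Step 2: HMAC-SHA1 with the secretkey, then Base64-encode.
    digest = hmac.new(secretkey.encode(), plaintext.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    # Step 3: URL-encode the signature ("+" -> %2B, "=" -> %3D) and append it.
    return f"wss://{plaintext}&signature={quote(signature, safe='')}"
```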

Opus Audio Stream Packaging Instructions

The Opus encoder must use a fixed frame size of 640 samples (i.e. compress 640 shorts at a time), otherwise decompression will fail. Multiple frames may be spliced together into a single message sent to the server, but each frame must follow the format below. Each compressed frame is encapsulated as:
OpusHead (4 bytes): the fixed string "opus"
Frame data length (2 bytes): length of the Opus compressed data of this frame
Opus frame compressed data: the Opus-encoded frame
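Framing per this layout can be sketched as follows. The Opus encoding itself is out of scope here, and the byte order of the 2-byte length field is not stated in this document, so big-endian is an assumption:

```python
import struct

FRAME_SAMPLES = 640  # fixed Opus frame size required by the service

def pack_opus_frames(encoded_frames):
    """Wrap already-encoded Opus frames in the 'opus' + length header format.

    `encoded_frames` is a list of bytes objects, each assumed to be the Opus
    encoding of exactly 640 16-bit samples (produced by any Opus encoder).
    """
    out = bytearray()
    for frame in encoded_frames:
        out += b"opus"                        # OpusHead, 4 bytes
        out += struct.pack(">H", len(frame))  # frame data length, 2 bytes (endianness assumed)
        out += frame                          # Opus compressed data
    return bytes(out)
```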

Request Response

After the client initiates a connection request, the backend establishes the connection and verifies the signature. If verification succeeds, it returns an ACK message with code 0, indicating a successful handshake. If verification fails, the backend returns a message with a non-zero code and disconnects.
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}

Recognition Phase

After a successful handshake, the recognition phase begins: the client uploads speech data and receives recognition results.

Upload Data

During recognition, the client continuously uploads binary messages to the backend containing the audio stream data. It is recommended to send one 40 ms packet every 40 ms (a 1:1 real-time rate), i.e. 640 bytes of pcm at an 8k sampling rate or 1280 bytes at a 16k sampling rate. Sending audio faster than the 1:1 real-time rate, or letting the interval between packets exceed 6 seconds, may cause an engine error; the backend will return an error and actively disconnect. After the audio stream has been fully uploaded, the client must send a text message with the following content to tell the backend to end recognition:
{"type": "end"}
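A minimal upload loop under these rules might look like the sketch below; `send` stands in for any WebSocket client's send method and is not part of the Tencent API:

```python
import json
import time

CHUNK_MS = 40  # recommended packet duration

def stream_pcm(pcm_bytes, sample_rate, send, realtime=True):
    """Send pcm audio in 40 ms packets at a 1:1 real-time rate, then the
    end-of-stream text message. `send` is any callable that transmits one
    WebSocket message (binary for audio, text for the end marker)."""
    bytes_per_chunk = sample_rate * 2 * CHUNK_MS // 1000  # 16-bit mono pcm
    for i in range(0, len(pcm_bytes), bytes_per_chunk):
        send(pcm_bytes[i:i + bytes_per_chunk])  # binary audio message
        if realtime:
            time.sleep(CHUNK_MS / 1000)  # keep the 1:1 real-time rate
    send(json.dumps({"type": "end"}))  # notify the backend to end recognition
```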

Message Reception

While uploading, the client receives real-time recognition results from the backend. Example:
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real time","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time speech recognition","word_size":0,"word_list":[]}}
After the backend has recognized all uploaded speech data, it returns a final message with final set to 1 and disconnects:
{"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
If an error occurs during the recognition process, the backend returns a message with a non-zero code and disconnects.
{"code":4008,"message":"Background recognition server audio fragment waiting timeout","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}
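One way to dispatch on these messages, using only the fields documented above (the `(kind, payload)` return shape is a convenience of this sketch, not part of the API):

```python
import json

def handle_message(raw):
    """Classify one backend text message per the fields described above."""
    msg = json.loads(raw)
    if msg["code"] != 0:
        return ("error", msg["message"])       # backend will disconnect
    if msg.get("final") == 1:
        return ("final", msg["voice_id"])      # stream fully recognized
    result = msg.get("result")
    if result and result["slice_type"] == 2:
        return ("sentence", result["voice_text_str"])  # steady-state text
    if result:
        return ("partial", result["voice_text_str"])   # may still change
    return ("ack", msg["voice_id"])            # e.g. handshake ACK
```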

Developer Resources

SDK

Example of SDK Call

Error Code

4001: invalid parameters; see the message field for details.
4002: authentication failure.
4003: service not activated; activate it in the console.
4004: no free quota remaining.
4005: service suspended due to account arrears; please top up promptly.
4006: the account's concurrency limit has been reached.
4007: audio decoding failed; check that the uploaded audio format matches the API call parameters.
4008: client data upload timeout.
4009: client connection disconnected.
4010: client uploaded an unknown text message.
5000: backend error; please retry.
5001: backend recognition server failed to recognize; please retry.
5002: backend recognition server failed to recognize; please retry.