Real-Time Speech Recognition (WebSocket)
Last updated: 2024-11-28 11:00:31
Note:
This interface is API version 2.0. It differs from API version 3.0 in parameter style, error codes, and other details.

Interface description

This API service uses the WebSocket protocol to recognize real-time audio streams and return recognition results synchronously, so that text appears as the user speaks. Before using this API, activate the service in the ASR console and go to the API Key Management page to create a key, which generates an AppID, SecretID, and SecretKey. These are used to generate the signature that authenticates each API call.

Interface requirements

To integrate the real-time ASR API, follow these requirements.
Language types: Mandarin, Cantonese, English, Korean, Japanese, Thai, Indonesian, Malay, Arabic, etc. Set the language via the engine_model_type request parameter.
Supported industries: General, finance, gaming, education, medical.
Audio properties: Sampling rate 16000 Hz or 8000 Hz; sampling accuracy 16 bits; mono.
Audio formats: pcm, wav, opus, speex, silk, mp3, m4a, aac.
Request protocol: wss.
Request address: wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
Interface authentication: Signature-based authentication; see Signature generation.
Response format: Unified JSON format.
Data transmission: It is recommended to send a packet containing 40 ms of audio every 40 ms (a 1:1 real-time rate). For PCM, the packet size is 640 bytes at an 8k sampling rate and 1280 bytes at a 16k sampling rate. Sending audio faster than the 1:1 real-time rate, or letting the interval between packets exceed 6 seconds, may cause engine errors; the backend will return an error and close the connection.
Concurrency limit: By default, a single account is limited to 20 concurrent connections. To raise this limit, submit a ticket.
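The recommended 40 ms packet sizes above follow directly from the audio properties (16-bit mono samples, so 2 bytes per sample). A minimal sketch of the arithmetic:

```python
def pcm_packet_bytes(sample_rate_hz: int, duration_ms: int = 40) -> int:
    """Bytes of 16-bit mono PCM covering duration_ms at the given sampling rate."""
    bytes_per_sample = 2  # 16-bit sampling accuracy
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bytes_per_sample

# 8k sampling rate -> 640 bytes per 40 ms packet, 16k -> 1280 bytes
print(pcm_packet_bytes(8000))   # 640
print(pcm_packet_bytes(16000))  # 1280
```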

API call process

The interface call process has two stages: the handshake stage and the recognition stage. In both stages the backend returns text messages as JSON-serialized strings with the following format:
code (Integer): Status code. 0 means success; non-zero values indicate an error.
message (String): Error description giving the specific reason for the error. This text may change frequently as the service evolves.
voice_id (String): Unique audio stream ID, generated by the client during the handshake phase and passed in the call parameters.
message_id (String): Unique message ID.
result (Result): Latest recognition result.
final (Integer): When this field is 1, recognition of the audio stream is complete.
The recognition result uses the Result structure:
slice_type (Integer): Recognition result type:
0: a new paragraph has started;
1: paragraph in progress; voice_text_str is a non-steady-state result (the paragraph's recognition result may still change);
2: paragraph complete; voice_text_str is a steady-state result (the paragraph's recognition result will no longer change).
Depending on the audio, the slice_type sequences that may be returned are:
0-1-2: a new paragraph starts, recognition is in progress (1 may be returned multiple times), recognition completes;
0-2: a new paragraph starts, recognition completes;
2: the complete result of a paragraph is returned directly.
index (Integer): Sequence number of the current paragraph within the audio stream, incrementing from 0.
start_time (Integer): Start time of the current paragraph within the audio stream, in milliseconds.
end_time (Integer): End time of the current paragraph within the audio stream, in milliseconds.
voice_text_str (String): Text of the current paragraph, UTF-8 encoded.
word_size (Integer): Number of words in the current paragraph.
word_list (Word array): Word list for the current paragraph. The Word structure is:
word (String): the word's content;
start_time (Integer): start time of the word within the audio stream;
end_time (Integer): end time of the word within the audio stream;
stable_flag (Integer): whether the word is steady-state. 0: the word may change in subsequent results; 1: the word will not change.

Handshake phase

Request format

In the handshake phase, the client initiates a WebSocket connection request. The request URL format is:
wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
Replace <appid> with the AppID of your Tencent Cloud account, available on the API Key Management page. {request parameters} has the format:
key1=value1&key2=value2... (both keys and values must be URL-encoded)
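The query string can be built by URL-encoding each key and value; in Python, urllib.parse.urlencode handles both the encoding and the key=value&... concatenation. A sketch with hypothetical parameter values for illustration:

```python
from urllib.parse import urlencode

# Hypothetical parameter values; a real request needs all required parameters
params = {
    "secretid": "AKIDexample",
    "timestamp": 1592294092,
    "engine_model_type": "16k_zh",
    "voice_id": "RnKu9FODFHK5FPpsrN",
}
# Sorting the items also matches the order required for signature generation
query = urlencode(sorted(params.items()))
print(query)
# engine_model_type=16k_zh&secretid=AKIDexample&timestamp=1592294092&voice_id=RnKu9FODFHK5FPpsrN
```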
Parameter description:
secretid (String, required): SecretId of the Tencent Cloud account; it can be obtained from the API Key Management page.
timestamp (Integer, required): Current UNIX timestamp, in seconds. If it differs too much from the current time, a signature-expiration error occurs.
expired (Integer, required): UNIX timestamp (in seconds) at which the signature expires. expired must be greater than timestamp, and expired - timestamp must be less than 90 days.
nonce (Integer, required): Random positive integer of up to 10 digits, generated by the caller.
engine_model_type (String, required): Engine model type.
Telephone scenarios:
• 8k_zh: Chinese telephone, general;
• 8k_en: English telephone, general;
Non-telephone scenarios:
• 16k_zh_large: general large-model engine [large model version]. This model recognizes Chinese, English, various Chinese dialects, etc.; it has a large parameter count, stronger language-model performance, and greatly improved accuracy on low-quality audio such as high noise, heavy echo, and quiet or distant voices;
• 16k_zh: Mandarin Chinese, general;
• 16k_yue: Cantonese;
• 16k_zh-TW: Traditional Chinese;
• 16k_ar: Arabic;
• 16k_en: English;
• 16k_ko: Korean;
• 16k_ja: Japanese;
• 16k_th: Thai;
• 16k_id: Indonesian;
• 16k_ms: Malay;
voice_id (String, required): 16-character string uniquely identifying each audio stream, generated by the user.
voice_format (Integer, optional): Audio encoding. Default is 4. 1: pcm; 4: speex(sp); 6: silk; 8: mp3; 10: opus (see the Opus audio stream encapsulation instructions); 12: wav; 14: m4a (each fragment must be a complete m4a audio); 16: aac.
needvad (Integer, optional): 0: disable VAD; 1: enable VAD. If an audio segment can be longer than 60 seconds, VAD (voice activity detection) must be enabled.
hotword_id (String, optional): Hotword list ID. If this parameter is not set, the default hotword list takes effect; if it is set, the specified hotword list takes effect.
reinforce_hotword (Integer, optional): Hotword enhancement. Default 0. 0: disabled; 1: enabled. When enabled (supported only for 8k_zh and 16k_zh), homophone replacement is activated: homophones of the words in the hotword list are replaced with the hotword. For example, with the hotword "蜜制" and enhancement enabled, results with the same pronunciation (mizhi), such as "秘制" or "蜜汁", are forcibly replaced with "蜜制". Enable this feature only if it suits your actual use case.
customization_id (String, optional): Self-learning model ID. If this parameter is not set, the most recently published self-learning model takes effect; if it is set, the specified model takes effect.
filter_dirty (Integer, optional): Whether to filter profanity (currently supported by Mandarin engines). Default 0. 0: do not filter; 1: filter; 2: replace profanity with " * ".
filter_modal (Integer, optional): Whether to filter modal particles (currently supported by Mandarin engines). Default 0. 0: do not filter; 1: partially filter; 2: strictly filter.
filter_punc (Integer, optional): Whether to filter sentence-ending periods (currently supported by Mandarin engines). Default 0. 0: do not filter; 1: filter.
filter_empty_result (Integer, optional): Whether to call back empty results. Default 1. 0: call back empty results; 1: do not call back empty results. Note: to receive paired slice_type=0 and slice_type=2 callbacks, set filter_empty_result=0. Paired returns are typically needed in outbound-call scenarios, where slice_type=0 indicates the presence of a human voice.
convert_num_mode (Integer, optional): Whether to intelligently convert numbers to Arabic numerals (currently supported by Mandarin engines). Default 1. 0: do not convert, output Chinese numerals directly; 1: convert intelligently according to the scenario; 3: enable math-related digit conversion.
word_info (Integer, optional): Whether to return word-level timestamps. Default 0. 0: do not return; 1: return, without punctuation timestamps; 2: return, with punctuation timestamps. Supported engines: 8k_en, 8k_zh, 8k_zh_finance, 16k_zh, 16k_en, 16k_ca, 16k_zh-TW, 16k_ja, 16k_wuu-SH.
vad_silence_time (Integer, optional): Speech segmentation threshold in milliseconds; silence lasting longer than this is treated as a break. Mainly used in intelligent customer-service scenarios and must be used together with needvad=1. Range: 240-2000. Do not adjust this parameter casually, as it may affect recognition results. Currently supported only by the 8k_zh, 8k_zh_finance, and 16k_zh engines.
max_speak_time (Integer, optional): Forced segmentation, range 5000-90000 (unit: milliseconds); default 0 (disabled). When speech continues without pause, this parameter forces segmentation (the result becomes steady-state, slice_type=2). For example, in a game-commentary scenario where the commentator speaks continuously and the speech cannot be segmented, setting this parameter to 10000 produces a slice_type=2 callback every 10 seconds.
noise_threshold (Float, optional): Noise threshold. Default 0, range [-1, 1]. The higher the value, the more likely an audio segment is judged to be noise; the lower the value, the more likely it is judged to be speech. Use with caution: it may affect recognition accuracy.
signature (String, required): API signature; see Signature generation.
hotword_list (String, optional): Temporary hotword list, used to improve recognition accuracy.
Single hotword format: "hotword|weight". Each hotword may be at most 30 characters (at most 10 Chinese characters); the weight ranges from 1 to 11, e.g., "Tencent Cloud|5" or "ASR|11".
List format: multiple hotwords separated by commas, at most 128 hotwords, e.g., "Tencent Cloud|10,ASR|5,ASR|11".
Difference between hotword_id (hotword list) and hotword_list (temporary hotword list):
hotword_id: create a hotword list in the console or via the API, then pass its ID to use the hotword feature.
hotword_list: pass a temporary hotword list directly with each request; the cloud does not retain it. Suitable for users with a large demand for hotwords.
Note:
If both hotword_id and hotword_list are passed, hotword_list takes precedence.
A hotword with weight 11 is upgraded to a super hotword. Set only important, must-take-effect hotwords to 11; too many weight-11 hotwords reduce the overall word accuracy rate.
input_sample_rate (Integer, optional): Upsamples 8k pcm audio to 16k when the sampling rate does not match the engine, effectively improving recognition accuracy. Only supported value: 8000. For example, passing 8000 indicates that the pcm audio is sampled at 8k; with the 16k_zh engine selected, the 8k pcm audio can then be recognized normally. Note: this parameter applies only to pcm audio. If it is omitted, the default behavior is kept, i.e., the engine sampling rate is assumed to equal the pcm audio sampling rate.

Signature generation

1. Sort all parameters except signature in dictionary order by key and concatenate them with the request host and path to form the original signature text. Using Appid=1259228442 and SecretId=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r as an example, the concatenated original signature text is:
asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
2. Encrypt the original signature text with HMAC-SHA1 using SecretKey as the key, then Base64-encode the result. For the original signature text above, with SecretKey=kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni, the processing is:
Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
The obtained signature value is:
HepdTRX6u155qIPKNKC+3U0j1N0=
3. URL-encode the signature value (URL encoding is required; otherwise occasional authentication failures may occur) and append it to obtain the final request URL:
wss://asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
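The three steps above can be sketched in Python with the standard library; running it on the example values reproduces the documented signature:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign(host_path: str, params: dict, secret_key: str) -> str:
    # Step 1: sort all parameters except signature and build the original text
    text = host_path + "?" + "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    # Step 2: HMAC-SHA1 with SecretKey as the key, then Base64-encode
    digest = hmac.new(secret_key.encode(), text.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Example values from the steps above
params = {
    "engine_model_type": "16k_zh", "expired": 1592380492, "filter_dirty": 1,
    "filter_modal": 1, "filter_punc": 1, "needvad": 1, "nonce": 1592294092123,
    "secretid": "AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r",
    "timestamp": 1592294092, "voice_format": 1, "voice_id": "RnKu9FODFHK5FPpsrN",
}
signature = sign("asr.cloud.tencent.com/asr/v2/1259228442", params,
                 "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni")
print(signature)                  # HepdTRX6u155qIPKNKC+3U0j1N0=
# Step 3: URL-encode the signature before appending it to the URL
print(quote(signature, safe=""))  # HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
```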

Opus audio stream encapsulation instructions

The compressed FrameSize is fixed at 640: 640 16-bit samples (shorts) are compressed at a time; otherwise decompression will fail. The data sent to the server may combine multiple frames, each of which must follow the format below. Each frame of compressed data is encapsulated as follows:
OpusHead (4 bytes): the fixed string "opus"
Frame data length (2 bytes): the length len of the compressed frame data
Frame data: the Opus-encoded data of length len
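A sketch of the per-frame encapsulation. The byte order of the 2-byte length field is not stated above; little-endian is assumed here, so verify against your SDK:

```python
import struct

def pack_opus_frame(opus_data: bytes) -> bytes:
    """Wrap one Opus-compressed frame: 4-byte 'opus' head + 2-byte length + data."""
    # "<H" packs the length as an unsigned 16-bit little-endian integer (assumption)
    return b"opus" + struct.pack("<H", len(opus_data)) + opus_data

# Several packed frames may be concatenated into one binary message
frame = pack_opus_frame(b"\x01\x02\x03")
print(frame)  # b'opus\x03\x00\x01\x02\x03'
```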

Request Response

After the client initiates a connection request, the backend establishes the connection and verifies the signature. If verification succeeds, a confirmation message with code 0 is returned, indicating a successful handshake. If verification fails, the backend returns a message with a non-zero code and disconnects.
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}

Recognition Stage

After the handshake succeeds, the session enters the recognition stage: the client uploads audio data and receives recognition result messages.

Upload Data

During recognition, the client continuously uploads binary messages containing the audio stream to the backend. It is recommended to send a packet containing 40 ms of audio every 40 ms (a 1:1 real-time rate); for PCM this corresponds to 640 bytes at an 8k sampling rate and 1280 bytes at a 16k sampling rate. Sending audio faster than the 1:1 real-time rate, or letting the interval between packets exceed 6 seconds, may cause an engine error; the backend will return an error and close the connection. After the audio stream has been fully uploaded, the client must send a text message with the following content to tell the backend to end recognition:
{"type": "end"}

Receiving messages

While uploading data, the client must simultaneously receive the real-time recognition results returned by the backend. Example results:
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real-time","word_size":0,"word_list":[]}}
{"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time ASR","word_size":0,"word_list":[]}}
After the backend has recognized all uploaded audio, it returns a message with final set to 1 and closes the connection.
{"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
If an error occurs during recognition, the backend returns a message with a non-zero code and closes the connection.
{"code":4008,"message":"background recognition server audio shard wait timeout","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}

Developer Resources

SDK

SDK Invocation Example

Error code

4001: Invalid parameter; see message for details
4002: Authentication failed
4003: AppID service not activated; activate the service in the console
4004: No available free quota
4005: Account in arrears; service stopped, please top up promptly
4006: The account's concurrent-connection limit was exceeded
4007: Audio decoding failed; check that the uploaded audio format matches the call parameters
4008: Client data upload timeout
4009: Client connection closed
4010: Client uploaded an unknown text message
5000: Backend error; please retry
5001: Backend recognition server failed to recognize; please retry
5002: Backend recognition server failed to recognize; please retry