Real-Time Speech Recognition (WebSocket)

Last updated: 2024-11-28 11:00:31
    Note:
    This interface is the API 2.0 version. It differs from the API 3.0 version in parameter style, error codes, and other details. Please be aware.

    Interface description

    This API uses the websocket protocol to recognize real-time audio streams and return recognition results synchronously, achieving a "speak and text appears" effect. Before calling it, you need to activate the service in the ASR console and create a key on the API Key Management page to obtain the AppID, SecretID, and SecretKey used to generate the signature that authenticates each call.

    Interface requirements

    To integrate the real-time ASR API, follow these requirements.
    Language types: Mandarin, Cantonese, English, Korean, Japanese, Thai, Indonesian, Malay, Arabic, etc. The language can be set via the interface parameter engine_model_type.
    Supported industries: General, Finance, Gaming, Education, Medical.
    Audio properties: sampling rate 16000 Hz or 8000 Hz; sampling precision 16 bits; mono.
    Audio formats: pcm, wav, opus, speex, silk, mp3, m4a, aac.
    Request protocol: wss.
    Request address: wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
    Interface authentication: signature authentication mechanism; see Signature generation.
    Response format: unified JSON format.
    Data transmission: it is recommended to send a 40 ms data packet every 40 ms (i.e., a 1:1 real-time rate). For PCM, that is 640 bytes at an 8k sampling rate and 1280 bytes at a 16k sampling rate. Sending audio faster than the 1:1 real-time rate, or letting the interval between packets exceed 6 seconds, may cause engine errors; the backend will return an error and proactively close the connection.
    Concurrency limit: by default, a single account is limited to 20 concurrent connections. To raise the limit, submit a ticket for consultation.

    API call process

    The interface call process is divided into two stages: the handshake stage and the recognition stage. In both stages, the backend returns text messages as JSON-serialized strings with the following format:
    code (Integer): status code. 0 means success; a non-zero value indicates an error occurred.
    message (String): error description explaining the specific reason for the error. This text may be updated or changed as the business develops or to improve the user experience.
    voice_id (String): unique audio stream ID, generated by the client during the handshake phase and passed in the call parameters.
    message_id (String): unique message ID.
    result (Result): latest recognition result.
    final (Integer): when this field returns 1, recognition of the audio stream is complete.
    The recognition result is in Result structure format:
    slice_type (Integer): recognition result type.
    0: start of a new paragraph
    1: paragraph recognition in progress; voice_text_str is a non-steady-state result (the recognition result of this paragraph may still change)
    2: paragraph recognition complete; voice_text_str is a steady-state result (the recognition result of this paragraph will no longer change)
    Depending on the audio sent, the slice_type sequences that may be returned during recognition are:
    0-1-2: start of a new paragraph, recognition in progress (1 may be returned multiple times), recognition complete
    0-2: start of a new paragraph, recognition complete
    2: the complete recognition result of a paragraph is returned directly
    index (Integer): sequence number of the current paragraph result within the entire audio stream, incrementing from 0 per sentence.
    start_time (Integer): start time of the current paragraph result within the entire audio stream.
    end_time (Integer): end time of the current paragraph result within the entire audio stream.
    voice_text_str (String): text of the current paragraph, UTF-8 encoded.
    word_size (Integer): number of words in the current paragraph result.
    word_list (Word Array): word list of the current paragraph. The Word structure is:
    word (String): content of the word
    start_time (Integer): start time of the word within the entire audio stream
    end_time (Integer): end time of the word within the entire audio stream
    stable_flag (Integer): steady state of the word. 0: the word may change in subsequent recognition; 1: the word will not change in subsequent recognition
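    To illustrate how a client might consume these fields, here is a minimal Python sketch (a hypothetical helper, assuming the message has already been JSON-decoded into a dict):

    def handle_result(result: dict) -> None:
        # slice_type 2 is steady-state: the paragraph's text will no longer change.
        text = result["voice_text_str"]
        if result["slice_type"] == 2:
            print(f"[{result['start_time']}-{result['end_time']} ms] "
                  f"#{result['index']}: {text}")
        else:
            # slice_type 0 or 1: a partial result that may still be revised.
            print(f"(partial) {text}")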

    Handshake phase

    Request format

    In the handshake phase, the client initiates a websocket connection request. The request URL format is:
    wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameters}
    Replace <appid> with the AppID of your Tencent Cloud account, which can be found on the API Key Management page. The format of {request parameters} is:
    key1=value1&key2=value2... (both keys and values must be URL encoded)
    Parameter description:
    secretid (String, required): SecretId of the Tencent Cloud account; can be obtained from the API Key Management page.
    timestamp (Integer, required): current UNIX timestamp in seconds. If it differs too much from the current time, a signature expiration error occurs.
    expired (Integer, required): UNIX timestamp (in seconds) at which the signature expires. expired must be greater than timestamp, and expired - timestamp must be less than 90 days.
    nonce (Integer, required): random positive integer of at most 10 digits, generated by the caller.
    engine_model_type (String, required): engine model type.
    Telephone scenarios:
    • 8k_zh: Chinese telephone general;
    • 8k_en: English telephone general;
    Non-telephone scenarios:
    • 16k_zh_large: Mandarin general large-model engine [large model version]. This model supports recognition of Chinese, English, and various Chinese dialects. It has a large number of parameters and stronger language-model performance, and greatly improves recognition accuracy on low-quality audio such as high noise, heavy echo, quiet voices, and distant voices;
    • 16k_zh: Mandarin Chinese general;
    • 16k_yue: Cantonese;
    • 16k_zh-TW: Traditional Chinese;
    • 16k_ar: Arabic;
    • 16k_en: English;
    • 16k_ko: Korean;
    • 16k_ja: Japanese;
    • 16k_th: Thai;
    • 16k_id: Indonesian;
    • 16k_ms: Malay;
    voice_id (String, required): 16-character string that uniquely identifies each audio stream, generated by the caller.
    voice_format (Integer, optional): speech encoding method; default is 4. 1: pcm; 4: speex(sp); 6: silk; 8: mp3; 10: opus (see the Opus audio stream encapsulation instructions below); 12: wav; 14: m4a (each fragment must be a complete m4a audio); 16: aac.
    needvad (Integer, optional): 0: disable vad; 1: enable vad. If an audio segment is longer than 60 seconds, vad (voice activity detection) must be enabled.
    hotword_id (String, optional): hotword list ID. If this parameter is not set, the default hotword list takes effect automatically; if it is set, the corresponding hotword list takes effect.
    reinforce_hotword (Integer, optional): hotword enhancement feature; default is 0. 0: disabled; 1: enabled.
    When enabled (supported only for 8k_zh and 16k_zh), homophone replacement is activated, replacing homophones of words in the hotword list. For example, with the hotword "蜜制" set and enhancement enabled, recognition results with the same pronunciation (mizhi), such as "秘制" or "蜜汁", are forcibly replaced with "蜜制". It is therefore recommended to enable this feature based on your actual situation.
    customization_id (String, optional): self-learning model ID. If this parameter is not set, the most recently deployed self-learning model takes effect automatically; if it is set, the corresponding self-learning model takes effect.
    filter_dirty (Integer, optional): whether to filter profanity (currently supported by the Mandarin Chinese engine); default is 0. 0: do not filter; 1: filter; 2: replace profanity with "*".
    filter_modal (Integer, optional): whether to filter modal particles (currently supported by the Mandarin Chinese engine); default is 0. 0: do not filter; 1: partially filter; 2: strictly filter.
    filter_punc (Integer, optional): whether to filter sentence-ending periods (currently supported by the Mandarin Chinese engine); default is 0. 0: do not filter; 1: filter.
    filter_empty_result (Integer, optional): whether to call back empty results; default is 1. 0: call back empty results; 1: do not call back empty results.
    Note: if you need slice_type=0 and slice_type=2 callbacks to arrive in pairs, set filter_empty_result=0. Paired returns are typically required in outbound-call scenarios, where slice_type=0 is used to detect the presence of a human voice.
    convert_num_mode (Integer, optional): whether to intelligently convert numbers to Arabic numerals (currently supported by the Mandarin Chinese engine); default is 1. 0: do not convert, output Chinese numerals directly; 1: intelligently convert to Arabic numerals based on the scenario; 3: enable math-related digit conversion.
    word_info (Integer, optional): whether to display word-level timestamps; default is 0. 0: do not display; 1: display, without punctuation timestamps; 2: display, with punctuation timestamps. Supported engines: 8k_en, 8k_zh, 8k_zh_finance, 16k_zh, 16k_en, 16k_ca, 16k_zh-TW, 16k_ja, 16k_wuu-SH.
    vad_silence_time (Integer, optional): speech segmentation detection threshold; silence longer than this threshold is treated as a break (mainly for intelligent customer-service scenarios; must be used with needvad=1). The value range is 240-2000 ms. Adjusting this parameter is not recommended, as it may affect recognition results. Currently supported only by the 8k_zh, 8k_zh_finance, and 16k_zh engine models.
    max_speak_time (Integer, optional): forced segmentation feature; value range 5000-90000 (unit: milliseconds); default is 0 (disabled). When speech continues without interruption, this parameter forces segmentation (the result becomes stable, slice_type=2). For example, in a game commentary scenario where the commentator speaks continuously and the speech cannot be segmented, setting this parameter to 10000 yields a slice_type=2 callback every 10 seconds.
    noise_threshold (Float, optional): noise threshold; default is 0; range [-1, 1]. For a given audio segment, higher values make it more likely to be treated as noise, and lower values make it more likely to be treated as human voice. Use with caution: this may affect recognition accuracy.
    signature (String, required): API signature parameter.
    hotword_list (String, optional): temporary hotword list, used to improve recognition accuracy.
    Single hotword format: "hotword|weight". Each hotword may be at most 30 characters (at most 10 Chinese characters); the weight ranges from 1 to 11, e.g., "Tencent Cloud|5" or "ASR|11".
    Temporary hotword list limit: separate multiple hotwords with English commas; at most 128 hotwords are supported, e.g., "Tencent Cloud|10, ASR|5, ASR|11".
    Difference between hotword_id (hotword list) and hotword_list (temporary hotword list):
    hotword_id: hotword list. You must create a hotword list in the console or via the API to obtain a hotword_id before using the hotword feature;
    hotword_list: temporary hotword list. Pass a temporary hotword list directly with each request to use the hotword feature. The cloud does not retain temporary hotword lists; this suits users with a large demand for hotwords.
    
    Note:
    If both hotword_id and hotword_list are passed in, hotword_list takes precedence;
    When a hotword's weight is set to 11, it is upgraded to a super hotword. Set only important, must-take-effect hotwords to 11; too many weight-11 hotwords will reduce the overall word accuracy rate.
    input_sample_rate (Integer, optional): allows 8k pcm audio to be upsampled to 16k when the sampling rate does not match the engine, effectively improving recognition accuracy. Only the value 8000 is supported. For example, if 8000 is passed in, the pcm audio sampling rate is 8k; with the 16k_zh engine selected, the 8k pcm audio can still be recognized normally under the 16k_zh engine.
    Note: this parameter applies only to pcm audio. If no value is passed in, the default behavior is kept, i.e., the engine sampling rate is assumed to equal the pcm audio sampling rate.

    Signature generation

    1. Sort all parameters except signature in dictionary order and concatenate them with the request host and path to form the original signature text. Using Appid=1259228442 and SecretId=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r as an example, the concatenated original signature text is:
    asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
    2. Encrypt the original signature text with HmacSha1 using the SecretKey, then base64-encode the result. For example, taking the original signature text from step 1 and SecretKey=kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni, apply the HmacSha1 algorithm and base64 encoding:
    Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
    The obtained signature value is:
    HepdTRX6u155qIPKNKC+3U0j1N0=
    3. URL-encode the signature value (URL encoding is required; otherwise occasional authentication failures may occur) and concatenate it to obtain the final request URL:
    wss://asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
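    The three steps above can be expressed compactly in code. Below is a minimal Python sketch using the example AppID, SecretId, SecretKey, and parameter values from this page (none of which are real credentials):

    import base64
    import hashlib
    import hmac
    from urllib.parse import quote

    appid = "1259228442"
    secret_key = "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"
    params = {
        "secretid": "AKIDoQq1zhZMN8dv0psmvud6OUKuGPO7pu0r",
        "timestamp": 1592294092,
        "expired": 1592380492,
        "nonce": 1592294092123,
        "engine_model_type": "16k_zh",
        "voice_id": "RnKu9FODFHK5FPpsrN",
        "voice_format": 1,
        "needvad": 1,
        "filter_dirty": 1,
        "filter_modal": 1,
        "filter_punc": 1,
    }

    # Step 1: sort parameters in dictionary order and build the original signature text.
    query = "&".join(f"{k}={params[k]}" for k in sorted(params))
    sign_text = f"asr.cloud.tencent.com/asr/v2/{appid}?{query}"

    # Step 2: HmacSha1 with the SecretKey, then base64-encode.
    digest = hmac.new(secret_key.encode(), sign_text.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()  # "HepdTRX6u155qIPKNKC+3U0j1N0="

    # Step 3: URL-encode the signature and append it to obtain the final wss URL.
    url = f"wss://{sign_text}&signature={quote(signature, safe='')}"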

    Opus audio stream encapsulation instructions

    The compressed FrameSize is fixed at 640; that is, 640 shorts (16-bit samples) are compressed at a time, otherwise decompression will fail. The data sent to the server may be a combination of multiple frames, and each frame must follow the format below. Each frame of compressed data is encapsulated as follows:
    OpusHead (4 bytes): the fixed string "opus"
    Frame data length (2 bytes): the length len of the compressed frame
    Compressed data (len bytes): one Opus frame of compressed data
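
    A minimal Python sketch of this encapsulation, assuming each opus_packet already encodes exactly 640 samples; the byte order of the 2-byte length field is not stated above, so little-endian is an assumption here:

    import struct

    FRAME_SAMPLES = 640  # each packet must compress exactly 640 16-bit samples

    def pack_opus_frame(opus_packet: bytes) -> bytes:
        # 4-byte header "opus" + 2-byte frame length + the compressed frame itself.
        # Little-endian length is an assumption; the spec above only fixes the width.
        return b"opus" + struct.pack("<H", len(opus_packet)) + opus_packet

    # Multiple packed frames may be concatenated into one binary websocket message.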

    Request Response

    After the client initiates a connection request, the backend establishes the connection and verifies the signature. If verification succeeds, a confirmation message with code 0 is returned, indicating a successful handshake; if it fails, the backend returns a message with a non-zero code and closes the connection.
    {"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}

    Recognition Stage

    After a successful handshake, the recognition stage begins. The client uploads audio data and receives recognition result messages.

    Upload Data

    During recognition, the client continuously uploads binary messages containing the audio stream data to the backend. It is recommended to send a 40 ms data packet every 40 ms (i.e., a 1:1 real-time rate); for PCM this corresponds to 640 bytes at an 8k sampling rate and 1280 bytes at a 16k sampling rate. If audio is sent faster than the 1:1 real-time rate, or the interval between packets exceeds 6 seconds, an engine error may occur; the backend will return an error and proactively close the connection. After the audio stream upload is complete, the client must send a text message with the following content to tell the backend to end recognition.
    {"type": "end"}

    Receiving messages

    While uploading data, the client must simultaneously receive the real-time recognition results returned by the backend. Example results:
    {"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real-time","word_size":0,"word_list":[]}}
    {"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time ASR","word_size":0,"word_list":[]}}
    After the backend finishes recognizing all uploaded voice data, it returns a message with final set to 1 and closes the connection.
    {"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
    If an error occurs during recognition, the backend returns a message with a non-zero code and closes the connection.
    {"code":4008,"message":"background recognition server audio shard wait timeout","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}

    Developer Resources

    SDK

    SDK Invocation Example

    Error code

    4001: invalid parameter; see message for details
    4002: authentication failed
    4003: AppID service not activated; please activate the service in the console
    4004: no available free quota
    4005: account in arrears and service stopped; please top up promptly
    4006: the account's concurrent-call limit was exceeded
    4007: audio decoding failed; check that the uploaded audio data format matches the call parameters
    4008: client data upload timeout
    4009: client connection closed
    4010: client uploaded an unknown text message
    5000: backend error; please retry
    5001: backend recognition server recognition failure; please retry
    5002: backend recognition server recognition failure; please retry