Tencent Cloud

Real-Time Speech Recognition (WebSocket)

Last updated: 2024-11-06 11:52:57
    Note:

    This API is on v2.0 and is different from API 3.0 in terms of parameter styles and error codes.

    API Description

    This API is used to recognize a real-time audio stream over the WebSocket protocol and return recognition results synchronously, achieving instant speech-to-text conversion.
    Before using this API, activate the service in the ASR console first. Then, go to the Manage API Key page to create a key and obtain the AppID, SecretId, and SecretKey required to calculate the authentication signature when calling the API.

    API Requirements

    When integrating the real-time speech recognition API, you need to comply with the following requirements:

    Item Description
    Language Supported languages include Mandarin, English, Korean, Japanese, Thai, Bahasa, Vietnamese, Malay, Filipino, Portuguese, Turkish, Arabic, Spanish, Hindi and French. You can set the language through the engine_model_type parameter.
    Industry General, finance, gaming, education, and healthcare.
    Audio attribute Sample rate: 16000 Hz or 8000 Hz
    Bit depth: 16 bits
    Channel: mono
    Audio format PCM, WAV, Opus, Speex, SILK, MP3, M4A, and AAC
    Request protocol WSS
    Request address wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameter}
    API authentication Signature-based authentication. For more information, see Signature generation.
    Response format JSON
    Data sending We recommend you send a packet of 40 ms in length every 40 ms (i.e., real-time factor (RTF) of 1:1). The corresponding PCM is 640 bytes at a sample rate of 8 kHz or 1,280 bytes at a sample rate of 16 kHz.
    An audio sending rate exceeding the 1:1 RTF or an audio packet sending interval exceeding 6 seconds may cause an error in the engine. In this case, the backend will return an error and close the connection.
    Concurrency limit The number of concurrent connections per account is 50. If you need more concurrent connections, submit a ticket for application.
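The recommended packet sizes follow directly from the audio attributes (16-bit mono PCM): bytes per packet = sample rate × 2 bytes × 0.04 s. A minimal Python sketch of this arithmetic:

```python
def packet_bytes(sample_rate_hz: int, frame_ms: int = 40) -> int:
    """Size in bytes of one packet of 16-bit mono PCM covering `frame_ms` milliseconds."""
    return sample_rate_hz * 2 * frame_ms // 1000

print(packet_bytes(8000))   # 640 bytes at 8 kHz
print(packet_bytes(16000))  # 1280 bytes at 16 kHz
```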

    API Call Process

    The API call process can be divided into two phases: handshake and recognition. In both phases, the backend returns a text message containing a serialized JSON string in the following format:

    Field Type Description
    code Integer The status code. 0: normal; other values: an error occurred.
    message String The error message. This field explains the cause of the error. Note that the returned messages are subject to changes as the service is updated or optimized.
    voice_id String The unique ID of the audio stream, which is generated by the client during the handshake and assigned in the call parameters.
    message_id String The unique ID of the message.
    result Result The latest speech recognition result.
    final Integer If this field returns 1, all of the audio stream has been recognized.

    The structure of the recognition result Result is in the following format:

    Field Type Description
    slice_type Integer The recognition result type.
  • 0: started recognizing a paragraph
  • 1: recognizing a paragraph, where voice_text_str indicates an unstable result (the recognition result of the paragraph may change)
  • 2: stopped recognizing a paragraph, where voice_text_str indicates a stable result (the recognition result of the paragraph will not change)
    Depending on the sent audio, the following slice_type sequence may be returned:
  • 0-1-2: started recognizing a paragraph, recognizing a paragraph (1 may be returned multiple times), stopped recognizing a paragraph
  • 0-2: started recognizing a paragraph, stopped recognizing a paragraph
  • 2: returned the complete recognition result of a paragraph
    index Integer The number of the result of the current paragraph in the entire audio stream, which increases from 0.
    start_time Integer The start time of the result of the current paragraph in the entire audio stream, in ms
    end_time Integer The end time of the result of the current paragraph in the entire audio stream, in ms
    voice_text_str String The text result of the current paragraph, encoded in UTF-8
    word_size Integer The number of word results of the current paragraph
    word_list Word Array The list of words contained in the current paragraph. The structure of a word is in the following format:
    word: the content of the word in string type
    start_time: the start time of the word in the entire audio stream in integer type
    end_time: the end time of the word in the entire audio stream in integer type
    stable_flag: the status of the result in integer type. 0: unstable; 1: stable.
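As an illustration of the structures above, here is a Python sketch that picks apart a result message with word-level timestamps (the message content is made up for the example, not real API output; word-level results require word_info=1):

```python
import json

# A hypothetical message with word-level results (fields as documented above).
raw = json.dumps({
    "code": 0, "message": "success", "voice_id": "demo", "message_id": "demo_1",
    "result": {
        "slice_type": 2, "index": 0, "start_time": 0, "end_time": 1500,
        "voice_text_str": "hello world", "word_size": 2,
        "word_list": [
            {"word": "hello", "start_time": 0, "end_time": 600, "stable_flag": 1},
            {"word": "world", "start_time": 620, "end_time": 1500, "stable_flag": 1},
        ],
    },
})

result = json.loads(raw)["result"]
if result["slice_type"] == 2:  # stable: this paragraph's text will not change
    for w in result["word_list"]:
        print(f"{w['word']}: {w['start_time']}-{w['end_time']} ms")
```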

    Handshake phase

    Request format

    During the handshake, the client initiates a WebSocket connection request with a URL in the following format:

    wss://asr.cloud.tencent.com/asr/v2/<appid>?{request parameter}
    

    Here, the <appid> needs to be replaced with the AppID of your Tencent Cloud account, which can be obtained on the Manage API Key page. The format of the {request parameter} is as follows:

    key1=value1&key2=value2... (both `key` and `value` need to be URL-encoded)
    

    Parameter description:

    Parameter Required Type Description
    secretid Yes String The SecretId of the key of your Tencent Cloud account, which can be obtained on the Manage API Key page.
    timestamp Yes Integer The current UNIX timestamp in seconds. If the difference between the UNIX timestamp and the current time is too large, a signature expiration error may occur.
    expired Yes Integer The UNIX timestamp of the signature expiration time in seconds. expired must be greater than timestamp, and expired - timestamp must be smaller than 90 days.
    nonce Yes Integer A random positive integer, which you need to generate on your own and can contain up to ten digits.
    engine_model_type Yes String The engine model type.
    For phone call scenarios:
    • 8k_zh: 8 kHz, for Mandarin in general scenarios;
    • 8k_en: 8 kHz, for English;
    For non-phone call scenarios:
    • 16k_zh: 16 kHz, for Mandarin in general scenarios;
    • 16k_en: 16 kHz, for English;
    • 16k_ko: 16 kHz, for Korean;
    • 16k_ja: 16 kHz, for Japanese;
    • 16k_th: 16 kHz, for Thai;
    • 16k_id: 16 kHz, for Bahasa;
    • 16k_vi: 16 kHz, for Vietnamese;
    • 16k_ms: 16 kHz, for Malay;
    • 16k_fil: 16 kHz, for Filipino;
    • 16k_pt: 16 kHz, for Portuguese;
    • 16k_tr: 16 kHz, for Turkish;
    • 16k_ar: 16 kHz, for Arabic;
    • 16k_es: 16 kHz, for Spanish;
    • 16k_hi: 16 kHz, for Hindi;
    • 16k_fr: 16 kHz, for French
    voice_id Yes String The unique identifier of each audio, which you need to generate on your own.
    voice_format No Integer The audio encoding format. The default value is 4. 1: PCM; 4: Speex (SP); 6: SILK; 8: MP3; 10: Opus (see Opus audio stream encapsulation description below); 12: WAV; 14: M4A (each segment must be a complete M4A audio); 16: AAC
    needvad No Integer 0: disables VAD; 1: enables VAD
    If an audio segment exceeds 60 seconds in length, you should enable voice activity detection (VAD).
    hotword_id No String The ID of the hot word list. If this parameter is set, the corresponding hot word list will take effect; otherwise, the default hot word list will be used.
    reinforce_hotword No Integer Whether to enable hot word enhancement. 0 (default value): disabled; 1: enabled.
    When enabled (only 8k_zh and 16k_zh are supported), homophone replacement takes effect: homophones of the words and phrases configured in the hot word list are also replaced.
    customization_id No String The ID of the self-learning model. If this parameter is set, the corresponding model will take effect; otherwise, the last launched model will be used.
    filter_dirty No Integer Whether to filter restricted words (for the Mandarin engine only). 0 (default value): does not filter; 1: filters; 2: replaces restricted words with "*".
    filter_modal No Integer Whether to filter interjections (for the Mandarin engine only). 0 (default value): does not filter; 1: filters; 2: filters strictly.
    filter_punc No Integer Whether to filter the period at the end of a sentence (for the Mandarin engine only). 0 (default value): no; 1: yes.
    filter_empty_result No Integer Whether to call back empty recognition results. 0: calls back empty results; 1 (default value): does not call back empty results.
    Note: if slice_type=0 and slice_type=2 callbacks need to arrive in pairs, set filter_empty_result=0. Paired callbacks are generally required in outbound calling scenarios, where slice_type=0 is used to detect whether a human voice is present.
    convert_num_mode No Integer Whether to intelligently convert Chinese numbers to Arabic numerals (for the Mandarin engine only). 0: directly outputs Chinese numbers; 1 (default value): intelligently converts based on the scenario; 3: enables mathematic number conversion.
    word_info No Integer Whether to display the word-level timestamp. 0 (default value): does not display; 1: displays timestamps without punctuation marks; 2: displays timestamps with punctuation marks. Supported engines include 16k_zh, 16k_en, and 16k_ja.
    vad_silence_time No Integer The speech segmentation detection threshold in ms. A silence longer than this threshold will be considered as segmentation (this parameter is commonly used in customer service scenarios and requires needvad = 1). Value range: 240–2000. We recommend you not adjust this parameter; otherwise, the recognition result may be affected. Supported engine is 16k_zh.
    max_speak_time No Integer Forced sentence segmentation. Value range: 5000–90000 (in ms); default value: 0 (disabled). When speech continues without a pause, this parameter forces segmentation (the result becomes stable, with slice_type=2). For example, in game commentary scenarios where the commentator speaks continuously and a sentence break cannot be detected, setting this parameter to 10000 triggers a slice_type=2 callback every 10 seconds.
    noise_threshold No Float The noise threshold. Default value: 0; value range: [-1, 1]. The larger the value, the more likely an audio segment is to be judged as noise; the smaller the value, the more likely it is to be judged as human voice.
    Caution: adjusting this parameter may affect recognition accuracy.
    signature Yes String The API signature parameter.
    hotword_list No String The temporary hot word list, which is used to improve recognition accuracy.
  • Single hot word limit: "hot word|weight", with a maximum of 30 characters per hot word and a weight of 1-11, such as "Tencent Cloud|5" or "ASR|11";
  • Temporary hot word list limit: multiple hot words are separated by commas, with a maximum of 128 hot words, such as "Tencent Cloud|10,Speech Recognition|5,ASR|11";
  • Difference between hotword_id (hot word list) and hotword_list (temporary hot word list):
      • hotword_id: you need to create a hot word list in the console or through the API first and then pass in the obtained ID to use the hot word feature;
      • hotword_list: a temporary hot word list is passed in directly with each request, and the cloud does not retain it. This is suitable for users with a large number of hot words;
    Note:
  • If both hotword_id and hotword_list are passed in, hotword_list takes precedence;
  • A hot word with a weight of 11 is upgraded to a super hot word. We recommend you set the weight to 11 only for important hot words that must take effect, as too many hot words with a weight of 11 will reduce the overall word accuracy.
    input_sample_rate No Integer Supports upsampling 8 kHz PCM audio to 16 kHz for recognition when the audio sample rate does not match the engine, which can effectively improve recognition accuracy. The only supported value is 8000. For example, if 8000 is passed in, the PCM audio sample rate is 8 kHz; if the 16k_zh engine is selected, the 8 kHz PCM audio can still be recognized normally by the 16k_zh engine.
    Note: This parameter applies only to PCM audio. If no value is passed in, the default behavior is kept, i.e., the engine sample rate is assumed to equal the PCM audio sample rate.
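Putting the required parameters together, here is a Python sketch of building the {request parameter} string (the values are illustrative; the signature is appended separately after signing):

```python
import time
import urllib.parse
import uuid

def build_query(secret_id: str) -> str:
    """Build the URL-encoded query string from the required parameters."""
    now = int(time.time())
    params = {
        "secretid": secret_id,
        "timestamp": now,
        "expired": now + 3600,        # must be later than `timestamp`
        "nonce": now % 10**10,        # illustrative; should be a random positive integer (<= 10 digits)
        "engine_model_type": "16k_zh",
        "voice_id": uuid.uuid4().hex[:16],
        "voice_format": 1,            # PCM
    }
    # Both keys and values must be URL-encoded.
    return "&".join(
        f"{urllib.parse.quote(str(k))}={urllib.parse.quote(str(v))}"
        for k, v in sorted(params.items())
    )
```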

    Signature generation

    1. Sort all parameters except signature in lexicographical order and concatenate them into the request URL to form the original signature string. With Appid=1259228442 and SecretId=AKID*****************************0r, the concatenated string is as follows:

      asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKID*****************************u0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN
      
    2. Sign the original signature string with HMAC-SHA1 using SecretKey and then Base64-encode the result. For example, performing this operation on the original signature string generated in the previous step with SecretKey=kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni gives:

      Base64Encode(HmacSha1("asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKID******************************0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN", "kFpwoX5RYQ2SkqpeHgqmSzHK7h3A2fni"))
      

    The obtained signature value is as follows:

    HepdTRX6u155qIPKNKC+3U0j1N0=
    
    3. After the value of signature is URL-encoded (URL encoding is required; otherwise, authentication may fail), the final request URL obtained through concatenation is:
      wss://asr.cloud.tencent.com/asr/v2/1259228442?engine_model_type=16k_zh&expired=1592380492&filter_dirty=1&filter_modal=1&filter_punc=1&needvad=1&nonce=1592294092123&secretid=AKID*****************************u0r&timestamp=1592294092&voice_format=1&voice_id=RnKu9FODFHK5FPpsrN&signature=HepdTRX6u155qIPKNKC%2B3U0j1N0%3D
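The signing steps above can be sketched in Python with the standard library (the key and parameters in any test call are placeholders, not real credentials):

```python
import base64
import hashlib
import hmac
import urllib.parse

def make_signature(appid: str, secret_key: str, params: dict) -> str:
    """Sort params, HMAC-SHA1 sign with SecretKey, Base64-encode, then URL-encode."""
    # Step 1: lexicographically sorted original signature string (no URL encoding here).
    query = "&".join(f"{k}={params[k]}" for k in sorted(params))
    raw = f"asr.cloud.tencent.com/asr/v2/{appid}?{query}"
    # Step 2: HMAC-SHA1 with SecretKey, then Base64.
    digest = hmac.new(secret_key.encode(), raw.encode(), hashlib.sha1).digest()
    # Step 3: URL-encode the Base64 value before appending it as &signature=...
    return urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
```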
      

    Opus audio stream encapsulation description

    The size of a frame to be compressed must be fixed at 640 samples (i.e., 640 16-bit shorts are compressed at a time); otherwise, decompression will fail. The encapsulated frames can be spliced together and passed to the server, and each frame of compressed data must be encapsulated in the following format:

    Magic header (four bytes) Frame data length (two bytes) Frame data
    The string "opus" The length len of the compressed frame One frame of compressed Opus data in the length of len
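A sketch of this encapsulation in Python (the big-endian byte order of the two-byte length field is an assumption; verify against the SDK samples if decoding fails):

```python
import struct

def pack_opus_frame(opus_data: bytes) -> bytes:
    """Prepend the 4-byte magic "opus" and a 2-byte length (assumed big-endian)
    to one compressed Opus frame, as described in the encapsulation table above."""
    return b"opus" + struct.pack(">H", len(opus_data)) + opus_data
```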

    Request response

    After the client initiates a connection request, the backend will establish a connection and verify the signature. If the verification is passed, the backend will return an acknowledgment message with the code 0 to indicate that the handshake is successful; otherwise, the backend will return a message with a code other than 0 and close the connection.

     {"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN"}
    

    Recognition phase

    After the handshake is successful, the backend will proceed to the recognition phase, where the client uploads audio data and receives the recognition result messages.

    Uploading data

    During recognition, the client continuously uploads binary messages containing binary audio stream data to the backend. We recommend you send a packet of 40 ms in length every 40 ms (i.e., real-time factor (RTF) of 1:1). The corresponding PCM is 640 bytes at a sample rate of 8 kHz or 1,280 bytes at a sample rate of 16 kHz. An audio sending rate exceeding the 1:1 RTF or an audio packet sending interval exceeding 6 seconds may cause an error in the engine. In this case, the backend will return an error and close the connection.
    After the audio stream is uploaded, the client needs to send the following text message to the backend to end the recognition.

    {"type": "end"}
    

    Receiving a message

    While uploading data, the client needs to receive the real-time recognition result returned by the backend synchronously. Below is a sample result:

     {"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_11_0","result":{"slice_type":0,"index":0,"start_time":0,"end_time":1240,"voice_text_str":"real-time","word_size":0,"word_list":[]}}
    
    {"code":0,"message":"success","voice_id":"RnKu9FODFHK5FPpsrN","message_id":"RnKu9FODFHK5FPpsrN_33_0","result":{"slice_type":2,"index":0,"start_time":0,"end_time":2840,"voice_text_str":"real-time speech recognition","word_size":0,"word_list":[]}}
    

    After the backend recognizes all the uploaded audio data, it will return a message with the value of final being 1 and close the connection.

    {"code":0,"message":"success","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241","final":1}
    

    If an error occurs during recognition, the backend will return a message with a code other than 0 and close the connection.

    {"code":4008,"message":"The backend recognition server timed out while waiting for an audio segment.","voice_id":"CzhjnqBkv8lk5pRUxhpX","message_id":"CzhjnqBkv8lk5pRUxhpX_241"}
    

    Developer Resources

    SDK

    Sample SDK call

    Error Codes

    Value Description
    4001 The parameter is invalid. For more information, see the message.
    4002 Authentication failed.
    4003 The AppID has not activated the service. Activate the service in the console first.
    4004 No free tier is available.
    4005 The service has been interrupted due to overdue payments under your account. Top up your account in time.
    4006 The current number of concurrent calls under the account exceeded the limit.
    4007 Audio decoding failed. Check whether the format of the uploaded audio data is consistent with the call parameter.
    4008 The client timed out while uploading data.
    4009 The client was disconnected.
    4010 An unknown text message was uploaded from the client.
    5000 A backend error occurred. Try again.
    5001 Recognition failed on the backend recognition server. Try again.
    5002 Recognition failed on the backend recognition server. Try again.