Submitting Speech Recognition Job
Last updated: 2024-03-04 10:11:42

Overview

This API is used to submit a speech recognition job.
Note:
For constraints on speech recognition, see Specifications and Limits.


Request

Sample request

POST /asr_jobs HTTP/1.1
Host: <BucketName-APPID>.ci.<Region>.myqcloud.com
Date: <GMT Date>
Authorization: <Auth String>
Content-Length: <length>
Content-Type: application/xml

<body>
Note:
Authorization: Auth String (for more information, see Request Signature).
When this feature is used by a sub-account, relevant permissions must be granted as instructed in Authorization Granularity Details.

Request headers

This API only uses Common Request Headers.

Request body

This request requires the following request body:
<Request>
  <Tag>SpeechRecognition</Tag>
  <Input>
    <Object>input/test.mp3</Object>
    <Url></Url>
  </Input>
  <Operation>
    <SpeechRecognition>
      <EngineModelType>16k_zh_video</EngineModelType>
      <ChannelNum>1</ChannelNum>
      <FilterDirty>1</FilterDirty>
      <FilterModal>1</FilterModal>
    </SpeechRecognition>
    <Output>
      <Region>ap-chongqing</Region>
      <Bucket>test-123456789</Bucket>
      <Object>output/asr.txt</Object>
    </Output>
    <UserData>This is my data.</UserData>
    <JobLevel>0</JobLevel>
  </Operation>
  <QueueId>pcd463e1467964d39ad2d3f66aacd8199</QueueId>
  <CallBack>http://callback.demo.com</CallBack>
  <CallBackFormat>JSON</CallBackFormat>
</Request>
The nodes are described as follows:
| Node Name (Keyword) | Parent Node | Description | Type | Required |
|---------------------|-------------|-------------|------|----------|
| Request | None | Request container | Container | Yes |
Request has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type | Required |
|---------------------|-------------|-------------|------|----------|
| Tag | Request | Job type: SpeechRecognition. | String | Yes |
| Input | Request | Information about the object to be operated on | Container | Yes |
| Operation | Request | Operation rule | Container | Yes |
| QueueId | Request | ID of the queue the job is in | String | Yes |
| CallBackFormat | Request | Job callback format, which can be JSON or XML (default). It takes priority over the queue's callback format. | String | No |
| CallBackType | Request | Job callback type, which can be Url (default) or TDMQ. It takes priority over the queue's callback type. | String | No |
| CallBack | Request | Job callback address, which takes priority over the queue's callback address. If set to "no", no callback is sent to the queue's callback address. | String | No |
| CallBackMqConfig | Request | TDMQ configuration for the job callback, as described in Structure > CallBackMqConfig. Required if CallBackType is TDMQ. | Container | No |
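If CallBack is set, the service notifies that address when the job finishes. As a rough illustration only (the endpoint path and the use of Flask are assumptions, and the exact callback payload is not specified in this section), a minimal receiver for a JSON-format callback might look like this Python sketch:

from flask import Flask, request  # assumes the Flask package is installed

app = Flask(__name__)

# Hypothetical endpoint matching the CallBack address configured in the job.
@app.route("/asr_callback", methods=["POST"])
def asr_callback():
    # With CallBackFormat set to JSON, the job result arrives as a JSON body.
    payload = request.get_json(force=True)
    print(payload)  # inspect the job details here
    return "", 200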
Input has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type | Required |
|---------------------|-------------|-------------|------|----------|
| Object | Request.Input | Media filename | String | No |
| Url | Request.Input | URL that supports downloading over the public network. At least one of Url and Object must be passed in; if both are passed in, Object is preferred. | String | No |
Operation has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type | Required |
|---------------------|-------------|-------------|------|----------|
| SpeechRecognition | Request.Operation | Speech recognition parameters | Container | Yes |
| Output | Request.Operation | Result output address | Container | Yes |
| UserData | Request.Operation | User information passed through, as printable ASCII characters of up to 1,024 in length | String | No |
| JobLevel | Request.Operation | Job priority. The greater the value, the higher the priority. Valid values: 0, 1, 2. Default value: 0. | String | No |
SpeechRecognition has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type | Required |
|---------------------|-------------|-------------|------|----------|
| EngineModelType | Request.Operation.SpeechRecognition | Engine model type. Phone call scenarios: 8k_zh (8 kHz Mandarin for general fields, applied to dual-channel audio); 8k_zh_s (8 kHz Mandarin with speaker diarization, applied only to mono-channel audio). Non-phone call scenarios: 16k_zh (16 kHz Mandarin for general fields); 16k_zh_video (16 kHz Mandarin for audiovisual fields); 16k_en (16 kHz English); 16k_ca (16 kHz Cantonese). | String | Yes |
| ChannelNum | Request.Operation.SpeechRecognition | Number of sound channels. 1: mono; 2: dual (for the 8k_zh engine only). | Integer | Yes |
| ResTextFormat | Request.Operation.SpeechRecognition | Format of the returned recognition result. 0: recognition result text, including the list of segment timestamps; 1: recognition result details, including the list of word timestamps (generally used to generate subtitles; for the 16 kHz Mandarin engines only). | Integer | Yes |
| FilterDirty | Request.Operation.SpeechRecognition | Whether to filter restricted words (Mandarin engines only). 0: does not filter; 1: filters; 2: replaces restricted words with *. Default value: 0. | Integer | No |
| FilterModal | Request.Operation.SpeechRecognition | Whether to filter modal particles (Mandarin engines only). 0: does not filter; 1: filters partially; 2: filters strictly. Default value: 0. | Integer | No |
| ConvertNumMode | Request.Operation.SpeechRecognition | Whether to intelligently convert Chinese numbers to Arabic numerals (Mandarin engines only). 0: directly outputs Chinese numbers; 1: intelligently converts based on the scenario. Default value: 1. | Integer | No |
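Several of the parameters above constrain each other: dual-channel audio requires the 8k_zh engine, 8k_zh_s applies only to mono-channel audio, and ResTextFormat 1 targets the 16 kHz Mandarin engines. A client-side pre-check along these lines, shown as an illustrative Python sketch rather than anything the API itself provides, can reject invalid combinations before submission:

def validate_asr_params(engine: str, channel_num: int, res_text_format: int) -> None:
    # Illustrative check of the documented parameter constraints.
    engines = {"8k_zh", "8k_zh_s", "16k_zh", "16k_zh_video", "16k_en", "16k_ca"}
    if engine not in engines:
        raise ValueError(f"unknown EngineModelType: {engine}")
    if channel_num not in (1, 2):
        raise ValueError("ChannelNum must be 1 (mono) or 2 (dual)")
    if channel_num == 2 and engine != "8k_zh":
        raise ValueError("ChannelNum 2 is supported by the 8k_zh engine only")
    if engine == "8k_zh_s" and channel_num != 1:
        raise ValueError("8k_zh_s applies only to mono-channel audio")
    if res_text_format == 1 and engine not in ("16k_zh", "16k_zh_video"):
        raise ValueError("ResTextFormat 1 is for the 16 kHz Mandarin engines only")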
Output has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type | Required |
|---------------------|-------------|-------------|------|----------|
| Region | Request.Operation.Output | Bucket region | String | Yes |
| Bucket | Request.Operation.Output | Result storage bucket | String | Yes |
| Object | Request.Operation.Output | Result file name | String | Yes |

Response

Response headers

This API only returns Common Response Headers.

Response body

The response body is application/xml data. The following sample contains all possible nodes:
<Response>
  <JobsDetail>
    <Code>Success</Code>
    <CreationTime>2021-08-05T15:43:50+0800</CreationTime>
    <EndTime>-</EndTime>
    <Input>
      <BucketId>test-1234567890</BucketId>
      <Object>input/test.mp3</Object>
      <Region>ap-chongqing</Region>
    </Input>
    <JobId>s58ccb634149211ed84ce2b1cd7fbb14a</JobId>
    <Message/>
    <Operation>
      <JobLevel>0</JobLevel>
      <Output>
        <Bucket>test-1234567890</Bucket>
        <Object>output/asr.txt</Object>
        <Region>ap-chongqing</Region>
      </Output>
      <SpeechRecognition>
        <ChannelNum>1</ChannelNum>
        <ConvertNumMode>0</ConvertNumMode>
        <EngineModelType>16k_zh_video</EngineModelType>
        <FilterDirty>0</FilterDirty>
        <FilterModal>0</FilterModal>
        <FilterPunc>0</FilterPunc>
        <OutputFileType>txt</OutputFileType>
        <ResTextFormat>0</ResTextFormat>
        <SpeakerDiarization>0</SpeakerDiarization>
        <SpeakerNumber>0</SpeakerNumber>
      </SpeechRecognition>
      <UserData>This is my data.</UserData>
    </Operation>
    <QueueId>pcd463e1467964d39ad2d3f66aacd8199</QueueId>
    <StartTime>-</StartTime>
    <State>Submitted</State>
    <Tag>SpeechRecognition</Tag>
  </JobsDetail>
</Response>
The nodes are described as follows:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| Response | None | Response container | Container |
Response has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| JobsDetail | Response | Job details | Container |
JobsDetail has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| Code | Response.JobsDetail | Error code, returned only if State is Failed | String |
| Message | Response.JobsDetail | Error message, returned only if State is Failed | String |
| JobId | Response.JobsDetail | Job ID | String |
| Tag | Response.JobsDetail | Job type: SpeechRecognition | String |
| State | Response.JobsDetail | Job status. Valid values: Submitted, Running, Success, Failed, Pause, Cancel. | String |
| CreationTime | Response.JobsDetail | Job creation time | String |
| StartTime | Response.JobsDetail | Job start time | String |
| EndTime | Response.JobsDetail | Job end time | String |
| QueueId | Response.JobsDetail | ID of the queue the job is in | String |
| Input | Response.JobsDetail | Same as the Request.Input node in the request | Container |
| Operation | Response.JobsDetail | Operation rule. Up to six operation rules are supported. | Container |
Operation has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| SpeechRecognition | Response.JobsDetail.Operation | Same as Request.Operation.SpeechRecognition in the request | Container |
| Output | Response.JobsDetail.Operation | Same as Request.Operation.Output in the request | Container |
| UserData | Response.JobsDetail.Operation | User information passed through | String |
| JobLevel | Response.JobsDetail.Operation | Job priority | String |
| SpeechRecognitionResult | Response.JobsDetail.Operation | Speech recognition job result. Not returned when there is no result. | Container |
SpeechRecognitionResult has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| AudioTime | Response.JobsDetail.Operation.SpeechRecognitionResult | Audio duration (seconds) | String |
| Result | Response.JobsDetail.Operation.SpeechRecognitionResult | Speech recognition result | String |
| FlashResult | Response.JobsDetail.Operation.SpeechRecognitionResult | Result of fast speech recognition | Container array |
| ResultDetail | Response.JobsDetail.Operation.SpeechRecognitionResult | Recognition result details, including the time offsets of words in each sentence (generally used to generate subtitles). This field is populated if ResTextFormat in the speech recognition request is 1. Note: this field may be empty, indicating that no valid values could be obtained. | Container array |
FlashResult has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| channel_id | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult | Sound channel ID, which starts from 0 and corresponds to the number of audio channels | Integer |
| text | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult | Complete recognition result of the sound channel | String |
| sentence_list | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult | List of sentence/paragraph-level recognition results | Container array |
sentence_list has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| text | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list | Sentence/paragraph-level text | String |
| start_time | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list | Start time | Integer |
| end_time | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list | End time | Integer |
| speaker_id | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list | Speaker ID (if speaker diarization is enabled in the request, speakers can be distinguished by speaker_id) | Integer |
| word_list | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list | List of word-level recognition results | Container array |
word_list has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| word | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list.word_list | Word-level text | String |
| start_time | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list.word_list | Start time | Integer |
| end_time | Response.JobsDetail.Operation.SpeechRecognitionResult.FlashResult.sentence_list.word_list | End time | Integer |
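Once a FlashResult has been parsed into native structures (for example, dicts keyed by the field names above; the parsing step itself is assumed here and not part of the API), walking it per channel and sentence is straightforward, as this Python sketch shows:

def print_flash_result(flash_result: list) -> None:
    # flash_result: parsed FlashResult items with the fields listed above.
    for channel in flash_result:
        print(f"channel {channel['channel_id']}: {channel['text']}")
        for sentence in channel.get("sentence_list", []):
            span = f"{sentence['start_time']}-{sentence['end_time']}"
            who = sentence.get("speaker_id", 0)
            print(f"  [{span}] speaker {who}: {sentence['text']}")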
ResultDetail has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| FinalSentence | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Final recognition result of a single sentence | String |
| SliceSentence | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Intermediate recognition result of a single sentence, split into words by spaces | String |
| StartMs | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Start time of a single sentence (milliseconds) | String |
| EndMs | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | End time of a single sentence (milliseconds) | String |
| WordsNum | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Number of words in a single sentence | String |
| SpeechSpeed | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Speed of a single sentence, in words per second | String |
| SpeakerId | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Sound channel or speaker ID (if speaker diarization is enabled or ChannelNum is set to 2 in the request, speakers or channels can be distinguished) | String |
| Words | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail | Word details in a single sentence | Container array |
Words has the following sub-nodes:
| Node Name (Keyword) | Parent Node | Description | Type |
|---------------------|-------------|-------------|------|
| Word | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail.Words | Word text | String |
| OffsetStartMs | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail.Words | Start time offset within the sentence | String |
| OffsetEndMs | Response.JobsDetail.Operation.SpeechRecognitionResult.ResultDetail.Words | End time offset within the sentence | String |
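Because ResultDetail carries sentence-level start and end times in milliseconds, it is the natural input for subtitle generation, as the descriptions above note. A Python sketch that turns parsed ResultDetail entries (assumed here to be dicts keyed by the field names above) into SRT cues:

def ms_to_srt(ms: int) -> str:
    # Format milliseconds as an SRT timestamp: HH:MM:SS,mmm.
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{milli:03d}"

def result_detail_to_srt(entries: list) -> str:
    # entries: parsed ResultDetail items with FinalSentence/StartMs/EndMs.
    cues = []
    for i, e in enumerate(entries, start=1):
        start, end = ms_to_srt(int(e["StartMs"])), ms_to_srt(int(e["EndMs"]))
        cues.append(f"{i}\n{start} --> {end}\n{e['FinalSentence']}\n")
    return "\n".join(cues)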

Error codes

No special error message is returned for this request. For common error messages, see Error Codes.

Examples

Request

POST /asr_jobs HTTP/1.1
Authorization: q-sign-algorithm=sha1&q-ak=AKIDZfbOAo7cllgPvF9cXFrJD0a1ICvR****&q-sign-time=1497530202;1497610202&q-key-time=1497530202;1497610202&q-header-list=&q-url-param-list=&q-signature=28e9a4986df11bed0255e97ff90500557e0ea057
Host: test-1234567890.ci.ap-beijing.myqcloud.com
Content-Length: 166
Content-Type: application/xml

<Request>
  <Tag>SpeechRecognition</Tag>
  <Input>
    <Object>input/test.mp3</Object>
  </Input>
  <Operation>
    <SpeechRecognition>
      <EngineModelType>16k_zh_video</EngineModelType>
      <ChannelNum>1</ChannelNum>
      <FilterDirty>1</FilterDirty>
      <FilterModal>1</FilterModal>
    </SpeechRecognition>
    <Output>
      <Region>ap-chongqing</Region>
      <Bucket>test-123456789</Bucket>
      <Object>output/asr.txt</Object>
    </Output>
    <UserData>This is my data.</UserData>
    <JobLevel>0</JobLevel>
  </Operation>
  <QueueId>pcd463e1467964d39ad2d3f66aacd8199</QueueId>
  <CallBack>http://callback.demo.com</CallBack>
  <CallBackFormat>JSON</CallBackFormat>
</Request>
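The same request can also be issued programmatically. Below is a minimal Python sketch, assuming the third-party requests library and a pre-computed Authorization string (signature calculation is covered in Request Signature and omitted here); the host, bucket, and queue ID are the placeholder values from this example:

import requests  # assumed third-party dependency

HOST = "test-1234567890.ci.ap-beijing.myqcloud.com"  # placeholder bucket/region
AUTH = "<Auth String>"  # compute per Request Signature; not shown here

body = """<Request>
  <Tag>SpeechRecognition</Tag>
  <Input><Object>input/test.mp3</Object></Input>
  <Operation>
    <SpeechRecognition>
      <EngineModelType>16k_zh_video</EngineModelType>
      <ChannelNum>1</ChannelNum>
    </SpeechRecognition>
    <Output>
      <Region>ap-chongqing</Region>
      <Bucket>test-123456789</Bucket>
      <Object>output/asr.txt</Object>
    </Output>
  </Operation>
  <QueueId>pcd463e1467964d39ad2d3f66aacd8199</QueueId>
</Request>"""

# Submit the speech recognition job and print the raw XML response.
resp = requests.post(
    f"https://{HOST}/asr_jobs",
    data=body.encode("utf-8"),
    headers={"Authorization": AUTH, "Content-Type": "application/xml"},
)
print(resp.status_code)
print(resp.text)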

Response

HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 230
Connection: keep-alive
Date: Mon, 28 Jun 2022 15:23:12 GMT
Server: tencent-ci
x-ci-request-id: NTk0MjdmODlfMjQ4OGY3XzYzYzhf****

<Response>
  <JobsDetail>
    <Code>Success</Code>
    <CreationTime>2021-08-05T15:43:50+0800</CreationTime>
    <EndTime>-</EndTime>
    <Input>
      <BucketId>test-1234567890</BucketId>
      <Object>input/test.mp3</Object>
      <Region>ap-chongqing</Region>
    </Input>
    <JobId>s58ccb634149211ed84ce2b1cd7fbb14a</JobId>
    <Message/>
    <Operation>
      <JobLevel>0</JobLevel>
      <Output>
        <Bucket>test-1234567890</Bucket>
        <Object>output/asr.txt</Object>
        <Region>ap-chongqing</Region>
      </Output>
      <SpeechRecognition>
        <ChannelNum>1</ChannelNum>
        <ConvertNumMode>0</ConvertNumMode>
        <EngineModelType>16k_zh_video</EngineModelType>
        <FilterDirty>0</FilterDirty>
        <FilterModal>0</FilterModal>
        <FilterPunc>0</FilterPunc>
        <OutputFileType>txt</OutputFileType>
        <ResTextFormat>0</ResTextFormat>
        <SpeakerDiarization>0</SpeakerDiarization>
        <SpeakerNumber>0</SpeakerNumber>
      </SpeechRecognition>
      <UserData>This is my data.</UserData>
    </Operation>
    <QueueId>pcd463e1467964d39ad2d3f66aacd8199</QueueId>
    <StartTime>-</StartTime>
    <State>Submitted</State>
    <Tag>SpeechRecognition</Tag>
  </JobsDetail>
</Response>
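To act on this response, a caller typically extracts the job ID and state and then polls the job or waits for the callback. A small parsing sketch using only Python's standard library, with the field names as documented above:

import xml.etree.ElementTree as ET

def parse_submit_response(xml_text: str) -> dict:
    # Pull out the fields most callers need to track the job.
    detail = ET.fromstring(xml_text).find("JobsDetail")
    return {
        "JobId": detail.findtext("JobId"),
        "State": detail.findtext("State"),
        "Code": detail.findtext("Code"),
        "Message": detail.findtext("Message"),
    }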
