Tencent Cloud

Voice Clone Recording Guide - Basic Edition

Last updated: 2024-09-18 20:42:48

    I. Custom Material Self-Check Items

    Voice cloning requires an audio recording containing 100 sentences. Before submitting, check each of the following items:
    1. Ensure no other voices are recorded besides the voice of the person being cloned.
    2. The audio recording should have a moderate volume, with no noticeable reverb, background noise, or other disturbances.
    3. Record in Mandarin Chinese; the text should be diverse, with no excessive repetition of sentences.
    Audio format requirements:
    1. All audio files must be converted to WAV format and submitted as a compressed ZIP package.
    2. Select the audio files themselves and compress them directly into a ZIP package (do not put them in a new folder before compressing). The ZIP file should not exceed 1 GB.
    3. Each audio file must have a sampling rate of 24 kHz or higher, and the length of each file should not exceed 1 hour.
    4. Audio file names should not contain spaces or special characters.
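    The format requirements above lend themselves to an automated check before submission. The following is a minimal Python sketch, not an official tool. The function names check_wav and build_zip are hypothetical, the file-name rule is interpreted here as ASCII letters, digits, underscores, and hyphens only, and 1 GB is assumed to mean 1024^3 bytes. It relies only on the standard library, whose wave module reads uncompressed PCM WAV files.

```python
import re
import wave
import zipfile
from pathlib import Path

MIN_SAMPLE_RATE_HZ = 24_000   # "24 kHz or higher"
MAX_DURATION_S = 60 * 60      # "should not exceed 1 hour"
MAX_ZIP_BYTES = 1024 ** 3     # "should not exceed 1 GB" (assumed to mean 1 GiB)
# Assumption: "no spaces or special characters" is read as
# ASCII letters, digits, '_' and '-' followed by a .wav extension.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+\.wav$")


def check_wav(path):
    """Return a list of problems found for one WAV file (empty list = OK)."""
    path = Path(path)
    problems = []
    if not SAFE_NAME.match(path.name):
        problems.append("file name contains spaces or special characters")
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            duration_s = wav.getnframes() / rate
    except (wave.Error, EOFError) as exc:
        problems.append(f"not a readable PCM WAV file: {exc}")
        return problems
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below 24 kHz")
    if duration_s > MAX_DURATION_S:
        problems.append(f"duration {duration_s:.0f} s exceeds 1 hour")
    return problems


def build_zip(wav_paths, zip_path):
    """Write the files at the top level of the ZIP (no enclosing folder)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for p in map(Path, wav_paths):
            # arcname drops any directory prefix, so the archive stays flat
            archive.write(p, arcname=p.name)
    size = Path(zip_path).stat().st_size
    if size > MAX_ZIP_BYTES:
        raise ValueError(f"ZIP is {size} bytes, which exceeds 1 GB")
    return size
```

    A typical use would be to call check_wav on every file, fix anything it reports, and only then call build_zip on the full list of files.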

    II. Audio Recording Guide (Text Version)

    Recording content

    Read the 100 sentences in sequence, following a pause-read-pause cycle, and record the audio.
    Recording text: You may choose text from your field of expertise, or use the reference texts in the attachment. The more sentences you include, the better the training results will be.
    Text requirements: The text must be in Chinese characters. Individual sentences should not exceed 50 characters, with an average sentence length of around 20 characters.
    Number of audio files: The recording can be a single continuous segment or divided into multiple segments, with a maximum of 10 files.
    Audio format: It is recommended to use lossless WAV format for recording (specific formats are not restricted), with a sampling rate of no less than 24 kHz.
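    The text requirements above can be checked on a script before recording starts. This is a minimal Python sketch under stated assumptions: the function name check_script is hypothetical, sentence length is assumed to count only Chinese characters (punctuation is ignored), and a Chinese character is approximated as one in the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
MAX_SENTENCE_CHARS = 50   # "should not exceed 50 characters"
MIN_SENTENCES = 100       # the guide asks for 100 sentences


def is_han(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def check_script(sentences):
    """Summarize a list of sentences against the text requirements."""
    # Assumption: only Chinese characters count toward sentence length.
    lengths = [sum(1 for ch in s if is_han(ch)) for s in sentences]
    too_long = [i + 1 for i, n in enumerate(lengths) if n > MAX_SENTENCE_CHARS]
    return {
        "count": len(sentences),
        "enough_sentences": len(sentences) >= MIN_SENTENCES,
        "too_long": too_long,  # 1-based numbers of sentences over the limit
        "average_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "duplicates": len(sentences) - len(set(sentences)),
    }
```

    The average_length field can be compared against the suggested average of around 20 characters, and a nonzero duplicates count points to repeated sentences that reduce the diversity of the text.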

    Notes

    The environment should be quiet with no background noise. It is recommended to use a microphone with a windscreen, keeping it within 10 cm of the mouth, and maintaining a moderate volume.
    Avoid recording in rooms with smooth walls or floors, such as large glass walls or marble floors, to prevent introducing reverb.
    Familiarize yourself with the text before recording to avoid interruptions or disjointed reading.
    Be careful to avoid microphone popping.
    Pause naturally at the end of each sentence; during the sentence, pause naturally according to the text’s normal flow.
    Read with rhythm and intonation that reflects your natural speaking style.
    Articulate clearly, ensuring that all pronunciations are accurate.
    Avoid making any other movements while speaking to prevent unnecessary noises (e.g., clothing rustle or swallowing sounds).
    Note:
    The quality of the custom audio is closely tied to the original recording. High-quality audio will result in a better voice clone, while poor audio quality will lead to a subpar final result.
    For example, if the original audio contains noise, the final customized output will also include that noise.

    III. Typical Issues

    Popping sounds
    Avoid popping sounds, which typically occur when the microphone is too close or lacks a pop filter, or when the recording volume is too high.
    Lip smacks, saliva noises, breathing, and microphone pops
    Avoid excessive lip smacking, saliva noises, and noticeable breathing sounds caused by frequent mouth opening and closing or swallowing during recording. Minimize microphone pops as well.
    Noise and reverb
    Avoid placing the microphone too far from the mouth, and avoid recording in environments with significant background noise, such as voices, air conditioning, or background music. Also avoid introducing reverb, which is often pronounced in rooms with many glass surfaces or smooth walls.
    Missing frequency spectrum
    Avoid using recording software with built-in enhancement or noise reduction modules, as these can damage the original audio and result in missing frequency bands in the spectrum.

    IV. Audio Quality Detection API Metric Descriptions

    The Audio Quality Testing Task Creation API currently detects the following metrics, which help identify issues in the audio:
    Signal-to-noise ratio (SNR): The ratio of the useful signal energy to the noise energy in the audio. The higher, the better. An SNR of 30 or above is considered acceptable.
    Causes of low SNR:
    This may be due to a noisy recording environment. Consider recording in a quieter location.
    It could also be due to the mouth being too far from the microphone, resulting in insufficient useful signal energy. Adjust the distance between the microphone and the mouth to around 10 cm. (Being too close may cause microphone pops or clipping.)
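    The SNR definition above (useful signal energy over noise energy) can be illustrated with a short calculation. This is a sketch only: the guide does not state the unit or the exact method the API uses, so the decibel scale, the function names, and the idea of measuring noise from a pause between sentences are all assumptions made for illustration.

```python
import math


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def estimate_snr_db(speech, noise):
    """Estimate SNR in dB from a speech segment and a noise-only segment.

    Assumption: `speech` contains signal plus noise, so the noise power is
    subtracted to approximate the power of the useful signal alone.
    """
    noise_power = mean_power(noise)
    signal_power = max(mean_power(speech) - noise_power, 0.0)
    if noise_power == 0.0:
        return math.inf       # no measurable noise
    if signal_power == 0.0:
        return -math.inf      # speech segment is no louder than the noise
    return 10.0 * math.log10(signal_power / noise_power)
```

    With the pause-read-pause recording style, the samples from a pause can serve as the noise segment and the samples from a spoken sentence as the speech segment. If the API's threshold of 30 is indeed expressed in decibels, a local estimate well below that suggests a noisy room or a microphone placed too far away.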
    Reverberation index: The ratio of useful signal energy to echo energy in the audio. The higher, the better. A value of 30 or above is considered acceptable.
    Causes of low reverberation index:
    It may be due to an unsuitable recording environment that produces echoes. Large spaces or hard walls can easily generate echoes. Try to record in smaller spaces with more soft surfaces, such as a bedroom or inside a car.
    Clipping: Parts of the audio exceed the maximum allowable amplitude; simply put, the recording level is too high. A value of 0 or less is considered acceptable.
    Causes of clipping:
    This is typically caused by the mouth being too close to the microphone during recording. Adjust the distance between the microphone and the mouth to about 10 cm.
    It may also be due to the recording software's volume setting being too high. This can be resolved by lowering the volume in the recording software.
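    A recording can be screened for clipping locally by looking for samples that sit at the limit of the sample format. The following Python sketch is an illustration, not the API's method, which the guide does not describe. The function name clipped_ratio is hypothetical, and the sketch assumes 16-bit PCM WAV input, where full scale is -32768 to 32767.

```python
import array
import sys
import wave

FULL_SCALE_16BIT = 32767


def clipped_ratio(path, threshold=FULL_SCALE_16BIT):
    """Fraction of samples at or beyond full scale in a 16-bit PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h", raw)   # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()            # WAV sample data is little-endian
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)
```

    A result of 0.0 means no sample reached full scale. Any larger value is a hint to increase the microphone distance or lower the recording level and record again.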
    [Figure: waveform illustration of clipped audio]
    [Figure: waveform illustration of audio with no clipping]
    Audio examples:
    The attachment includes example audio clips labeled High-Quality Audio, Reverberation Not Meeting Standards, Signal-to-Noise Ratio Not Meeting Standards, Both SNR and Reverberation Not Meeting Standards, and Audio with Clipping. These are available for download and listening.
    Audio_examples.zip (1.1 MB)
    
    