Scenario Introduction
According to data from iiMedia Research, in 2021 the number of online Karaoke users in China was about 510 million, a penetration rate of approximately 49.7%. Online Karaoke offers a more immersive experience, and its diverse gameplay caters to the personalized needs of different user groups, making it one of the main segments of the online pan-entertainment field. Building on network technology innovations, online Karaoke apps continue to launch diverse singing patterns and gameplay, and these continuously enriched features have enhanced their practicality and playability. This document provides a detailed introduction to the online Karaoke scenario-based solution built on Tencent Real-Time Communication (TRTC) in the following sections.
Implementation Scheme
Module | Features |
Room Management | Room list, create a room, enter a room, exit a room, and terminate a room. |
Seat Management | Become a speaker/listener, seat control, change a speaker, lock the seat, invite a listener to speak, and mute a speaker. |
Song Selection Management | Song list display, song search, song selection, queue management, and selected song list. |
Karaoke Management | Karaoke play mode, start/stop/switch songs, accompaniment and vocal volume adjustment, reverb/sound effects, original sound and accompaniment switch, and lyrics synchronization. |
Scoring Management | Singing scoring and pitch line display. |
The overall business process of the online Karaoke scene is shown in the following diagram. The room owner creates a Karaoke room, and users can choose the Karaoke room they are interested in to enter. After entering the room, users can request to speak to participate in the interaction. After becoming a speaker, they can also choose their favorite songs to sing and wait in line. When it is their turn, they can sing along with the accompaniment. Of course, users can also choose to become a speaker directly to participate in a chorus. These are two different Karaoke play modes. During the singing process, there will be pitch scoring for individual sentences, and there will also be singing scoring for the entire song after the singing is finished.
Room Management
The Room Management module is primarily responsible for maintaining the room list, including functions such as creating a room, entering a room, exiting a room, and terminating a room. Additionally, a Karaoke room differs from a regular room in that it requires a separate Karaoke room identifier to initiate the related component management: Song Selection Management, Karaoke Management, Scoring Management, etc.
Create a Room: After users log in to the business system, they can create a room. The room list needs to be updated after a room is created.
Enter a Room: Users can choose to enter an existing room. Upon entering, the current list of room members should be updated.
Exit a Room: Users can choose to exit the current room. Upon exiting, the current list of room members needs to be updated with a delete operation.
Terminate a Room: After all users exit the room, it needs to be terminated. Upon termination, the room list needs to be updated with a delete operation.
Note:
Room Management is a necessary functional module for implementing online Karaoke but is not the main functional module. Specific implementation can be achieved through integration with business systems and IM&TRTC SDKs. For details, see Voice Chat Room > Room Management.
Seat Management
In a Karaoke room, seats are generally orderly and limited. Seat management primarily involves defining the number of seats in the room based on the business scenario, as well as managing the status of all seats in the current room. Seat management includes features such as become a speaker/listener, seat control, change a speaker, lock the seat, invite a listener to speak, and mute a speaker.
After users enter a room, only idle seats can be applied for.
After the room owner approves a user's speaker request, the corresponding seat status should change to occupied.
When a user stops streaming and becomes a listener, the corresponding seat status should revert to idle.
The room owner has the authority to lock the seat, invite a listener to speak, remove a speaker, mute a speaker, etc.
Note:
Seat management is a necessary functional module for implementing online Karaoke but is not the main functional module. Specific implementation can be achieved through integration with business systems and IM&TRTC SDKs. For details, see Voice Chat Room > Seat Management.
Song Selection Management
Basic Introduction
Song selection management is an important part of the online Karaoke scenario, mainly including features such as song list display, song search, song selection, queue management, and the selected song list. Each Karaoke room needs to maintain a selected song list and an automatic queue management feature, both of which need to be implemented by the business backend. Meanwhile, for aspects related to accompaniment resources, such as song list display and song search, it is recommended that overseas users integrate accompaniment library products.
Implementation Process
The entire song selection management process mainly involves the business-side app, the business backend, and the music library. Their respective functions are as follows (a minimal sketch of the selected-song queue kept by the business backend follows this list):
Business-Side App:
Call the song selection API to report song information.
Call the song-switching API to notify the business backend.
Call the singing confirmation API to notify the business backend.
Business Backend:
Maintain the selected song list.
Send notifications to tell the business-side app to switch the song.
Music Library:
Provide authorized music resources for TRTC to play.
Provide lyric files and pitch files matching the music resources.
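The selected song list and automatic queue management live in the business backend. The following is a minimal, illustrative Java sketch of such a queue; the SelectedSongQueue and SelectedSong names are hypothetical, and a production backend would persist this state per room rather than keeping it in memory.
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory model of the selected song list kept by the business backend.
public class SelectedSongQueue {
    public static class SelectedSong {
        public final String songId; // ID of the song in the music library
        public final String userId; // user who selected the song
        public SelectedSong(String songId, String userId) {
            this.songId = songId;
            this.userId = userId;
        }
    }

    private final Deque<SelectedSong> queue = new ArrayDeque<>();

    // Called when the business-side app reports a song selection.
    public synchronized void addSong(String songId, String userId) {
        queue.addLast(new SelectedSong(songId, userId));
    }

    // Called when the current song finishes or is switched; the backend then
    // notifies the business-side app which song to play next.
    public synchronized SelectedSong nextSong() {
        return queue.pollFirst();
    }

    // Called when a speaker becomes a listener and their selected songs must be cleared.
    public synchronized void removeSongsOf(String userId) {
        queue.removeIf(s -> s.userId.equals(userId));
    }
}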
Karaoke Management
The Karaoke system primarily includes functions such as: Karaoke play mode, start/stop/switch songs, accompaniment and vocal volume adjustment, reverb/sound effects, original sound and accompaniment switch, and lyrics synchronization. Below, we introduce the implementation process of the Karaoke management module in detail through two typical Karaoke gameplay modes: solo singing and real-time chorus.
Solo Singing
Solo singing: Primarily used in the interactive Karaoke scenario with multiple participants. After anchors/audience members become speakers, they can select songs. Once a song is selected successfully, it is displayed in the shared selected song list. When it is a user's turn, they play the song's accompaniment, start singing, and are scored.
Solution Architecture
The overall solution primarily relies on the music library for song and lyric resources, and on TRTC for publishing and playing the singer's vocals and the song accompaniment. The solution architecture is as follows:
Specific Implementation
In the solo singing scenario, different roles have different implementation processes. There are two roles: singer and audience. The description and differences of their roles are detailed in the table below:
Role | Description | Key Capabilities |
Singer | The singer in the Karaoke room is an anchor/audience member who becomes a speaker, selects a song, and sings. After the singer leaves the room, the room is automatically dissolved and the selected song list is automatically cleared. | The role must be an anchor; upstreams audio and video (pushes black frames if there is no video); plays BGM; sends SEI messages (lyric information); selects songs. |
Audience | The audience in the Karaoke room plays the media streams of the singer and other members. | The role is an audience member, but can become an anchor by becoming a speaker; plays downstream audio and video streams; receives SEI messages (lyric information). |
Implementation Process
Singer
1. The anchor/audience creates/enters a TRTC room, and automatically becomes a singer after selecting a song.
2. After the singer selects a song, the song/lyric is downloaded, and then the song is played through the BGM interface.
3. If the singer is not upstreaming video, they need to enable black frame insertion so that a video stream is still published to carry the SEI messages.
4. Synchronize the lyric progress of everyone through SEI information.
5. When the singer becomes a listener, all songs they selected will be cleared, and they revert to their original role.
6. After the anchor/audience exits the room, the TRTC room will be dissolved.
Note:
Anchors/audiences on the seats can select songs for themselves or for others, but the corresponding singer must be the one who plays the BGM; otherwise, latency (about 300 ms or more) may cause the vocals and the song to be out of sync. The sketch below shows how the singer plays the accompaniment as BGM and broadcasts its progress through SEI.
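The following is an illustrative Java sketch of the singer-side flow described above (steps 2 and 4 plus the BGM note), based on the TRTC Android SDK. The class name, MUSIC_ID, the accompaniment path, and the SEI payload fields are assumptions, and the TXAudioEffectManager observer and sendSEIMsg calls should be verified against the SDK version you integrate.
import org.json.JSONException;
import org.json.JSONObject;

import com.tencent.liteav.audio.TXAudioEffectManager;
import com.tencent.trtc.TRTCCloud;

// Illustrative singer-side helper: play the downloaded accompaniment as BGM and
// broadcast the playback progress via SEI so audiences can synchronize lyrics.
public class SingerBgmHelper {
    private static final int MUSIC_ID = 1; // placeholder BGM ID

    public static void startSinging(final TRTCCloud trtcCloud, String accompanimentPath) {
        TXAudioEffectManager effectManager = trtcCloud.getAudioEffectManager();

        TXAudioEffectManager.AudioMusicParam param =
                new TXAudioEffectManager.AudioMusicParam(MUSIC_ID, accompanimentPath);
        param.publish = true; // mix the accompaniment into the published stream

        // Observe playback progress and broadcast it through SEI for lyric sync.
        effectManager.setMusicObserver(MUSIC_ID, new TXAudioEffectManager.TXMusicPlayObserver() {
            @Override
            public void onStart(int id, int errCode) { }

            @Override
            public void onPlayProgress(int id, long curPtsMs, long durationMs) {
                try {
                    JSONObject sei = new JSONObject();
                    sei.put("musicId", id);          // hypothetical payload field
                    sei.put("progressMs", curPtsMs); // hypothetical payload field
                    trtcCloud.sendSEIMsg(sei.toString().getBytes(), 1);
                } catch (JSONException ignored) { }
            }

            @Override
            public void onComplete(int id, int errCode) { }
        });

        effectManager.startPlayMusic(param);
    }
}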
Audience
1. Anchor/Co-Anchor/Audience can create/enter a TRTC room.
2. Monitor room song changes and load lyrics.
3. Pull the singer's stream.
4. Parse the SEI messages sent by the singer and synchronize the lyrics (see the sketch below).
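A matching audience-side sketch in Java, assuming the same hypothetical JSON SEI payload as the singer sketch above; the LyricsView interface is a placeholder for your lyrics control.
import org.json.JSONException;
import org.json.JSONObject;

import com.tencent.trtc.TRTCCloudListener;

// Illustrative audience-side listener: parse the SEI messages sent by the singer
// and move the local lyrics control to the reported position.
public class AudienceSeiListener extends TRTCCloudListener {

    // Placeholder for your lyrics control.
    public interface LyricsView {
        void seekTo(long progressMs);
    }

    private final LyricsView lyricsView;

    public AudienceSeiListener(LyricsView lyricsView) {
        this.lyricsView = lyricsView;
    }

    @Override
    public void onRecvSEIMsg(String userId, byte[] data) {
        try {
            JSONObject sei = new JSONObject(new String(data));
            long progressMs = sei.getLong("progressMs"); // same hypothetical field as the singer sketch
            lyricsView.seekTo(progressMs);
        } catch (JSONException ignored) {
            // Not a lyric-progress message; ignore it.
        }
    }
}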
Real-Time Chorus
Real-time chorus refers to playing the song accompaniment simultaneously on all ends based on co-anchoring (mic connection), and then singing together on the mic. In a duo pattern, the lead and backing vocals can hear each other; in a multi-person pattern, all chorus singers can hear each other with almost no delay, achieving a true real-time chorus.
Solution Architecture
In terms of media streams, the singers publish streams to and play back streams from each other, while the lead singer uploads the accompaniment and the other singers play the accompaniment locally, synchronized via NTP. Additionally, the accompaniment and all singers' voices are mixed through a mixing bot to form a single stream, which is then pushed back to the TRTC room, so the audience hears the synchronized voices from all ends by pulling a single stream, achieving a multi-person chorus effect. The real-time chorus solution architecture is shown in the figure below:
The advantages of this solution are:
It reduces end-to-end latency.
It provides a solution for users to join the chorus midway.
It accurately synchronizes accompaniment, lyrics, and vocals between different ends.
It reduces the impact of differences in device performance, local clock accuracy, and network latency across different ends.
Note:
Depending on business needs, you can choose a real-time chorus solution for audio-only or audio and video scenarios. If it is a pure audio scenario, black frames need to be added to send SEI messages for lyric synchronization.
The lead singer needs to use a sub-instance to upstream both the accompaniment and vocals at the same time; other singers only need to pull each other's vocal streams and play accompaniment locally; the audience only needs to pull one mixed stream.
Specific Implementation
In online Karaoke rooms, different roles have different feature permissions and implementation processes, divided into three roles: lead singer, chorus, and audience, as shown in the table below:
Role | Description | Key Capabilities |
Lead Singer | The lead singer is responsible for selecting songs, sending the chorus signaling, and sending SEI messages. | The role must be an anchor; upstreams accompaniment and vocals; selects songs and initiates the chorus; initiates the mixed-stream push-back; sends SEI messages. |
Chorus | The chorus singer receives and processes the chorus signaling and participates in the chorus on the seat. | The role must be an anchor; upstreams vocals; plays the accompaniment locally; receives the chorus signaling. |
Audience | After entering the Karaoke room, the audience can pull the streams from the seats and can also participate in the chorus after taking a seat. | The role must be an audience member; plays the downstream mixed stream; receives SEI messages; can request to become an anchor. |
Implementation Process
Lead Singer
1. The lead singer selects a song and sends the chorus signaling.
2. The lead singer creates a sub-instance to push vocals and accompaniment and pulls the vocals of the other singers (a sketch of this dual-instance setup follows this list).
3. After streaming, the lead singer is responsible for initiating the mixed stream push task.
4. After starting to sing, play the accompaniment and synchronize the lyrics through the playback progress callback.
5. SEI messages need to be sent to synchronize the song progress on the audience end.
6. All singers need to calibrate the local song playback progress according to NTP.
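The following is a rough, illustrative Java sketch of the lead singer's dual-instance setup based on the TRTC Android SDK, under the assumption that the main instance publishes the microphone vocals while a sub-instance created with createSubCloud enters the same room under a separate user ID and publishes only the accompaniment. The user ID, SDKAppID, UserSig, and accompaniment path are placeholders, and how the sub-instance publishes the accompaniment should be adapted to your SDK version.
import com.tencent.liteav.audio.TXAudioEffectManager;
import com.tencent.trtc.TRTCCloud;
import com.tencent.trtc.TRTCCloudDef;

// Illustrative dual-instance setup for the lead singer: the main instance publishes
// the microphone vocals, while a sub-instance enters the same room with a dedicated
// user ID and publishes only the accompaniment played as BGM.
public class LeadSingerStreams {
    private static final int MUSIC_ID = 1; // placeholder BGM ID

    public static void start(TRTCCloud mainCloud, int roomId, String accompanimentPath) {
        // 1. Main instance: publish the vocals captured from the microphone.
        mainCloud.startLocalAudio(TRTCCloudDef.TRTC_AUDIO_QUALITY_MUSIC);

        // 2. Sub-instance: enter the same room with a separate user ID for the accompaniment.
        TRTCCloud subCloud = mainCloud.createSubCloud();
        TRTCCloudDef.TRTCParams params = new TRTCCloudDef.TRTCParams();
        params.sdkAppId = 0;               // placeholder: your SDKAppID
        params.userId = "lead_singer_bgm"; // placeholder accompaniment user ID
        params.userSig = "...";            // placeholder: generated by your server
        params.roomId = roomId;
        params.role = TRTCCloudDef.TRTCRoleAnchor;
        subCloud.enterRoom(params, TRTCCloudDef.TRTC_APP_SCENE_LIVE);

        // 3. Publish an audio stream on the sub-instance but mute its microphone,
        //    so that only the accompaniment is heard (one possible approach).
        subCloud.startLocalAudio(TRTCCloudDef.TRTC_AUDIO_QUALITY_MUSIC);
        subCloud.setAudioCaptureVolume(0);

        // 4. Play the accompaniment on the sub-instance and publish it to the room.
        TXAudioEffectManager.AudioMusicParam music =
                new TXAudioEffectManager.AudioMusicParam(MUSIC_ID, accompanimentPath);
        music.publish = true;
        subCloud.getAudioEffectManager().startPlayMusic(music);
    }
}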
Chorus
1. The chorus singer pushes a vocal stream and pulls the vocal streams of the other chorus users on the seats.
2. The singers need to listen for and receive the chorus signaling, and preload the accompaniment resources.
3. After they start to sing and play the accompaniment locally, the singers synchronize the lyrics through the playback progress callback.
4. All singers need to calibrate the local song playback progress according to NTP.
Audience
1. Upon entering the TRTC room, the audience receives the mixed chorus stream.
2. Parse the song progress information in the SEI of the mixed stream for lyric synchronization.
3. After the audience becomes a speaker, they stop pulling the mixed stream, switch to pulling the vocal streams on the seats, and start the chorus mode.
Scoring Management
Basic Introduction
The scoring feature is also one of the mainstream play methods in the Karaoke scenario, mainly evaluating pitch accuracy and vocal quality during the actual singing process. It can be used for score comparisons after multiple people have sung.
Implementation Process
In the entire scoring management process, singers and audiences have different implementations based on their user roles. Scoring is usually done locally by the singer and synchronized with other people in the room.
Singer
The anchor/audience creates/enters a TRTC room, and automatically becomes a singer after selecting a song.
After the singer selects a song, the song/lyric is downloaded, and then the song is played through the BGM interface.
The vocals captured by TRTC and the progress of the BGM playback are transmitted in real-time to the rating module.
After the rating module produces scoring data in real time, the results are synchronized to everyone in the room through SEI (see the sketch below).
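A minimal Java sketch of that last step, assuming the per-sentence score is packed into a small JSON payload and broadcast with sendSEIMsg; the payload fields are hypothetical, and the rating module itself is assumed to be provided separately.
import org.json.JSONException;
import org.json.JSONObject;

import com.tencent.trtc.TRTCCloud;

// Illustrative helper: after the local rating module produces a score for a sentence,
// the singer broadcasts it through SEI so every member of the room can render it.
public class ScoreBroadcaster {
    public static void sendSentenceScore(TRTCCloud trtcCloud, int sentenceIndex, int score) {
        try {
            JSONObject sei = new JSONObject();
            sei.put("type", "score");         // hypothetical message type tag
            sei.put("sentence", sentenceIndex);
            sei.put("score", score);
            trtcCloud.sendSEIMsg(sei.toString().getBytes(), 1);
        } catch (JSONException ignored) { }
    }
}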
Audience
The audience-side process is identical to that of the audience role in solo singing; refer to the audience implementation process above.
Key Business Logic
Accompaniment Synchronization Solution
In real-time chorus scenarios, it is necessary to synchronize the accompaniment progress in real time after the performance starts, to avoid increasing end-to-end latency due to accompaniment drift. Synchronizing the accompaniment requires NTP-based time synchronization, because the local clocks of different devices are not consistent and introduce errors; therefore, Tencent Cloud's self-developed NTP service is introduced. Additionally, users who join the chorus midway also need to synchronize the accompaniment progress; only after synchronizing the progress can they join in.
The approach to accompaniment synchronization is: the lead singer agrees, by convention, to start playing the accompaniment at a future point in time (for example, 3 seconds later), and the other users join the chorus at that time. All ends base their time on NTP time, which is synchronized after the TRTC SDK is initialized.
The specific process is as follows:
1. All ends calibrate NTP time and obtain the latest NTP time T from the TRTC cloud.
2. The lead singer sends the chorus signaling (a custom message), agreeing on the chorus start time T2.
3. Preload the accompaniment locally based on T2, and schedule playback.
4. The other chorus participants follow step 3 upon receiving the chorus signaling.
5. During playback, verify the local accompaniment playback progress, and perform a seek calibration when the difference between the expected progress (TE) and the current local progress (TC) exceeds 50 ms (a sketch follows the note below).
Note:
The 50 ms deviation here is a typical value; it can be adjusted based on what the business can tolerate, and keeping it around 50 ms is recommended.
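An illustrative Java sketch of step 5, assuming TXLiveBase.getNetworkTimestamp() returns the NTP-calibrated time and that the agreed start time T2 has been exchanged through the chorus signaling; the 50 ms threshold matches the typical value from the note above.
import com.tencent.liteav.audio.TXAudioEffectManager;
import com.tencent.rtmp.TXLiveBase;

// Illustrative drift check for the accompaniment: compare the expected progress derived
// from NTP time with the actual local playback position and seek when they diverge.
public class AccompanimentSync {
    private static final long MAX_DRIFT_MS = 50; // typical tolerance from the note above

    // Call this periodically after the agreed start time T2 has passed.
    public static void calibrate(TXAudioEffectManager effectManager, int musicId, long startTimeT2Ms) {
        long ntpNowMs = TXLiveBase.getNetworkTimestamp();       // NTP-calibrated time (assumed API)
        long expectedProgressMs = ntpNowMs - startTimeT2Ms;     // where playback should be (TE)
        long actualProgressMs = effectManager.getMusicCurrentPosInMS(musicId); // current position (TC)

        if (expectedProgressMs >= 0
                && Math.abs(expectedProgressMs - actualProgressMs) > MAX_DRIFT_MS) {
            // Drift exceeds the tolerance: seek the accompaniment back into sync.
            effectManager.seekMusicToPosInMS(musicId, (int) expectedProgressMs);
        }
    }
}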
Lyrics Synchronization Solution
In the lyrics synchronization solution, the actions of three different roles are as follows:
Lead Singer | Chorus | Audience |
NTP time synchronization; enable black frame insertion; send SEI messages; local lyric synchronization; update the lyrics control. | NTP time synchronization; local lyric synchronization; update the lyrics control. | NTP time synchronization; receive SEI messages; update the lyrics control. |
Among them, the lead singer and the chorus update the lyric progress locally based on the playback progress of the synchronized accompaniment; the audience needs to receive the SEI messages sent by the lead singer containing the latest lyric progress to update the local lyric progress. The overall process of the lyrics synchronization solution is shown in the following diagram.
Music Scoring Integration Solution
The music scoring feature is an indispensable feature in the online Karaoke scenario. You need to obtain the standardized audio file and the MIDI pitch file of the music resources in advance, and it is then recommended to use the music scoring feature of the Intelligent Music Platform to score the singer's vocals from the TRTC on-cloud recording. The overall process of the Music Scoring Integration Solution is shown in the following diagram.
2. The TRTC backend will upload the recorded singing clip media file to the COS bucket specified when initiating the recording task.
5. The Intelligent Music Platform reads the singing clip and the standard pitch file from the COS bucket for scoring.
6. The Intelligent Music Platform will write the JSON file containing the scoring results into the specified path in COS.
7. After the music scoring is completed, the Intelligent Music Platform will call back the music scoring results to the business backend.
8. The business backend reads the music scoring results JSON file from COS according to the callback path.
9. The business backend analyzes the music scoring results and displays the scoring results on the singer's App.
Note:
The input file format for the Intelligent Music Platform's music scoring should use MP3 or WAV. If the on-cloud recording file format is HLS or AAC, audio transcoding is required.
Best Practices for Audio Tuning Strategies
In the entire Karaoke scenario, audio quality is mainly affected by parameters such as the sampling rate, the number of channels, the bitrate, and 3A processing (echo cancellation, noise suppression, and automatic gain control). For different room scenarios, we recommend the audio parameters and volume mixing schemes below, as well as a commonly used vocal and accompaniment synchronization alignment solution.
1. Best Parameter Configuration for Different Scenarios
Scenario | App Scene | Audio Quality | System Volume Type | Experimental APIs |
Solo Singing | LIVE (video or CDN push required); VOICE_CHATROOM (audio-only, pure RTC) | MUSIC | TRTCSystemVolumeTypeMedia | enableBlackStream |
Real-Time Chorus | LIVE (video or CDN push required); VOICE_CHATROOM (audio-only, pure RTC) | MUSIC | TRTCSystemVolumeTypeMedia | enableBlackStream; enableChorus; setLowLatencyModeEnabled |
Chat and Listen to Music | VOICE_CHATROOM | DEFAULT | TRTCSystemVolumeTypeAuto | None |
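As a concrete illustration of the solo-singing row above, the following Java sketch (based on the TRTC Android SDK) enters the room with the LIVE scene, switches the system volume type to media, and captures audio with the MUSIC quality; the TRTCParams values come from your business system, and the exact constants should be checked against the SDK version you use.
import com.tencent.trtc.TRTCCloud;
import com.tencent.trtc.TRTCCloudDef;

// Illustrative application of the solo-singing parameters from the table above.
public class SoloSingingConfig {
    public static void enterKaraokeRoom(TRTCCloud trtcCloud, TRTCCloudDef.TRTCParams params) {
        // Use the LIVE scene when video or CDN push is required;
        // use TRTC_APP_SCENE_VOICE_CHATROOM for audio-only, pure RTC rooms.
        trtcCloud.enterRoom(params, TRTCCloudDef.TRTC_APP_SCENE_LIVE);

        // Media system volume type keeps full music quality on mobile devices.
        trtcCloud.setSystemVolumeType(TRTCCloudDef.TRTCSystemVolumeTypeMedia);

        // Capture the microphone with the MUSIC (full-band, high-bitrate) audio quality.
        trtcCloud.startLocalAudio(TRTCCloudDef.TRTC_AUDIO_QUALITY_MUSIC);
    }
}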
2. Best Volume Ratio for Different Scenarios
The TRTC SDK has default values for voice capture and music playback volumes. If, with the defaults, the accompaniment suppresses the voice in the live streaming room so that the voice is masked by the music, you can adjust the voice-to-music volume ratio according to the recommended values in the table below.
Scenario | Recommended Settings |
Solo Singing / Real-Time Chorus | Voice capture volume: 60; music playback volume: 50; enable reverb effect: Yes |
Chat and Listen to Music | Voice capture volume: 100; music playback volume: 30; enable reverb effect: No |
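One possible way to apply the singing-scenario ratio above with the TRTC Android SDK is sketched below; musicId is the ID passed to startPlayMusic, and the specific reverb type is only an example.
import com.tencent.liteav.audio.TXAudioEffectManager;
import com.tencent.trtc.TRTCCloud;

// Illustrative application of the recommended solo-singing/real-time-chorus volume ratio:
// voice capture at 60, music playback at 50, reverb enabled.
public class VolumeTuning {
    public static void applySingingRatio(TRTCCloud trtcCloud, int musicId) {
        TXAudioEffectManager effectManager = trtcCloud.getAudioEffectManager();

        // Lower the microphone capture volume so the accompaniment does not mask the vocals.
        trtcCloud.setAudioCaptureVolume(60);

        // Lower the accompaniment both locally and in the published stream.
        effectManager.setMusicPlayoutVolume(musicId, 50);
        effectManager.setMusicPublishVolume(musicId, 50);

        // Enable a KTV-style reverb on the captured voice
        // (the concrete reverb type here is only an example).
        effectManager.setVoiceReverbType(
                TXAudioEffectManager.TXVoiceReverbType.TXLiveVoiceReverbType_1);
    }
}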
3. Voice and Accompaniment Synchronization Alignment
Because of the jitter buffer for local vocal capture, the jitter buffer for song playback mixing, and the gap between when the human ear hears the accompaniment and when singing begins, the remote audience may perceive a certain latency and misalignment between the vocals, the accompaniment, and the lyrics, even if the singer sings entirely in sync with the lyrics and accompaniment. This issue can be mitigated through the following two methods.
Enable Chorus Mode
// Assemble the experimental API call that enables chorus mode.
JSONObject jsonObject = new JSONObject();
try {
    jsonObject.put("api", "enableChorus");
    JSONObject params = new JSONObject();
    params.put("enable", true);       // turn chorus mode on
    params.put("audioSource", 0);     // 0: vocal, 1: accompaniment (see the note below)
    jsonObject.put("params", params);
    mTRTCCloud.callExperimentalAPI(jsonObject.toString());
} catch (JSONException e) {
    e.printStackTrace();
}
NSDictionary *jsonDic = @{
@"api": @"enableChorus",
@"params": @{
@"enable": @(YES),
@"audioSource": @(0)
}
};
NSData *jsonData = [NSJSONSerialization dataWithJSONObject:jsonDic options:NSJSONWritingPrettyPrinted error:nil];
NSString *jsonString = [[NSString alloc] initWithData:jsonData encoding:NSUTF8StringEncoding];
[trtcCloud callExperimentalAPI:jsonString];
Note:
audioSource: 0 (Vocal), audioSource: 1 (Accompaniment); If using single instance streaming, set audioSource to 0 for all streams.
Enable Low Latency Mode
// Assemble the experimental API call that enables low latency mode.
JSONObject jsonObject = new JSONObject();
try {
    jsonObject.put("api", "setLowLatencyModeEnabled");
    JSONObject params = new JSONObject();
    params.put("enable", true);
    jsonObject.put("params", params);
    mTRTCCloud.callExperimentalAPI(jsonObject.toString());
} catch (JSONException e) {
    e.printStackTrace();
}
NSDictionary *jsonDic = @{
@"api": @"setLowLatencyModeEnabled",
@"params": @{
@"enable": @(1)
}
};
NSData *jsonData = [NSJSONSerialization dataWithJSONObject:jsonDic options:NSJSONWritingPrettyPrinted error:nil];
NSString *jsonString = [[NSString alloc] initWithData:jsonData encoding:NSUTF8StringEncoding];
[trtcCloud callExperimentalAPI:jsonString];
Scenario Gameplay
Solo Singing
After becoming a speaker, the audience can select songs and wait in line. After the song starts to play, they can sing solo. This game mode is relatively simple and can be achieved using TRTC single-instance mixing and streaming.
Real-Time Chorus
After becoming a speaker, the audience sings a song with the lead singer at the same time. This game mode is relatively complex. The lead singer side needs to use TRTC dual-instance streaming, and all ends also need to pay attention to accompaniment synchronization and lyric synchronization.
Mass Singing Competition
Users can choose song rooms of different categories according to their preferences. The room will randomly play music clips, and users in the room can grab the microphone at any time to sing the music clips.
Segmental Singing
The same song is divided into segments and assigned to different speakers. After the lead singer sings a segment, other speakers sing the assigned music clips respectively.
Cross-Room Singing Competition
Anchors from different rooms sing, and the audience in their respective rooms helps their anchors. In addition to the Karaoke scenario, this game mode also involves TRTC cross-room competition, and it is necessary to pay attention to the subscription logic of audio and video streams in different rooms.
Supporting Products for the Solution
Layer | Product | Description |
Access Layer | Tencent Real-Time Communication (TRTC) | Provides a low-latency, high-quality real-time interactive live streaming solution for multi-person audio, which is the basic foundation of the online Karaoke scenario. |
Access Layer | Instant Messaging (IM) | Provides room management and seat management capabilities based on group features, and enables the sending and receiving of rich media messages such as room-wide messages and public screen messages, as well as custom signaling and other communication needs. |
Access Layer | Intelligent Music Platform | Based on the self-developed music understanding technology of Tencent Media Lab, it helps users deeply understand, analyze, and create music, and provides capabilities such as lyric recognition, intelligent composition, song recognition, and music scoring. |