fix customvideo
parent 391512f68c
commit a9e9cfb220
169
README.md
|
@ -57,54 +57,37 @@ export HF_ENDPOINT=https://hf-mirror.com
|
||||||
Note: the server must open ports tcp:8000,8010,1985; udp:8000
|
Note: the server must open ports tcp:8000,8010,1985; udp:8000
|
||||||
|
|
||||||
## 3. More Usage
|
## 3. More Usage
|
||||||
### 3.1 Digital human conversation with an LLM
|
Choose the digital human model, the transport method, and the TTS model separately
|
||||||
|
|
||||||
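This currently follows the approach of the digital human dialogue system [LinlyTalker](https://github.com/Kedreamix/Linly-Talker). Supported LLMs are ChatGPT, Qwen, and GeminiPro. Fill in your own api_key in app.py.
|
### 3.1 Digital human model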
|
||||||
|
Three models are supported: ernerf, musetalk, and wav2lip; the default is ernerf
|
||||||
Open http://serverip:8010/rtcpushchat.html in a browser
|
#### 3.1.1 ER-Nerf
|
||||||
|
|
||||||
### 3.2 Voice cloning
|
|
||||||
Either of the two services below can be used; gpt-sovits is recommended
|
|
||||||
#### 3.2.1 gpt-sovits
|
|
||||||
For service deployment, see [gpt-sovits](/tts/README.md)
|
|
||||||
Run
|
|
||||||
```
|
```
|
||||||
python app.py --tts gpt-sovits --TTS_SERVER http://127.0.0.1:9880 --REF_FILE data/ref.wav --REF_TEXT xxx
|
python app.py --model ernerf
|
||||||
```
|
```
|
||||||
REF_TEXT is the transcript of the speech in REF_FILE; keep the clip short
|
The following parameters are supported
|
||||||
|
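##### 3.1.1.1 Using hubert for audio features
|
||||||
#### 3.2.2 xtts
|
wav2lip is used by default. If the model was trained with hubert audio features, start the digital human with the following command
|
||||||
Run the xtts service; see https://github.com/coqui-ai/xtts-streaming-server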
|
|
||||||
```
|
|
||||||
docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 9000:80 ghcr.io/coqui-ai/xtts-streaming-server:latest
|
|
||||||
```
|
|
||||||
Then run the following, where ref.wav is the voice file to clone
|
|
||||||
```
|
|
||||||
python app.py --tts xtts --REF_FILE data/ref.wav --TTS_SERVER http://localhost:9000
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.3 Using hubert for audio features
|
|
||||||
If the model was trained with hubert audio features, start the digital human with the following command
|
|
||||||
```
|
```
|
||||||
python app.py --asr_model facebook/hubert-large-ls960-ft
|
python app.py --asr_model facebook/hubert-large-ls960-ft
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3.4 Setting the background image
|
##### 3.1.1.2 Setting the head background image
|
||||||
```
|
```
|
||||||
python app.py --bg_img bc.jpg
|
python app.py --bg_img bc.jpg
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3.5 Full-body video stitching
|
##### 3.1.1.3 Pasting back onto the full-body video
|
||||||
#### 3.5.1 Crop the training video
|
- 1. Crop the training video
|
||||||
```
|
```
|
||||||
ffmpeg -i fullbody.mp4 -vf crop="400:400:100:5" train.mp4
|
ffmpeg -i fullbody.mp4 -vf crop="400:400:100:5" train.mp4
|
||||||
```
|
```
|
||||||
Train the model with train.mp4
|
Train the model with train.mp4
|
||||||
#### 3.5.2 Extract full-body images
|
- 2. Extract full-body images
|
||||||
```
|
```
|
||||||
ffmpeg -i fullbody.mp4 -vf fps=25 -qmin 1 -q:v 1 -start_number 0 data/fullbody/img/%d.jpg
|
ffmpeg -i fullbody.mp4 -vf fps=25 -qmin 1 -q:v 1 -start_number 0 data/fullbody/img/%d.jpg
|
||||||
```
|
```
|
||||||
#### 3.5.2 Start the digital human
|
- 3. Start the digital human
|
||||||
```
|
```
|
||||||
python app.py --fullbody --fullbody_img data/fullbody/img --fullbody_offset_x 100 --fullbody_offset_y 5 --fullbody_width 580 --fullbody_height 1080 --W 400 --H 400
|
python app.py --fullbody --fullbody_img data/fullbody/img --fullbody_offset_x 100 --fullbody_offset_y 5 --fullbody_width 580 --fullbody_height 1080 --W 400 --H 400
|
||||||
```
|
```
|
||||||
|
@ -112,39 +95,7 @@ python app.py --fullbody --fullbody_img data/fullbody/img --fullbody_offset_x 10
|
||||||
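- --W, --H: width and height of the training video
|
- --W, --H: width and height of the training video
|
||||||
- If the torso (step 3 of ernerf training) is poorly trained, a seam appears at the splice. You can append --torso_imgs data/xxx/torso_imgs to the command above so that the torso is taken directly from the torso images in the training dataset instead of model inference. This approach may leave some artifacts around the head and neck.
|
- If the torso (step 3 of ernerf training) is poorly trained, a seam appears at the splice. You can append --torso_imgs data/xxx/torso_imgs to the command above so that the torso is taken directly from the torso images in the training dataset instead of model inference. This approach may leave some artifacts around the head and neck.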
|
||||||
|
|
||||||
### 3.6 Substituting a custom video while not speaking
|
#### 3.1.2 Using the musetalk model
|
||||||
- Extract images from the custom video
|
|
||||||
```
|
|
||||||
ffmpeg -i silence.mp4 -vf fps=25 -qmin 1 -q:v 1 -start_number 0 data/customvideo/img/%d.png
|
|
||||||
```
|
|
||||||
- Run the digital human
|
|
||||||
```
|
|
||||||
python app.py --customvideo --customvideo_img data/customvideo/img --customvideo_imgnum 100
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.7 webrtc p2p
|
|
||||||
This mode does not require srs
|
|
||||||
```
|
|
||||||
python app.py --transport webrtc
|
|
||||||
```
|
|
||||||
The server must open ports tcp:8010; udp:50000~60000
|
|
||||||
Open http://serverip:8010/webrtcapi.html in a browser
|
|
||||||
|
|
||||||
### 3.8 Pushing rtmp to srs
|
|
||||||
- Install the rtmpstream library
|
|
||||||
See https://github.com/lipku/python_rtmpstream
|
|
||||||
|
|
||||||
- Start srs
|
|
||||||
```
|
|
||||||
docker run --rm -it -p 1935:1935 -p 1985:1985 -p 8080:8080 registry.cn-hangzhou.aliyuncs.com/ossrs/srs:5
|
|
||||||
```
|
|
||||||
- Run the digital human
|
|
||||||
```python
|
|
||||||
python app.py --transport rtmp --push_url 'rtmp://localhost/live/livestream'
|
|
||||||
```
|
|
||||||
Open http://serverip:8010/echoapi.html in a browser
|
|
||||||
|
|
||||||
### 3.9 Using the musetalk model
|
|
||||||
rtmp push is not supported yet
|
rtmp push is not supported yet
|
||||||
- Install dependencies
|
- Install dependencies
|
||||||
```bash
|
```bash
|
||||||
|
@ -163,7 +114,7 @@ mim install "mmpose>=1.1.0"
|
||||||
python app.py --model musetalk --transport webrtc
|
python app.py --model musetalk --transport webrtc
|
||||||
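Open http://serverip:8010/webrtcapi.html in a browser
|
Open http://serverip:8010/webrtcapi.html in a browser
|
||||||
Set --batch_size to improve GPU utilization, and --avatar_id to run a different digital human
|
Set --batch_size to improve GPU utilization, and --avatar_id to run a different digital human
|
||||||
#### Replace with your own digital human
|
##### Replace with your own digital human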
|
||||||
```bash
|
```bash
|
||||||
git clone https://github.com/TMElyralab/MuseTalk.git
|
git clone https://github.com/TMElyralab/MuseTalk.git
|
||||||
cd MuseTalk
|
cd MuseTalk
|
||||||
|
@ -177,7 +128,7 @@ python simple_musetalk.py --avatar_id 4 --file D:\\ok\\test.mp4
|
||||||
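Both video and image inputs are supported; the result is generated automatically in the avatars directory under data
|
Both video and image inputs are supported; the result is generated automatically in the avatars directory under data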
|
||||||
```
|
```
|
||||||
|
|
||||||
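### 3.10 Using the wav2lip model
|
#### 3.1.3 Using the wav2lip model
|
||||||
rtmp push is not supported yet
|
rtmp push is not supported yet
|
||||||
- Download the models
|
- Download the models
|
||||||
Download the models that wav2lip needs to run. Link: https://pan.baidu.com/s/1yOsQ06-RIDTJd3HFCw4wtA Password: ltua
|
Download the models that wav2lip needs to run. Link: https://pan.baidu.com/s/1yOsQ06-RIDTJd3HFCw4wtA Password: ltua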
|
||||||
|
@ -187,12 +138,96 @@ python simple_musetalk.py --avatar_id 4 --file D:\\ok\\test.mp4
|
||||||
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip_avatar1
|
python app.py --transport webrtc --model wav2lip --avatar_id wav2lip_avatar1
|
||||||
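Open http://serverip:8010/webrtcapi.html in a browser
|
Open http://serverip:8010/webrtcapi.html in a browser
|
||||||
Set --batch_size to improve GPU utilization, and --avatar_id to run a different digital human
|
Set --batch_size to improve GPU utilization, and --avatar_id to run a different digital human
|
||||||
#### Replace with your own digital human
|
##### Replace with your own digital human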
|
||||||
```bash
|
```bash
|
||||||
cd wav2lip
|
cd wav2lip
|
||||||
python genavatar.py --video_path xxx.mp4
|
python genavatar.py --video_path xxx.mp4
|
||||||
After it finishes, copy the files under results/avatars into this project's data/avatars
|
After it finishes, copy the files under results/avatars into this project's data/avatars
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### 3.2 Transport mode
|
||||||
|
webrtc, rtcpush, and rtmp are supported; the default is rtcpush
|
||||||
|
#### 3.2.1 webrtc p2p
|
||||||
|
This mode does not require srs
|
||||||
|
```
|
||||||
|
python app.py --transport webrtc
|
||||||
|
```
|
||||||
|
The server must open ports tcp:8010; udp:50000~60000
|
||||||
|
Open http://serverip:8010/webrtcapi.html in a browser
|
||||||
|
|
||||||
|
#### 3.2.2 Pushing webrtc to srs
|
||||||
|
- Start srs
|
||||||
|
```
|
||||||
|
export CANDIDATE='<server public IP>'
|
||||||
|
docker run --rm --env CANDIDATE=$CANDIDATE \
|
||||||
|
-p 1935:1935 -p 8080:8080 -p 1985:1985 -p 8000:8000/udp \
|
||||||
|
registry.cn-hangzhou.aliyuncs.com/ossrs/srs:5 \
|
||||||
|
objs/srs -c conf/rtc.conf
|
||||||
|
```
|
||||||
|
- Run the digital human
|
||||||
|
```python
|
||||||
|
python app.py --transport rtcpush --push_url 'http://localhost:1985/rtc/v1/whip/?app=live&stream=livestream'
|
||||||
|
```
|
||||||
|
Open http://serverip:8010/rtcpushapi.html in a browser
|
||||||
|
|
||||||
|
#### 3.2.3 Pushing rtmp to srs
|
||||||
|
- Install the rtmpstream library
|
||||||
|
See https://github.com/lipku/python_rtmpstream
|
||||||
|
|
||||||
|
- Start srs
|
||||||
|
```
|
||||||
|
docker run --rm -it -p 1935:1935 -p 1985:1985 -p 8080:8080 registry.cn-hangzhou.aliyuncs.com/ossrs/srs:5
|
||||||
|
```
|
||||||
|
- Run the digital human
|
||||||
|
```python
|
||||||
|
python app.py --transport rtmp --push_url 'rtmp://localhost/live/livestream'
|
||||||
|
```
|
||||||
|
Open http://serverip:8010/echoapi.html in a browser
|
||||||
|
|
||||||
|
### 3.3 TTS model
|
||||||
|
edgetts, gpt-sovits, and xtts are supported; the default is edgetts
|
||||||
|
#### 3.3.1 gpt-sovits
|
||||||
|
For service deployment, see [gpt-sovits](/tts/README.md)
|
||||||
|
Run
|
||||||
|
```
|
||||||
|
python app.py --tts gpt-sovits --TTS_SERVER http://127.0.0.1:9880 --REF_FILE data/ref.wav --REF_TEXT xxx
|
||||||
|
```
|
||||||
|
REF_TEXT is the transcript of the speech in REF_FILE; keep the clip short
|
||||||
|
|
||||||
|
#### 3.3.2 xtts
|
||||||
|
Run the xtts service; see https://github.com/coqui-ai/xtts-streaming-server
|
||||||
|
```
|
||||||
|
docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 9000:80 ghcr.io/coqui-ai/xtts-streaming-server:latest
|
||||||
|
```
|
||||||
|
Then run the following, where ref.wav is the voice file to clone
|
||||||
|
```
|
||||||
|
python app.py --tts xtts --REF_FILE data/ref.wav --TTS_SERVER http://localhost:9000
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3.4 Video arrangement
|
||||||
|
- 1. Generate the assets
|
||||||
|
```
|
||||||
|
ffmpeg -i xxx.mp4 -s 576x768 -vf fps=25 -qmin 1 -q:v 1 -start_number 0 data/customvideo/image/%08d.png
|
||||||
|
ffmpeg -i xxx.mp4 -vn -acodec pcm_s16le -ac 1 -ar 16000 data/customvideo/audio.wav
|
||||||
|
```
|
||||||
|
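Here -s must match the output video size
|
||||||
|
- 2. Edit data/custom_config.json
|
||||||
|
Specify imgpath and audiopath.
|
||||||
|
Set audiotype: 0 denotes the inference video and needs no configuration; 1 denotes the silent video, which falls back to the inference video if not configured; values of 2 and above are custom entries
|
||||||
|
- 3. Run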
|
||||||
|
```
|
||||||
|
python app.py --transport webrtc --customvideo_config data/custom_config.json
|
||||||
|
```
|
||||||
|
- 4. Open http://<serverip>:8010/webrtcapi-custom.html
|
||||||
|
Enter the audiotype configured in custom_config.json and click to switch the video
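Step 2 of the video arrangement section names three fields (`imgpath`, `audiopath`, `audiotype`) but does not show the file layout. The sketch below generates one plausible `data/custom_config.json` body. The list-of-objects layout is an assumption inferred from the `for item in ...: item['audiotype']` loop visible in the basereal.py hunk, and `build_custom_config` is a hypothetical helper, not part of the project.

```python
import json


def build_custom_config(entries):
    """Return JSON text for a custom video config (hypothetical helper).

    entries: iterable of (audiotype, imgpath, audiopath) tuples.
    Assumed layout: a JSON list with one object per custom video.
    """
    config = []
    for audiotype, imgpath, audiopath in entries:
        # Per the README, 0 is the inference video and is not configured here.
        if audiotype < 1:
            raise ValueError("audiotype 0 is reserved for the inference video")
        config.append({
            "audiotype": audiotype,
            "imgpath": imgpath,
            "audiopath": audiopath,
        })
    return json.dumps(config, indent=2)


# Paths match the ffmpeg output locations used in step 1.
print(build_custom_config([
    (2, "data/customvideo/image", "data/customvideo/audio.wav"),
]))
```

Check the generated text against the project's own sample config before relying on it, since the real schema may carry additional keys.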
|
||||||
|
|
||||||
|
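### 3.5 Digital human conversation with an LLM
|
||||||
|
|
||||||
|
This currently follows the approach of the digital human dialogue system [LinlyTalker](https://github.com/Kedreamix/Linly-Talker). Supported LLMs are ChatGPT, Qwen, and GeminiPro. Fill in your own api_key in app.py.
|
||||||
|
|
||||||
|
Open http://serverip:8010/rtcpushchat.html in a browser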
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## 4. Docker Run
|
## 4. Docker Run
|
||||||
No prior installation is needed; run it directly.
|
No prior installation is needed; run it directly.
|
||||||
|
@ -206,7 +241,7 @@ https://www.codewithgpu.com/i/lipku/metahuman-stream/base
|
||||||
[autodl tutorial](autodl/README.md)
|
[autodl tutorial](autodl/README.md)
|
||||||
|
|
||||||
|
|
||||||
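## 5. Digital human model files
|
## 5. ernerf digital human model files
|
||||||
Can be replaced with a model you trained yourself (https://github.com/Fictionarry/ER-NeRF)
|
Can be replaced with a model you trained yourself (https://github.com/Fictionarry/ER-NeRF)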
|
||||||
```python
|
```python
|
||||||
.
|
.
|
||||||
|
|
6
app.py
|
@ -316,9 +316,9 @@ if __name__ == '__main__':
|
||||||
parser.add_argument('--bbox_shift', type=int, default=5)
|
parser.add_argument('--bbox_shift', type=int, default=5)
|
||||||
parser.add_argument('--batch_size', type=int, default=16)
|
parser.add_argument('--batch_size', type=int, default=16)
|
||||||
|
|
||||||
parser.add_argument('--customvideo', action='store_true', help="custom video")
|
# parser.add_argument('--customvideo', action='store_true', help="custom video")
|
||||||
parser.add_argument('--customvideo_img', type=str, default='data/customvideo/img')
|
# parser.add_argument('--customvideo_img', type=str, default='data/customvideo/img')
|
||||||
parser.add_argument('--customvideo_imgnum', type=int, default=1)
|
# parser.add_argument('--customvideo_imgnum', type=int, default=1)
|
||||||
|
|
||||||
parser.add_argument('--customvideo_config', type=str, default='')
|
parser.add_argument('--customvideo_config', type=str, default='')
|
||||||
|
|
||||||
|
|
21
basereal.py
|
@ -15,6 +15,8 @@ from threading import Thread, Event
|
||||||
from io import BytesIO
|
from io import BytesIO
|
||||||
import soundfile as sf
|
import soundfile as sf
|
||||||
|
|
||||||
|
from ttsreal import EdgeTTS,VoitsTTS,XTTS
|
||||||
|
|
||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
def read_imgs(img_list):
|
def read_imgs(img_list):
|
||||||
frames = []
|
frames = []
|
||||||
|
@ -30,6 +32,13 @@ class BaseReal:
|
||||||
self.sample_rate = 16000
|
self.sample_rate = 16000
|
||||||
self.chunk = self.sample_rate // opt.fps # 320 samples per chunk (20ms * 16000 / 1000)
|
self.chunk = self.sample_rate // opt.fps # 320 samples per chunk (20ms * 16000 / 1000)
|
||||||
|
|
||||||
|
if opt.tts == "edgetts":
|
||||||
|
self.tts = EdgeTTS(opt,self)
|
||||||
|
elif opt.tts == "gpt-sovits":
|
||||||
|
self.tts = VoitsTTS(opt,self)
|
||||||
|
elif opt.tts == "xtts":
|
||||||
|
self.tts = XTTS(opt,self)
|
||||||
|
|
||||||
self.curr_state=0
|
self.curr_state=0
|
||||||
self.custom_img_cycle = {}
|
self.custom_img_cycle = {}
|
||||||
self.custom_audio_cycle = {}
|
self.custom_audio_cycle = {}
|
||||||
|
@ -48,7 +57,14 @@ class BaseReal:
|
||||||
self.custom_audio_index[item['audiotype']] = 0
|
self.custom_audio_index[item['audiotype']] = 0
|
||||||
self.custom_index[item['audiotype']] = 0
|
self.custom_index[item['audiotype']] = 0
|
||||||
self.custom_opt[item['audiotype']] = item
|
self.custom_opt[item['audiotype']] = item
|
||||||
|
|
||||||
|
def init_customindex(self):
|
||||||
|
self.curr_state=0
|
||||||
|
for key in self.custom_audio_index:
|
||||||
|
self.custom_audio_index[key]=0
|
||||||
|
for key in self.custom_index:
|
||||||
|
self.custom_index[key]=0
|
||||||
|
|
||||||
def mirror_index(self,size, index):
|
def mirror_index(self,size, index):
|
||||||
#size = len(self.coord_list_cycle)
|
#size = len(self.coord_list_cycle)
|
||||||
turn = index // size
|
turn = index // size
|
||||||
|
@ -62,11 +78,12 @@ class BaseReal:
|
||||||
idx = self.custom_audio_index[audiotype]
|
idx = self.custom_audio_index[audiotype]
|
||||||
stream = self.custom_audio_cycle[audiotype][idx:idx+self.chunk]
|
stream = self.custom_audio_cycle[audiotype][idx:idx+self.chunk]
|
||||||
self.custom_audio_index[audiotype] += self.chunk
|
self.custom_audio_index[audiotype] += self.chunk
|
||||||
if self.custom_audio_index[audiotype]>=stream.shape[0]:
|
if self.custom_audio_index[audiotype]>=self.custom_audio_cycle[audiotype].shape[0]:
|
||||||
self.curr_state = 1 # current video does not loop; switch to the silent state
|
self.curr_state = 1 # current video does not loop; switch to the silent state
|
||||||
return stream
|
return stream
|
||||||
|
|
||||||
def set_curr_state(self,audiotype, reinit):
|
def set_curr_state(self,audiotype, reinit):
|
||||||
|
print('set_curr_state:',audiotype)
|
||||||
self.curr_state = audiotype
|
self.curr_state = audiotype
|
||||||
if reinit:
|
if reinit:
|
||||||
self.custom_audio_index[audiotype] = 0
|
self.custom_audio_index[audiotype] = 0
|
||||||
|
|
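The basereal.py hunks above carry the core of this fix: the end-of-clip test in `get_audio_stream` now compares the running index with the length of the whole custom audio clip rather than with the chunk just sliced, and custom frames are indexed through `mirror_index(size, index)`. A minimal, self-contained sketch of both behaviours follows. `CustomAudioCursor` is a hypothetical stand-in for the bookkeeping in `BaseReal`, plain lists replace the NumPy arrays used in the source, and the `mirror_index` body is taken from the lines removed from nerfreal.py.

```python
def mirror_index(size: int, index: int) -> int:
    """Map a growing counter onto 0..size-1, forward then backward (ping-pong)."""
    turn = index // size
    res = index % size
    if turn % 2 == 0:
        return res
    return size - res - 1


class CustomAudioCursor:
    """Hypothetical stand-in for the custom-audio state kept by BaseReal."""

    def __init__(self, clip, chunk, audiotype=2):
        self.clip = clip              # full custom audio clip (list of samples)
        self.chunk = chunk            # samples handed out per call
        self.index = 0                # read position inside the clip
        self.curr_state = audiotype   # 2 and above: playing a custom video

    def next_chunk(self):
        stream = self.clip[self.index:self.index + self.chunk]
        self.index += self.chunk
        # Fixed check: compare with the whole clip. Comparing with
        # len(stream) would end playback right after the first chunk.
        if self.index >= len(self.clip):
            self.curr_state = 1       # clip finished, fall back to silence
        return stream


cursor = CustomAudioCursor([0.0] * 1000, chunk=320)
while cursor.curr_state != 1:
    cursor.next_chunk()
print(cursor.index)
```

With the pre-fix comparison, the index (320 after the first call) already equals the slice length, so a custom clip would stop almost immediately.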
|
@ -166,12 +166,6 @@ class LipReal(BaseReal):
|
||||||
|
|
||||||
self.asr = LipASR(opt,self)
|
self.asr = LipASR(opt,self)
|
||||||
self.asr.warm_up()
|
self.asr.warm_up()
|
||||||
if opt.tts == "edgetts":
|
|
||||||
self.tts = EdgeTTS(opt,self)
|
|
||||||
elif opt.tts == "gpt-sovits":
|
|
||||||
self.tts = VoitsTTS(opt,self)
|
|
||||||
elif opt.tts == "xtts":
|
|
||||||
self.tts = XTTS(opt,self)
|
|
||||||
#self.__warm_up()
|
#self.__warm_up()
|
||||||
|
|
||||||
self.render_event = mp.Event()
|
self.render_event = mp.Event()
|
||||||
|
@ -257,6 +251,7 @@ class LipReal(BaseReal):
|
||||||
# self.asr.warm_up()
|
# self.asr.warm_up()
|
||||||
|
|
||||||
self.tts.render(quit_event)
|
self.tts.render(quit_event)
|
||||||
|
self.init_customindex()
|
||||||
process_thread = Thread(target=self.process_frames, args=(quit_event,loop,audio_track,video_track))
|
process_thread = Thread(target=self.process_frames, args=(quit_event,loop,audio_track,video_track))
|
||||||
process_thread.start()
|
process_thread.start()
|
||||||
|
|
||||||
|
|
|
@ -8,8 +8,8 @@ from baseasr import BaseASR
|
||||||
from musetalk.whisper.audio2feature import Audio2Feature
|
from musetalk.whisper.audio2feature import Audio2Feature
|
||||||
|
|
||||||
class MuseASR(BaseASR):
|
class MuseASR(BaseASR):
|
||||||
def __init__(self, opt, audio_processor:Audio2Feature):
|
def __init__(self, opt, parent,audio_processor:Audio2Feature):
|
||||||
super().__init__(opt)
|
super().__init__(opt,parent)
|
||||||
self.audio_processor = audio_processor
|
self.audio_processor = audio_processor
|
||||||
|
|
||||||
def run_step(self):
|
def run_step(self):
|
||||||
|
|
27
musereal.py
|
@ -27,6 +27,7 @@ from ttsreal import EdgeTTS,VoitsTTS,XTTS
|
||||||
from museasr import MuseASR
|
from museasr import MuseASR
|
||||||
import asyncio
|
import asyncio
|
||||||
from av import AudioFrame, VideoFrame
|
from av import AudioFrame, VideoFrame
|
||||||
|
from basereal import BaseReal
|
||||||
|
|
||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
def read_imgs(img_list):
|
def read_imgs(img_list):
|
||||||
|
@ -125,9 +126,10 @@ def inference(render_event,batch_size,latents_out_path,audio_feat_queue,audio_ou
|
||||||
print('musereal inference processor stop')
|
print('musereal inference processor stop')
|
||||||
|
|
||||||
@torch.no_grad()
|
@torch.no_grad()
|
||||||
class MuseReal:
|
class MuseReal(BaseReal):
|
||||||
def __init__(self, opt):
|
def __init__(self, opt):
|
||||||
self.opt = opt # shared with the trainer's opt to support in-place modification of rendering parameters.
|
super().__init__(opt)
|
||||||
|
#self.opt = opt # shared with the trainer's opt to support in-place modification of rendering parameters.
|
||||||
self.W = opt.W
|
self.W = opt.W
|
||||||
self.H = opt.H
|
self.H = opt.H
|
||||||
|
|
||||||
|
@ -156,14 +158,8 @@ class MuseReal:
|
||||||
self.__loadmodels()
|
self.__loadmodels()
|
||||||
self.__loadavatar()
|
self.__loadavatar()
|
||||||
|
|
||||||
self.asr = MuseASR(opt,self.audio_processor)
|
self.asr = MuseASR(opt,self,self.audio_processor)
|
||||||
self.asr.warm_up()
|
self.asr.warm_up()
|
||||||
if opt.tts == "edgetts":
|
|
||||||
self.tts = EdgeTTS(opt,self)
|
|
||||||
elif opt.tts == "gpt-sovits":
|
|
||||||
self.tts = VoitsTTS(opt,self)
|
|
||||||
elif opt.tts == "xtts":
|
|
||||||
self.tts = XTTS(opt,self)
|
|
||||||
#self.__warm_up()
|
#self.__warm_up()
|
||||||
|
|
||||||
self.render_event = mp.Event()
|
self.render_event = mp.Event()
|
||||||
|
@ -246,8 +242,16 @@ class MuseReal:
|
||||||
res_frame,idx,audio_frames = self.res_frame_queue.get(block=True, timeout=1)
|
res_frame,idx,audio_frames = self.res_frame_queue.get(block=True, timeout=1)
|
||||||
except queue.Empty:
|
except queue.Empty:
|
||||||
continue
|
continue
|
||||||
if audio_frames[0][1]==1 and audio_frames[1][1]==1: # all silent data; only the full image is needed
|
if audio_frames[0][1]!=0 and audio_frames[1][1]!=0: # all silent data; only the full image is needed
|
||||||
combine_frame = self.frame_list_cycle[idx]
|
audiotype = audio_frames[0][1]
|
||||||
|
if self.custom_index.get(audiotype) is not None: # a custom video is configured
|
||||||
|
mirindex = self.mirror_index(len(self.custom_img_cycle[audiotype]),self.custom_index[audiotype])
|
||||||
|
combine_frame = self.custom_img_cycle[audiotype][mirindex]
|
||||||
|
self.custom_index[audiotype] += 1
|
||||||
|
# if not self.custom_opt[audiotype].loop and self.custom_index[audiotype]>=len(self.custom_img_cycle[audiotype]):
|
||||||
|
# self.curr_state = 1 # current video does not loop; switch to the silent state
|
||||||
|
else:
|
||||||
|
combine_frame = self.frame_list_cycle[idx]
|
||||||
else:
|
else:
|
||||||
bbox = self.coord_list_cycle[idx]
|
bbox = self.coord_list_cycle[idx]
|
||||||
ori_frame = copy.deepcopy(self.frame_list_cycle[idx])
|
ori_frame = copy.deepcopy(self.frame_list_cycle[idx])
|
||||||
|
@ -283,6 +287,7 @@ class MuseReal:
|
||||||
# self.asr.warm_up()
|
# self.asr.warm_up()
|
||||||
|
|
||||||
self.tts.render(quit_event)
|
self.tts.render(quit_event)
|
||||||
|
self.init_customindex()
|
||||||
process_thread = Thread(target=self.process_frames, args=(quit_event,loop,audio_track,video_track))
|
process_thread = Thread(target=self.process_frames, args=(quit_event,loop,audio_track,video_track))
|
||||||
process_thread.start()
|
process_thread.start()
|
||||||
|
|
||||||
|
|
|
@ -12,9 +12,9 @@ from threading import Thread, Event
|
||||||
|
|
||||||
from baseasr import BaseASR
|
from baseasr import BaseASR
|
||||||
|
|
||||||
class ASR(BaseASR):
|
class NerfASR(BaseASR):
|
||||||
def __init__(self, opt):
|
def __init__(self, opt, parent):
|
||||||
super().__init__(opt)
|
super().__init__(opt,parent)
|
||||||
|
|
||||||
self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
|
||||||
if 'esperanto' in self.opt.asr_model:
|
if 'esperanto' in self.opt.asr_model:
|
||||||
|
@ -66,8 +66,12 @@ class ASR(BaseASR):
|
||||||
type = 0
|
type = 0
|
||||||
#print(f'[INFO] get frame {frame.shape}')
|
#print(f'[INFO] get frame {frame.shape}')
|
||||||
except queue.Empty:
|
except queue.Empty:
|
||||||
frame = np.zeros(self.chunk, dtype=np.float32)
|
if self.parent and self.parent.curr_state>1: # play custom audio
|
||||||
type = 1
|
frame = self.parent.get_audio_stream(self.parent.curr_state)
|
||||||
|
type = self.parent.curr_state
|
||||||
|
else:
|
||||||
|
frame = np.zeros(self.chunk, dtype=np.float32)
|
||||||
|
type = 1
|
||||||
|
|
||||||
return frame,type
|
return frame,type
|
||||||
|
|
94
nerfreal.py
|
@ -9,15 +9,17 @@ import time
|
||||||
import torch.nn.functional as F
|
import torch.nn.functional as F
|
||||||
import cv2
|
import cv2
|
||||||
|
|
||||||
from asrreal import ASR
|
from nerfasr import NerfASR
|
||||||
from ttsreal import EdgeTTS,VoitsTTS,XTTS
|
from ttsreal import EdgeTTS,VoitsTTS,XTTS
|
||||||
|
|
||||||
import asyncio
|
import asyncio
|
||||||
from av import AudioFrame, VideoFrame
|
from av import AudioFrame, VideoFrame
|
||||||
|
from basereal import BaseReal
|
||||||
|
|
||||||
class NeRFReal:
|
class NeRFReal(BaseReal):
|
||||||
def __init__(self, opt, trainer, data_loader, debug=True):
|
def __init__(self, opt, trainer, data_loader, debug=True):
|
||||||
self.opt = opt # shared with the trainer's opt to support in-place modification of rendering parameters.
|
super().__init__(opt)
|
||||||
|
#self.opt = opt # shared with the trainer's opt to support in-place modification of rendering parameters.
|
||||||
self.W = opt.W
|
self.W = opt.W
|
||||||
self.H = opt.H
|
self.H = opt.H
|
||||||
|
|
||||||
|
@ -55,17 +57,11 @@ class NeRFReal:
|
||||||
#self.ind_index = 0
|
#self.ind_index = 0
|
||||||
#self.ind_num = trainer.model.individual_codes.shape[0]
|
#self.ind_num = trainer.model.individual_codes.shape[0]
|
||||||
|
|
||||||
self.customimg_index = 0
|
#self.customimg_index = 0
|
||||||
|
|
||||||
# build asr
|
# build asr
|
||||||
self.asr = ASR(opt)
|
self.asr = NerfASR(opt,self)
|
||||||
self.asr.warm_up()
|
self.asr.warm_up()
|
||||||
if opt.tts == "edgetts":
|
|
||||||
self.tts = EdgeTTS(opt,self)
|
|
||||||
elif opt.tts == "gpt-sovits":
|
|
||||||
self.tts = VoitsTTS(opt,self)
|
|
||||||
elif opt.tts == "xtts":
|
|
||||||
self.tts = XTTS(opt,self)
|
|
||||||
|
|
||||||
'''
|
'''
|
||||||
video_path = 'video_stream'
|
video_path = 'video_stream'
|
||||||
|
@ -124,14 +120,14 @@ class NeRFReal:
|
||||||
self.asr.pause_talk()
|
self.asr.pause_talk()
|
||||||
|
|
||||||
|
|
||||||
def mirror_index(self, index):
|
# def mirror_index(self, index):
|
||||||
size = self.opt.customvideo_imgnum
|
# size = self.opt.customvideo_imgnum
|
||||||
turn = index // size
|
# turn = index // size
|
||||||
res = index % size
|
# res = index % size
|
||||||
if turn % 2 == 0:
|
# if turn % 2 == 0:
|
||||||
return res
|
# return res
|
||||||
else:
|
# else:
|
||||||
return size - res - 1
|
# return size - res - 1
|
||||||
|
|
||||||
def test_step(self,loop=None,audio_track=None,video_track=None):
|
def test_step(self,loop=None,audio_track=None,video_track=None):
|
||||||
|
|
||||||
|
@ -148,39 +144,57 @@ class NeRFReal:
|
||||||
# use the live audio stream
|
# use the live audio stream
|
||||||
data['auds'] = self.asr.get_next_feat()
|
data['auds'] = self.asr.get_next_feat()
|
||||||
|
|
||||||
audiotype = 0
|
audiotype1 = 0
|
||||||
if self.opt.transport=='rtmp':
|
audiotype2 = 0
|
||||||
for _ in range(2):
|
#send audio
|
||||||
frame,type = self.asr.get_audio_out()
|
for i in range(2):
|
||||||
audiotype += type
|
frame,type = self.asr.get_audio_out()
|
||||||
#print(f'[INFO] get_audio_out shape ',frame.shape)
|
if i==0:
|
||||||
|
audiotype1 = type
|
||||||
|
else:
|
||||||
|
audiotype2 = type
|
||||||
|
#print(f'[INFO] get_audio_out shape ',frame.shape)
|
||||||
|
if self.opt.transport=='rtmp':
|
||||||
self.streamer.stream_frame_audio(frame)
|
self.streamer.stream_frame_audio(frame)
|
||||||
else:
|
else: #webrtc
|
||||||
for _ in range(2):
|
|
||||||
frame,type = self.asr.get_audio_out()
|
|
||||||
audiotype += type
|
|
||||||
frame = (frame * 32767).astype(np.int16)
|
frame = (frame * 32767).astype(np.int16)
|
||||||
new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0])
|
new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0])
|
||||||
new_frame.planes[0].update(frame.tobytes())
|
new_frame.planes[0].update(frame.tobytes())
|
||||||
new_frame.sample_rate=16000
|
new_frame.sample_rate=16000
|
||||||
# if audio_track._queue.qsize()>10:
|
asyncio.run_coroutine_threadsafe(audio_track._queue.put(new_frame), loop)
|
||||||
# time.sleep(0.1)
|
|
||||||
asyncio.run_coroutine_threadsafe(audio_track._queue.put(new_frame), loop)
|
# if self.opt.transport=='rtmp':
|
||||||
|
# for _ in range(2):
|
||||||
|
# frame,type = self.asr.get_audio_out()
|
||||||
|
# audiotype += type
|
||||||
|
# #print(f'[INFO] get_audio_out shape ',frame.shape)
|
||||||
|
# self.streamer.stream_frame_audio(frame)
|
||||||
|
# else: #webrtc
|
||||||
|
# for _ in range(2):
|
||||||
|
# frame,type = self.asr.get_audio_out()
|
||||||
|
# audiotype += type
|
||||||
|
# frame = (frame * 32767).astype(np.int16)
|
||||||
|
# new_frame = AudioFrame(format='s16', layout='mono', samples=frame.shape[0])
|
||||||
|
# new_frame.planes[0].update(frame.tobytes())
|
||||||
|
# new_frame.sample_rate=16000
|
||||||
|
# # if audio_track._queue.qsize()>10:
|
||||||
|
# # time.sleep(0.1)
|
||||||
|
# asyncio.run_coroutine_threadsafe(audio_track._queue.put(new_frame), loop)
|
||||||
#t = time.time()
|
#t = time.time()
|
||||||
if self.opt.customvideo and audiotype!=0:
|
if audiotype1!=0 and audiotype2!=0 and self.custom_index.get(audiotype1) is not None: # not the inference video, and a custom video exists
|
||||||
self.loader = iter(self.data_loader) #init
|
mirindex = self.mirror_index(len(self.custom_img_cycle[audiotype1]),self.custom_index[audiotype1])
|
||||||
imgindex = self.mirror_index(self.customimg_index)
|
#imgindex = self.mirror_index(self.customimg_index)
|
||||||
#print('custom img index:',imgindex)
|
#print('custom img index:',imgindex)
|
||||||
image = cv2.imread(os.path.join(self.opt.customvideo_img, str(int(imgindex))+'.png'))
|
#image = cv2.imread(os.path.join(self.opt.customvideo_img, str(int(imgindex))+'.png'))
|
||||||
|
image = self.custom_img_cycle[audiotype1][mirindex]
|
||||||
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
|
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
|
||||||
|
self.custom_index[audiotype1] += 1
|
||||||
if self.opt.transport=='rtmp':
|
if self.opt.transport=='rtmp':
|
||||||
self.streamer.stream_frame(image)
|
self.streamer.stream_frame(image)
|
||||||
else:
|
else:
|
||||||
new_frame = VideoFrame.from_ndarray(image, format="rgb24")
|
new_frame = VideoFrame.from_ndarray(image, format="rgb24")
|
||||||
asyncio.run_coroutine_threadsafe(video_track._queue.put(new_frame), loop)
|
asyncio.run_coroutine_threadsafe(video_track._queue.put(new_frame), loop)
|
||||||
self.customimg_index += 1
|
else: # inference video + paste-back
|
||||||
else:
|
|
||||||
self.customimg_index = 0
|
|
||||||
outputs = self.trainer.test_gui_with_data(data, self.W, self.H)
|
outputs = self.trainer.test_gui_with_data(data, self.W, self.H)
|
||||||
#print('-------ernerf time: ',time.time()-t)
|
#print('-------ernerf time: ',time.time()-t)
|
||||||
#print(f'[INFO] outputs shape ',outputs['image'].shape)
|
#print(f'[INFO] outputs shape ',outputs['image'].shape)
|
||||||
|
@ -213,6 +227,8 @@ class NeRFReal:
|
||||||
#if self.opt.asr:
|
#if self.opt.asr:
|
||||||
# self.asr.warm_up()
|
# self.asr.warm_up()
|
||||||
|
|
||||||
|
self.init_customindex()
|
||||||
|
|
||||||
if self.opt.transport=='rtmp':
|
if self.opt.transport=='rtmp':
|
||||||
from rtmp_streaming import StreamerConfig, Streamer
|
from rtmp_streaming import StreamerConfig, Streamer
|
||||||
fps=25
|
fps=25
|
||||||
|
|
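The nerfreal.py change replaces the summed `audiotype` (tested with `!= 0` together with `--customvideo`) by two per-chunk types, `audiotype1` and `audiotype2`. A custom frame is used only when neither chunk carries inference audio and a custom video is registered for that type. The sketch below isolates that decision; `pick_frame_source` and its string return values are hypothetical names used for illustration only.

```python
def pick_frame_source(audiotype1: int, audiotype2: int, custom_index: dict) -> str:
    """Decide which video source to render for one frame (two audio chunks).

    custom_index maps audiotype -> current frame counter, mirroring
    BaseReal.custom_index in the diff.
    """
    no_inference_audio = audiotype1 != 0 and audiotype2 != 0
    has_custom_video = custom_index.get(audiotype1) is not None
    if no_inference_audio and has_custom_video:
        return "custom"
    return "inference"


registered = {2: 0}  # one custom video registered under audiotype 2
print(pick_frame_source(2, 2, registered))
print(pick_frame_source(1, 1, registered))
print(pick_frame_source(0, 2, registered))
```

Under the old summed check, a frame with one speech chunk and one non-speech chunk also counted as non-zero, so the two-variable form is the stricter test.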
|
@ -54,7 +54,20 @@
|
||||||
<script type="text/javascript" src="https://ajax.aspnetcdn.com/ajax/jquery/jquery-2.1.1.min.js"></script>
|
<script type="text/javascript" src="https://ajax.aspnetcdn.com/ajax/jquery/jquery-2.1.1.min.js"></script>
|
||||||
</body>
|
</body>
|
||||||
<script type="text/javascript" charset="utf-8">
|
<script type="text/javascript" charset="utf-8">
|
||||||
|
function custom() {
|
||||||
|
fetch('/set_audiotype', {
|
||||||
|
body: JSON.stringify({
|
||||||
|
audiotype: parseInt(document.getElementById('audiotype').value),
|
||||||
|
reinit: false,
|
||||||
|
sessionid:parseInt(document.getElementById('sessionid').value),
|
||||||
|
}),
|
||||||
|
headers: {
|
||||||
|
'Content-Type': 'application/json'
|
||||||
|
},
|
||||||
|
method: 'POST'
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
$(document).ready(function() {
|
$(document).ready(function() {
|
||||||
// var host = window.location.hostname
|
// var host = window.location.hostname
|
||||||
// var ws = new WebSocket("ws://"+host+":8000/humanecho");
|
// var ws = new WebSocket("ws://"+host+":8000/humanecho");
|
||||||
|
@ -94,20 +107,6 @@
|
||||||
//ws.send(message);
|
//ws.send(message);
|
||||||
$('#message').val('');
|
$('#message').val('');
|
||||||
});
|
});
|
||||||
|
|
||||||
function custom() {
|
|
||||||
fetch('/set_audiotype', {
|
|
||||||
body: JSON.stringify({
|
|
||||||
audiotype: parseInt(document.getElementById('audiotype').value),
|
|
||||||
reinit: false,
|
|
||||||
sessionid:parseInt(document.getElementById('sessionid').value),
|
|
||||||
}),
|
|
||||||
headers: {
|
|
||||||
'Content-Type': 'application/json'
|
|
||||||
},
|
|
||||||
method: 'POST'
|
|
||||||
});
|
|
||||||
}
|
|
||||||
});
|
});
|
||||||
</script>
|
</script>
|
||||||
</html>
|
</html>
|
||||||
|
|
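The page script above switches videos by POSTing JSON to `/set_audiotype`. For scripted testing, the same request can be built from Python. This is a sketch under assumptions: `build_set_audiotype_request` is a hypothetical helper, the port default of 8010 is taken from the README URLs, and only the three body fields visible in the `fetch` call are sent.

```python
import json
from urllib import request


def build_set_audiotype_request(serverip, sessionid, audiotype, reinit=False, port=8010):
    """Build (but do not send) the POST that the page's custom() function issues."""
    payload = {
        "audiotype": int(audiotype),
        "reinit": bool(reinit),
        "sessionid": int(sessionid),
    }
    return request.Request(
        url=f"http://{serverip}:{port}/set_audiotype",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_set_audiotype_request("127.0.0.1", sessionid=0, audiotype=2)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server; the response format is not shown in this diff, so it is left unhandled here.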