Improve the digital human broadcast

lihengzhong 2023-12-28 13:11:18 +08:00
parent f0b46ddb8e
commit bdd4be3919
9 changed files with 159 additions and 202 deletions

README.md (140 lines changed)

@@ -1,121 +1,79 @@
# Virtual Human Talking-Head Generation (real-time photo-driven virtual human)
![](/img/example.gif)
# Get Started
A streaming digital human based on the ER-NeRF model, achieving synchronized audio-video dialogue. It can essentially reach commercial-grade quality.
## Installation
Tested on Ubuntu 22.04 with PyTorch 1.12 and CUDA 11.6, or PyTorch 1.12 and CUDA 11.3.
```bash
git clone https://github.com/waityousea/xuniren.git
cd xuniren
```
Also tested on Ubuntu 18.04 with PyTorch 1.12 and CUDA 11.3.
### Install dependency
```bash
# for ubuntu, portaudio is needed for pyaudio to work.
sudo apt install portaudio19-dev

conda install pytorch==1.12.1 torchvision==0.13.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt

# or create the environment from environment.yml (its pytorch is 1.12 with cuda 11.3)
conda env create -f environment.yml

# install pytorch3d
# ubuntu/mac
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
pip install tensorflow-gpu==2.8.0
```
For setting up a CUDA environment on Linux, see this article: https://zhuanlan.zhihu.com/p/674972886
Install the rtmpstream library, following https://github.com/lipku/python_rtmpstream
## Run
### Run the rtmp server (srs)
```
docker run --rm -it -p 1935:1935 -p 1985:1985 -p 8080:8080 registry.cn-hangzhou.aliyuncs.com/ossrs/srs:5
```
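To check that srs came up, you can query its HTTP API (port 1985 is mapped by the command above; the versions endpoint is from the srs API docs):
```bash
curl http://localhost:1985/api/v1/versions
```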
**Installing pytorch3d on Windows**
- gcc & g++ ≥ 4.9
On Windows you need a gcc compiler; install one as needed, e.g. MinGW.
The following steps come from the official [pytorch3d](https://github.com/facebookresearch/pytorch3d/blob/main/INSTALL.md) install guide; pick what you need.
```bash
conda create -n pytorch3d python=3.9
conda activate pytorch3d
conda install pytorch=1.13.0 torchvision pytorch-cuda=11.6 -c pytorch -c nvidia
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
```
The CUB build-time dependency is only needed when your CUDA is older than 11.7; if you use conda, you can install it with:
```
conda install -c bottler nvidiacub
```
```
# Demos and examples
conda install jupyter
pip install scikit-image matplotlib imageio plotly opencv-python
# Tests/Linting
pip install black usort flake8 flake8-bugbear flake8-comprehensions
```
After applying any necessary patches, open the "x64 Native Tools Command Prompt for VS 2019" to build and install:
```
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
python setup.py install
```
### Build extension
By default, we use [`load`](https://pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.load) to build the extension at runtime. However, this can be inconvenient, so we also provide `setup.py` to build each extension ahead of time (a sketch of the runtime `load` path follows the block below):
```
# install all extension modules
# note: this module must be installed.
# on windows, run this in the "x64 Native Tools Command Prompt for VS 2019" window
bash scripts/install_ext.sh
```
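For reference, a minimal sketch of the runtime `load` path (the module and source file names here are illustrative, not the project's actual sources):
```python
# JIT-compile and import a C++/CUDA extension on first use
from torch.utils.cpp_extension import load

_ext = load(name='example_ext',
            sources=['example_ext.cpp', 'example_ext_kernel.cu'],
            verbose=True)
```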
### **Start (standalone)**
After the environment is set up, start the virtual human generator:
```bash
python app.py
```
### **Start with Fay integration (tested on Ubuntu 20)**
After the environment is set up, start the Fay integration script:
```bash
python fay_connect.py
```
If huggingface is unreachable, set this before running:
```bash
export HF_ENDPOINT=https://hf-mirror.com
```
![](img/weplay.png)
Scan the code to support the open-source development work; use your payment order number to join the QQ discussion group.
Once it is running, open rtmp://serverip/live/livestream in VLC.
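Alternatively, if ffmpeg is installed, the stream can be checked from the command line:
```bash
ffplay rtmp://serverip/live/livestream
```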
### Driving the digital human broadcast from a web page
Install and start nginx:
```
apt install nginx
nginx
```
In echo.html, update the websocket and video playback addresses, replacing serverip with the actual server IP.
Then copy echo.html and mpegts-1.7.3.min.js to /var/www/html.
Start the digital human:
```bash
python app.py
```
Interface input and output details: [WebSocket.md](https://github.com/waityousea/xuniren/blob/main/WebSocket.md)
Open http://serverip/echo.html in a browser, enter any text in the box, and submit; the digital human reads it aloud.
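The same interface can also be driven without the browser. A minimal client sketch (assuming the third-party websocket-client package; the port and route come from app.py):
```python
# send text to the running digital human over its websocket
# assumes: pip install websocket-client; replace serverip with your server's IP
from websocket import create_connection

ws = create_connection('ws://serverip:8000/humanecho')
ws.send('你好,这是一段测试播报')  # the server synthesizes and broadcasts this text
ws.close()
```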
Core files for virtual human generation
## Data flow
![](/assets/dataflow.png)
## Digital human model files; they can be replaced with a model you trained yourself (https://github.com/Fictionarry/ER-NeRF)
```
## note: the core files must be trained separately
.
├── data
│   ├── kf.json
│   ├── data_kf.json
│   └── pretrained
│       └── ngp_kg.pth
```
### Inference Speed
On a desktop RTX A4000 or a laptop RTX 3080 Ti GPU (16 GB VRAM), video inference runs at about 35-43 frames per second; at 25 fps video, one second of wall-clock time renders roughly 1.5 seconds of video (e.g. 37.5 / 25 = 1.5).
## TODO
- Add chatgpt to enable digital human dialogue
- Voice cloning
- Play a placeholder video while the digital human is silent
# Acknowledgement
- The data pre-processing part is adapted from [AD-NeRF](https://github.com/YudongGuo/AD-NeRF).
- The NeRF framework is based on [torch-ngp](https://github.com/ashawkey/torch-ngp).
- The algorithm core comes from [RAD-NeRF](https://github.com/ashawkey/RAD-NeRF).
- Usage example [Fay](https://github.com/TheRamU/Fay).
For academic exchange, email waityousea@126.com.
If this project helps you, please give it a star. Anyone interested is welcome to help improve it.
Email: lipku@foxmail.com

app.py (44 lines changed)

@@ -7,10 +7,11 @@ import json
import gevent
from gevent import pywsgi
from geventwebsocket.handler import WebSocketHandler
from tools import audio_pre_process, video_pre_process, generate_video,audio_process
import os
import re
import numpy as np
from threading import Thread
import multiprocessing
import argparse
from nerf_triplane.provider import NeRFDataset_Test
@@ -24,7 +25,6 @@ import edge_tts
app = Flask(__name__)
sockets = Sockets(app)
video_list = []
global nerfreal
@@ -40,33 +40,15 @@ async def main(voicename: str, text: str, render):
pass
def send_information(path, ws):
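# read the generated mp4 from disk, base64-encode it, and send it to the client as JSON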
print('传输信息开始!')
#path = video_list[0]
''''''
with open(path, 'rb') as f:
video_data = base64.b64encode(f.read()).decode()
data = {
'video': 'data:video/mp4;base64,%s' % video_data,
}
json_data = json.dumps(data)
ws.send(json_data)
def txt_to_audio(text_):
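# synthesize the text with edge-tts (voice zh-CN-YunxiaNeural) and feed the audio to the nerf renderer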
audio_list = []
#audio_path = 'data/audio/aud_0.wav'
voicename = "zh-CN-YunxiaNeural"
text = text_
asyncio.get_event_loop().run_until_complete(main(voicename,text,nerfreal))
#audio_process(audio_path)
@sockets.route('/dighuman')
@sockets.route('/humanecho')
def echo_socket(ws):
# 获取WebSocket对象
#ws = request.environ.get('wsgi.websocket')
@@ -81,19 +63,12 @@ def echo_socket(ws):
message = ws.receive()
if len(message)==0:
return '输入信息为空'
else:
txt_to_audio(message)
audio_path = 'data/audio/aud_0.wav'
audio_path_eo = 'data/audio/aud_0_eo.npy'
video_path = 'data/video/results/ngp_0.mp4'
output_path = 'data/video/results/output_0.mp4'
generate_video(audio_path, audio_path_eo, video_path, output_path)
video_list.append(output_path)
send_information(output_path, ws)
def render():
nerfreal.render()
if __name__ == '__main__':
@@ -242,12 +217,13 @@ if __name__ == '__main__':
# we still need test_loader to provide audio features for testing.
nerfreal = NeRFReal(opt, trainer, test_loader)
txt_to_audio('我是中国人,我来自北京')
nerfreal.render()
#txt_to_audio('我是中国人,我来自北京')
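# run NeRF rendering in a background thread so the websocket server below can start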
rendthrd = Thread(target=render)
rendthrd.start()
#############################################################################
server = pywsgi.WSGIServer(('127.0.0.1', 8800), app, handler_class=WebSocketHandler)
print('start websocket server')
server = pywsgi.WSGIServer(('0.0.0.0', 8000), app, handler_class=WebSocketHandler)
server.serve_forever()

asr.py

@@ -8,6 +8,7 @@ import pyaudio
import soundfile as sf
import resampy
import queue
from queue import Queue
#from collections import deque
from threading import Thread, Event
@@ -318,9 +319,11 @@ class ASR:
return None
else:
frame = self.queue.get()
print(f'[INFO] get frame {frame.shape}')
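# non-blocking read: fall back to a chunk of silence when no audio is queued, so rendering never stalls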
try:
frame = self.queue.get(block=False)
print(f'[INFO] get frame {frame.shape}')
except queue.Empty:
frame = np.zeros(self.chunk, dtype=np.float32)
self.idx = self.idx + self.chunk
@@ -380,10 +383,9 @@ class ASR:
def push_audio(self,buffer):
print(f'[INFO] push_audio {len(buffer)}')
self.input_stream.write(buffer)
if len(buffer)<=0:
self.input_stream.seek(0)
stream = self.create_bytes_stream(self.input_stream)
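# wrap the pushed bytes in a fresh BytesIO and split them into chunk-sized pieces for the queue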
if len(buffer)>0:
byte_stream=BytesIO(buffer)
stream = self.create_bytes_stream(byte_stream)
streamlen = stream.shape[0]
idx=0
while streamlen >= self.chunk:
@@ -392,6 +394,18 @@ class ASR:
idx += self.chunk
if streamlen>0:
self.queue.put(stream[idx:])
# self.input_stream.write(buffer)
# if len(buffer)<=0:
# self.input_stream.seek(0)
# stream = self.create_bytes_stream(self.input_stream)
# streamlen = stream.shape[0]
# idx=0
# while streamlen >= self.chunk:
# self.queue.put(stream[idx:idx+self.chunk])
# streamlen -= self.chunk
# idx += self.chunk
# if streamlen>0:
# self.queue.put(stream[idx:])
def get_audio_out(self):
return self.output_queue.get()

assets/dataflow.png (new binary file, 14 KiB)

echo.html (new file, 62 lines)

@@ -0,0 +1,62 @@
<!-- index.html -->
<html>
<head>
<script type="text/javascript" src="mpegts-1.7.3.min.js"></script>
<script type="text/javascript" src="http://cdn.sockjs.org/sockjs-0.3.4.js"></script>
<script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>
</head>
<body>
<div class="container">
<h1>WebSocket Test</h1>
<form class="form-inline" id="echo-form">
<div class="form-group">
<p>input text</p>
<textarea cols="2" rows="3" style="width:600px;height:50px;" class="form-control" id="message">test</textarea>
</div>
<button type="submit" class="btn btn-default">Send</button>
</form>
<div id="log">
</div>
<video id="video_player" width="40%" autoplay controls></video>
</div>
</body>
<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
var ws = new WebSocket('ws://serverip:8000/humanecho');
//document.getElementsByTagName("video")[0].setAttribute("src", aa["video"]);
ws.onopen = function() {
console.log('Connected');
};
ws.onmessage = function(e) {
console.log('Received: ' + e.data);
var data = e;
var vid = JSON.parse(data.data);
console.log(typeof(vid),vid)
//document.getElementsByTagName("video")[0].setAttribute("src", vid["video"]);
};
ws.onclose = function(e) {
console.log('Closed');
};
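// pull the live flv stream published by srs and play it in the <video> element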
var flvPlayer = mpegts.createPlayer({type: 'flv', url: "http://serverip:8080/live/livestream.flv", isLive: true, enableStashBuffer: false});
flvPlayer.attachMediaElement(document.getElementById('video_player'));
flvPlayer.load();
flvPlayer.play();
$('#echo-form').on('submit', function(e) {
e.preventDefault();
var message = $('#message').val();
console.log('Sending: ' + message);
ws.send(message);
$('#message').val('');
});
});
</script>
</html>

mpegts-1.7.3.min.js (vendored new file, 9 lines; diff suppressed because one or more lines are too long)

nerfreal.py

@@ -144,13 +144,13 @@ class NeRFReal:
data['auds'] = self.asr.get_next_feat()
outputs = self.trainer.test_gui_with_data(data, self.W, self.H)
print(f'[INFO] outputs shape ',outputs['image'].shape)
#print(f'[INFO] outputs shape ',outputs['image'].shape)
image = (outputs['image'] * 255).astype(np.uint8)
self.streamer.stream_frame(image)
#self.pipe.stdin.write(image.tostring())
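# two audio chunks are sent per video frame to keep audio and video in sync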
for _ in range(2):
frame = self.asr.get_audio_out()
print(f'[INFO] get_audio_out shape ',frame.shape)
#print(f'[INFO] get_audio_out shape ',frame.shape)
self.streamer.stream_frame_audio(frame)
# frame = (frame * 32767).astype(np.int16).tobytes()
# self.fifo_audio.write(frame)

requirements.txt

@@ -12,6 +12,7 @@ rich
dearpygui
packaging
scipy
scikit-learn
face_alignment
python_speech_features
@@ -24,3 +25,8 @@ configargparse
lpips
imageio-ffmpeg
transformers
edge_tts
flask
flask_sockets

File diff suppressed because one or more lines are too long