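Many factors affect latency when synthesizing speech with Microsoft's speech service. The tests below measure how much each factor matters and give a conclusion for each.
The Microsoft Speech SDK exposes 8 APIs for requesting speech synthesis. They vary along three dimensions: Text or SSML, synchronous or asynchronous, and streaming or non-streaming. We look at the synthesis latency of each form in turn, and then also compare different regions, pre-initializing the SpeechSynthesizer or not, and synthesizing repeated content.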
SpeakText
SpeakTextAsync
StartSpeakingText
StartSpeakingTextAsync
SpeakSsml
SpeakSsmlAsync
StartSpeakingSsml
StartSpeakingSsmlAsync
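The eight method names listed above are simply the combinations of the three dimensions. As a minimal sketch (the class name `SpeakApiNames` is hypothetical and only builds strings, it does not call the SDK), the naming scheme can be reproduced like this:

```java
import java.util.ArrayList;
import java.util.List;

class SpeakApiNames {
    // Combines input type, streaming, and async into the eight method names.
    static List<String> all() {
        List<String> names = new ArrayList<>();
        for (String input : new String[] {"Text", "Ssml"}) {
            for (boolean streaming : new boolean[] {false, true}) {
                for (boolean async : new boolean[] {false, true}) {
                    // Streaming variants start with "StartSpeaking", the others with "Speak".
                    String prefix = streaming ? "StartSpeaking" : "Speak";
                    names.add(prefix + input + (async ? "Async" : ""));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        all().forEach(System.out::println);
    }
}
```

The loop order is chosen so the output follows the same order as the list above.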
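The factors with the largest impact on synthesis latency are the region and whether streaming is used.
Voice: zh-CN-XiaochenMultilingualNeural
Language: en-US
Output format: SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3
Test content: 30 randomly generated Chinese characters
Text: speechSynthesizerZH.SpeakText(text);
SSML: speechSynthesizerZH.SpeakSsml(ssml);
The two differ little: Text is roughly 3~5% faster than SSML.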
- void textOrSsml() {
- SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
- config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);
-
- final int TEXT_LENGTH = 30;
- final int LOOP_TIMES = 50;
-
- // ---- Init Link ---- //
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizerZH.SpeakText(text);
-
- var ssmlText = getRandomChinese(TEXT_LENGTH);
- String ssml = buildSsml(ssmlText, "zh-CN-XiaochenMultilingualNeural", "en-US");
- speechSynthesizerZH.SpeakSsml(ssml);
- System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
- }
-
- var textTimes = new ArrayList<Long>();
- var ssmlTimes = new ArrayList<Long>();
-
- for (int i = 0; i <= LOOP_TIMES; i++) {
- {
- var text = getRandomChinese(TEXT_LENGTH);
- String ssml = buildSsml(text, "zh-CN-XiaochenMultilingualNeural", "en-US");
- long s = System.currentTimeMillis();
- SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.SpeakSsml(ssml);
- ssmlTimes.add(System.currentTimeMillis() - s);
- }
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.SpeakText(text);
- textTimes.add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.\tSpeakText\tSpeakSsml\tDifference\n");
- for (int i = 0; i < textTimes.size(); i++) {
- report.append(i).append("\t");
- report.append(textTimes.get(i)).append("\t");
- report.append(ssmlTimes.get(i)).append("\t");
- report.append(textTimes.get(i) - ssmlTimes.get(i)).append("\n");
- }
- double textTimeAvg = textTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- double ssmlTimeAvg = ssmlTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append("avg").append("\t");
- report.append(textTimeAvg).append("\t");
- report.append(ssmlTimeAvg).append("\t");
- report.append(textTimeAvg - ssmlTimeAvg).append("\n");
- System.out.println(report);
- }
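Synchronous: speechSynthesizerZH.SpeakText(text);
Asynchronous: speechSynthesizerZH.SpeakTextAsync(text).get();
The asynchronous variant uses a thread pool to call SpeakText(text), so for a single request the latency is almost identical. The SSML variants behave the same way.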
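The pattern described here, an async method that only submits the blocking call to a thread pool, can be sketched with the standard library alone. This is an illustration under assumptions, not the SDK's actual code: `speakSync` is a hypothetical stand-in for a blocking call such as SpeakText(text), and the pool type is an arbitrary choice.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class AsyncWrapperSketch {
    // Hypothetical stand-in for a blocking synthesis call.
    static String speakSync(String text) {
        return "audio(" + text + ")";
    }

    // The async variant only hands the same blocking call to a pool thread.
    static Future<String> speakAsync(ExecutorService pool, String text) {
        return pool.submit(() -> speakSync(text));
    }

    // Submits the call and waits for it, like SpeakTextAsync(text).get() in the test.
    static String speakAsyncAndWait(String text) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return speakAsync(pool, text).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(speakSync("hello").equals(speakAsyncAndWait("hello")));
    }
}
```

Because the caller waits on `get()` right away, the work done is the same as in the synchronous call plus a thread hand-off, which is consistent with the near-identical averages measured here. The async form only pays off when the caller does something else before calling `get()`.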
NO. | SpeakText (ms) | SpeakTextAsync (ms) | Difference (ms) |
0 | 606 | 567 | 39 |
1 | 589 | 604 | -15 |
2 | 717 | 689 | 28 |
3 | 544 | 519 | 25 |
4 | 665 | 652 | 13 |
5 | 623 | 596 | 27 |
6 | 636 | 574 | 62 |
7 | 655 | 639 | 16 |
8 | 591 | 578 | 13 |
9 | 636 | 759 | -123 |
10 | 563 | 503 | 60 |
11 | 653 | 622 | 31 |
12 | 536 | 719 | -183 |
13 | 633 | 621 | 12 |
14 | 516 | 532 | -16 |
15 | 687 | 470 | 217 |
16 | 590 | 609 | -19 |
17 | 582 | 610 | -28 |
18 | 590 | 590 | 0 |
19 | 623 | 500 | 123 |
20 | 639 | 716 | -77 |
21 | 522 | 581 | -59 |
22 | 908 | 556 | 352 |
23 | 581 | 574 | 7 |
24 | 593 | 565 | 28 |
25 | 607 | 591 | 16 |
26 | 594 | 502 | 92 |
27 | 550 | 834 | -284 |
28 | 624 | 717 | -93 |
29 | 518 | 590 | -72 |
30 | 666 | 488 | 178 |
31 | 659 | 605 | 54 |
32 | 470 | 548 | -78 |
33 | 539 | 496 | 43 |
34 | 547 | 515 | 32 |
35 | 517 | 644 | -127 |
36 | 575 | 513 | 62 |
37 | 559 | 562 | -3 |
38 | 683 | 773 | -90 |
39 | 537 | 716 | -179 |
40 | 777 | 565 | 212 |
41 | 560 | 1126 | -566 |
42 | 552 | 578 | -26 |
43 | 1086 | 655 | 431 |
44 | 789 | 791 | -2 |
45 | 641 | 730 | -89 |
46 | 485 | 590 | -105 |
47 | 867 | 576 | 291 |
48 | 578 | 1076 | -498 |
49 | 657 | 579 | 78 |
50 | 681 | 537 | 144 |
avg | 623.4509804 | 624.3529412 | -0.901960784 |
- void syncOrAsync() throws ExecutionException, InterruptedException {
- SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
- config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);
-
- final int TEXT_LENGTH = 30;
- final int LOOP_TIMES = 50;
-
- // ---- Init Link ---- //
- {
- var text1 = getRandomChinese(TEXT_LENGTH);
- var text2 = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizerZH.SpeakText(text1);
- speechSynthesizerZH.SpeakTextAsync(text2).get();
- System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
- }
-
- var textTimes = new ArrayList<Long>();
- var testAsyncTimes = new ArrayList<Long>();
-
- for (int i = 0; i <= LOOP_TIMES; i++) {
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizerZH.SpeakText(text);
- textTimes.add(System.currentTimeMillis() - s);
- }
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizerZH.SpeakTextAsync(text).get();
- testAsyncTimes.add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.\tSpeakText\tSpeakTextAsync\tDifference\n");
- for (int i = 0; i < textTimes.size(); i++) {
- report.append(i).append("\t");
- report.append(textTimes.get(i)).append("\t");
- report.append(testAsyncTimes.get(i)).append("\t");
- report.append(textTimes.get(i) - testAsyncTimes.get(i)).append("\n");
- }
- double textTimeAvg = textTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- double asyncTimeAvg = testAsyncTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append("avg").append("\t");
- report.append(textTimeAvg).append("\t");
- report.append(asyncTimeAvg).append("\t");
- report.append(textTimeAvg - asyncTimeAvg).append("\n");
- System.out.println(report);
- }
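Streaming: speechSynthesizerZH.StartSpeakingText(text);
Non-streaming: speechSynthesizerZH.SpeakText(text);
In this test (30 characters), the first audio chunk from the streaming API arrives with about 30% lower latency than the complete result from the non-streaming API. With 10 characters it is about 20% lower. The longer the text, the greater the latency benefit of streaming.
NO. | NotStream (ms) | Stream (ms) | Difference (ms) |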
0 | 728 | 746 | -18 |
1 | 701 | 483 | 218 |
2 | 744 | 452 | 292 |
3 | 728 | 505 | 223 |
4 | 628 | 481 | 147 |
5 | 657 | 502 | 155 |
6 | 701 | 437 | 264 |
7 | 713 | 489 | 224 |
8 | 657 | 469 | 188 |
9 | 731 | 608 | 123 |
10 | 656 | 472 | 184 |
11 | 681 | 502 | 179 |
12 | 727 | 467 | 260 |
13 | 796 | 531 | 265 |
14 | 689 | 439 | 250 |
15 | 711 | 474 | 237 |
16 | 790 | 410 | 380 |
17 | 719 | 698 | 21 |
18 | 671 | 423 | 248 |
19 | 791 | 595 | 196 |
20 | 736 | 410 | 326 |
21 | 775 | 488 | 287 |
22 | 644 | 465 | 179 |
23 | 700 | 460 | 240 |
24 | 718 | 441 | 277 |
25 | 819 | 530 | 289 |
26 | 671 | 426 | 245 |
27 | 595 | 438 | 157 |
28 | 746 | 516 | 230 |
29 | 853 | 504 | 349 |
30 | 700 | 485 | 215 |
31 | 787 | 466 | 321 |
32 | 734 | 470 | 264 |
33 | 680 | 426 | 254 |
34 | 928 | 458 | 470 |
35 | 729 | 434 | 295 |
36 | 689 | 699 | -10 |
37 | 761 | 428 | 333 |
38 | 722 | 472 | 250 |
39 | 661 | 591 | 70 |
40 | 719 | 502 | 217 |
41 | 756 | 482 | 274 |
42 | 708 | 426 | 282 |
43 | 858 | 456 | 402 |
44 | 728 | 380 | 348 |
45 | 729 | 472 | 257 |
46 | 667 | 443 | 224 |
47 | 762 | 441 | 321 |
48 | 699 | 380 | 319 |
49 | 594 | 379 | 215 |
50 | 625 | 499 | 126 |
avg | 719.8431373 | 483.3333333 | 236.5098039 |
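The percentage quoted for streaming can be checked against the avg row above. The following is a small sketch of that arithmetic (the class and method names are hypothetical); with the two averages from the table it comes out at roughly 33%.

```java
class LatencyReduction {
    // Relative reduction of improvedMs compared with baselineMs, in percent.
    static double reductionPercent(double baselineMs, double improvedMs) {
        return (baselineMs - improvedMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        double notStreamAvg = 719.8431373; // avg of the NotStream column
        double streamAvg = 483.3333333;    // avg of the Stream column
        System.out.printf("%.1f%%%n", reductionPercent(notStreamAvg, streamAvg));
    }
}
```

Note that the two columns measure different things: the streaming figure is the time until the first buffer of audio is read, while the non-streaming figure is the time until the whole audio is available.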
- void streamOrNotStream() {
- SpeechConfig config = SpeechConfig.fromSubscription("{key}", "{region}");
- config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer speechSynthesizerZH = new SpeechSynthesizer(config, null);
-
- final int TEXT_LENGTH = 30;
- final int LOOP_TIMES = 50;
-
- // ---- Init Link ---- //
- {
- var text1 = getRandomChinese(TEXT_LENGTH);
- var text2 = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizerZH.SpeakText(text1);
- SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.StartSpeakingText(text2);
- AudioDataStream audioDataStream = AudioDataStream.fromResult(speechSynthesisResult);
- byte[] buffer = new byte[10000];
- long filledSize = audioDataStream.readData(buffer);
- System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
- }
-
- var notStreamTimes = new ArrayList<Long>();
- var streamTimes = new ArrayList<Long>();
-
- for (int i = 0; i <= LOOP_TIMES; i++) {
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizerZH.SpeakText(text);
- notStreamTimes.add(System.currentTimeMillis() - s);
- }
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- SpeechSynthesisResult speechSynthesisResult = speechSynthesizerZH.StartSpeakingText(text);
- AudioDataStream audioDataStream = AudioDataStream.fromResult(speechSynthesisResult);
- byte[] buffer = new byte[8000];
- long filledSize = audioDataStream.readData(buffer);
- streamTimes.add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.\tNotStream\tStream\tDifference\n");
- for (int i = 0; i < notStreamTimes.size(); i++) {
- report.append(i).append("\t");
- report.append(notStreamTimes.get(i)).append("\t");
- report.append(streamTimes.get(i)).append("\t");
- report.append(notStreamTimes.get(i) - streamTimes.get(i)).append("\n");
- }
- double notStreamTimeAvg = notStreamTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- double streamTimeAvg = streamTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append("avg").append("\t");
- report.append(notStreamTimeAvg).append("\t");
- report.append(streamTimeAvg).append("\t");
- report.append(notStreamTimeAvg - streamTimeAvg).append("\n");
- System.out.println(report);
- }
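Regions tested: eastasia (East Asia), southeastasia (Southeast Asia), japanwest (Japan West), japaneast (Japan East)
Requests sent from: Tokyo
japanwest (505 ms) < japaneast (542 ms) < eastasia (630 ms) < southeastasia (733 ms)
Switching from eastasia to japanwest lowers latency by about 20%.
NO. | japaneast (ms) | eastasia (ms) | japanwest (ms) | southeastasia (ms) |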
1 | 507 | 843 | 771 | 607 |
2 | 507 | 626 | 465 | 1072 |
3 | 500 | 609 | 445 | 693 |
4 | 594 | 597 | 488 | 815 |
5 | 627 | 679 | 482 | 663 |
6 | 563 | 626 | 440 | 642 |
7 | 724 | 828 | 585 | 714 |
8 | 502 | 599 | 469 | 848 |
9 | 573 | 534 | 599 | 675 |
10 | 506 | 627 | 420 | 703 |
11 | 519 | 597 | 522 | 611 |
12 | 579 | 600 | 448 | 917 |
13 | 599 | 653 | 608 | 704 |
14 | 540 | 624 | 439 | 871 |
15 | 606 | 625 | 517 | 647 |
16 | 557 | 530 | 468 | 730 |
17 | 580 | 601 | 468 | 792 |
18 | 590 | 575 | 414 | 633 |
19 | 521 | 520 | 447 | 690 |
20 | 643 | 617 | 448 | 687 |
21 | 542 | 772 | 448 | 688 |
22 | 486 | 521 | 414 | 664 |
23 | 421 | 662 | 727 | 678 |
24 | 607 | 559 | 439 | 728 |
25 | 453 | 593 | 1277 | 619 |
26 | 455 | 538 | 422 | 814 |
27 | 547 | 574 | 470 | 625 |
28 | 604 | 619 | 469 | 654 |
29 | 525 | 601 | 538 | 637 |
30 | 482 | 510 | 456 | 793 |
31 | 556 | 556 | 426 | 830 |
32 | 480 | 566 | 448 | 642 |
33 | 585 | 633 | 699 | 739 |
34 | 530 | 637 | 452 | 822 |
35 | 520 | 521 | 438 | 765 |
36 | 457 | 533 | 532 | 717 |
37 | 548 | 631 | 572 | 1102 |
38 | 508 | 841 | 427 | 655 |
39 | 508 | 583 | 463 | 802 |
40 | 623 | 793 | 427 | 693 |
41 | 453 | 1141 | 513 | 745 |
42 | 584 | 606 | 432 | 669 |
43 | 538 | 703 | 538 | 1022 |
44 | 537 | 715 | 437 | 668 |
45 | 549 | 526 | 513 | 714 |
46 | 534 | 569 | 407 | 707 |
47 | 582 | 585 | 444 | 683 |
48 | 492 | 629 | 452 | 681 |
49 | 511 | 572 | 603 | 764 |
50 | 573 | 705 | 463 | 648 |
avg | 542.54 | 630.08 | 505.78 | 733.64 |
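Picking a region from these measurements amounts to sorting the avg row. A minimal sketch, using the four averages from the table above (the class and method names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RegionRanking {
    // Average latencies in ms, copied from the avg row of the table.
    static Map<String, Double> measuredAverages() {
        Map<String, Double> avg = new LinkedHashMap<>();
        avg.put("japaneast", 542.54);
        avg.put("eastasia", 630.08);
        avg.put("japanwest", 505.78);
        avg.put("southeastasia", 733.64);
        return avg;
    }

    // Returns the region names ordered from lowest to highest average latency.
    static List<String> fastestFirst(Map<String, Double> avgMsByRegion) {
        return avgMsByRegion.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(fastestFirst(measuredAverages()));
    }
}
```

The ranking only holds for requests sent from Tokyo, as in this test. A client located elsewhere would need its own measurements.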
- static String diffRegion(int textLength, int loopTimes, Pair<String, String>... regions) {
- Map<String, SpeechSynthesizer> regionSpeechSynthesizer = new HashMap<>();
- Map<String, List<Long>> regionTimes = new HashMap<>();
- for (Pair<String, String> pair : regions) {
- val region = pair.getRight();
- SpeechConfig config = SpeechConfig.fromSubscription(pair.getLeft(), region);
- config.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config, null);
-
- regionSpeechSynthesizer.put(region, speechSynthesizer);
- regionTimes.put(region, new ArrayList<>());
- }
-
- // ---- Init Link ---- //
- {
- long s = System.currentTimeMillis();
- for (Map.Entry<String, SpeechSynthesizer> entry : regionSpeechSynthesizer.entrySet()) {
- var text = getRandomChinese(textLength);
- SpeechSynthesisResult speechSynthesisResult = entry.getValue().SpeakText(text);
- if (speechSynthesisResult.getReason() == ResultReason.Canceled) {
- SpeechSynthesisCancellationDetails speechSynthesisCancellationDetails = SpeechSynthesisCancellationDetails.fromResult(speechSynthesisResult);
- System.out.println("failed: " + entry.getKey());
- System.out.println(speechSynthesisCancellationDetails.getErrorDetails());
- } else {
- System.out.println("ok:" + entry.getKey());
- }
- }
- System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
- }
-
- for (int i = 1; i <= loopTimes; i++) {
- System.out.println("loop NO." + i);
- for (Map.Entry<String, SpeechSynthesizer> entry : regionSpeechSynthesizer.entrySet()) {
- var text = getRandomChinese(textLength);
- long s = System.currentTimeMillis();
- entry.getValue().SpeakText(text);
- regionTimes.get(entry.getKey()).add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.");
- for (String region : regionTimes.keySet()) {
- report.append("\t" + region);
- }
- report.append("\n");
-
- for (int i = 0; i < loopTimes; i++) {
- report.append(i + 1).append("\t");
- for (List<Long> times : regionTimes.values()) {
- report.append(times.get(i)).append("\t");
- }
- report.append("\n");
- }
- // Compute the average time per region
- report.append("avg").append("\t");
- for (List<Long> times : regionTimes.values()) {
- double avg = times.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append(avg).append("\t");
- }
- report.append("\n");
- return report.toString();
- }
Pre-connecting:
- SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, (AudioConfig) null);
- Connection connection = Connection.fromSpeechSynthesizer(synthesizer);
- connection.openConnection(true);
No difference in synthesis latency.
NO. | PreConnect (ms) | NotPreConnect (ms) | Difference (ms) |
0 | 350 | 334 | 16 |
1 | 347 | 305 | 42 |
2 | 350 | 410 | -60 |
3 | 350 | 350 | 0 |
4 | 335 | 350 | -15 |
5 | 349 | 336 | 13 |
6 | 364 | 348 | 16 |
7 | 334 | 349 | -15 |
8 | 347 | 350 | -3 |
9 | 348 | 349 | -1 |
10 | 347 | 365 | -18 |
11 | 349 | 339 | 10 |
12 | 354 | 351 | 3 |
13 | 336 | 348 | -12 |
14 | 350 | 368 | -18 |
15 | 330 | 368 | -38 |
16 | 337 | 349 | -12 |
17 | 336 | 353 | -17 |
18 | 367 | 350 | 17 |
19 | 348 | 346 | 2 |
20 | 348 | 348 | 0 |
21 | 350 | 335 | 15 |
22 | 351 | 353 | -2 |
23 | 350 | 349 | 1 |
24 | 352 | 352 | 0 |
25 | 355 | 351 | 4 |
26 | 335 | 337 | -2 |
27 | 349 | 336 | 13 |
28 | 349 | 352 | -3 |
29 | 352 | 363 | -11 |
30 | 349 | 353 | -4 |
31 | 350 | 335 | 15 |
32 | 352 | 303 | 49 |
33 | 349 | 349 | 0 |
34 | 352 | 349 | 3 |
35 | 351 | 349 | 2 |
36 | 349 | 351 | -2 |
37 | 351 | 349 | 2 |
38 | 352 | 347 | 5 |
39 | 348 | 351 | -3 |
40 | 349 | 351 | -2 |
41 | 321 | 349 | -28 |
42 | 349 | 349 | 0 |
43 | 365 | 351 | 14 |
44 | 350 | 350 | 0 |
45 | 349 | 351 | -2 |
46 | 349 | 336 | 13 |
47 | 335 | 335 | 0 |
48 | 350 | 337 | 13 |
49 | 350 | 348 | 2 |
50 | 351 | 349 | 2 |
avg | 347.8431373 | 347.7647059 | 0.078431373 |
- void preConnect() throws InterruptedException {
- final int TEXT_LENGTH = 30;
- final int LOOP_TIMES = 50;
-
- var reuseTimes = new ArrayList<Long>();
- var notReuseTimes = new ArrayList<Long>();
-
- for (int i = 0; i <= LOOP_TIMES; i++) {
- {
- SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
- config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config1, null);
- Connection connection = Connection.fromSpeechSynthesizer(speechSynthesizer);
- connection.openConnection(true);
-
- Thread.sleep(200);
-
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizer.SpeakText(text);
- reuseTimes.add(System.currentTimeMillis() - s);
- }
-
- {
- SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
- config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer notPreConnectSpeechSynthesizer = new SpeechSynthesizer(config2, null);
-
- Thread.sleep(200);
-
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- notPreConnectSpeechSynthesizer.SpeakText(text);
- notReuseTimes.add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.\tPreConnect\tNotPreConnect\tDifference\n");
- for (int i = 0; i < reuseTimes.size(); i++) {
- report.append(i).append("\t");
- report.append(reuseTimes.get(i)).append("\t");
- report.append(notReuseTimes.get(i)).append("\t");
- report.append(reuseTimes.get(i) - notReuseTimes.get(i)).append("\n");
- }
- double avg1 = reuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- double avg2 = notReuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append("avg").append("\t");
- report.append(avg1).append("\t");
- report.append(avg2).append("\t");
- report.append(avg1 - avg2).append("\n");
- System.out.println(report);
- }
Reuse: take a SpeechSynthesizer object from a pool
No reuse: create a new SpeechSynthesizer object for every request
No difference in synthesis latency.
NO. | Reuse (ms) | NotReuse (ms) | Difference (ms) |
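The "pool" mentioned for the reuse case is not shown in the test code, which reuses a single instance. As an assumption about what such a pool could look like, here is a minimal generic sketch built on the standard library. `SimplePool` and its methods are hypothetical names, and `T` would be SpeechSynthesizer in practice.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

class SimplePool<T> {
    private final BlockingQueue<T> idle;

    // Creates all pooled objects up front using the given factory.
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Returns an idle object, or null when every object is currently in use.
    T borrow() {
        return idle.poll();
    }

    // Puts an object back; it is dropped silently if the pool is already full.
    void giveBack(T object) {
        idle.offer(object);
    }

    int available() {
        return idle.size();
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(2, StringBuilder::new);
        StringBuilder sb = pool.borrow();
        System.out.println(pool.available()); // one object is still idle
        pool.giveBack(sb);
    }
}
```

Returning null on exhaustion keeps the sketch free of blocking and checked exceptions. A real pool would more likely block or create a new object on demand.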
0 | 339 | 336 | 3 |
1 | 348 | 364 | -16 |
2 | 335 | 353 | -18 |
3 | 349 | 349 | 0 |
4 | 349 | 336 | 13 |
5 | 349 | 353 | -4 |
6 | 363 | 349 | 14 |
7 | 363 | 353 | 10 |
8 | 348 | 351 | -3 |
9 | 350 | 335 | 15 |
10 | 352 | 350 | 2 |
11 | 348 | 350 | -2 |
12 | 351 | 349 | 2 |
13 | 355 | 349 | 6 |
14 | 352 | 352 | 0 |
15 | 350 | 349 | 1 |
16 | 368 | 350 | 18 |
17 | 350 | 350 | 0 |
18 | 349 | 348 | 1 |
19 | 352 | 364 | -12 |
20 | 349 | 352 | -3 |
21 | 349 | 351 | -2 |
22 | 351 | 348 | 3 |
23 | 351 | 349 | 2 |
24 | 351 | 351 | 0 |
25 | 365 | 367 | -2 |
26 | 350 | 381 | -31 |
27 | 352 | 349 | 3 |
28 | 351 | 336 | 15 |
29 | 352 | 350 | 2 |
30 | 351 | 349 | 2 |
31 | 351 | 331 | 20 |
32 | 349 | 367 | -18 |
33 | 351 | 352 | -1 |
34 | 349 | 352 | -3 |
35 | 349 | 350 | -1 |
36 | 351 | 365 | -14 |
37 | 353 | 346 | 7 |
38 | 351 | 351 | 0 |
39 | 348 | 350 | -2 |
40 | 350 | 349 | 1 |
41 | 379 | 363 | 16 |
42 | 350 | 347 | 3 |
43 | 335 | 346 | -11 |
44 | 352 | 347 | 5 |
45 | 350 | 352 | -2 |
46 | 351 | 347 | 4 |
47 | 350 | 368 | -18 |
48 | 353 | 347 | 6 |
49 | 350 | 335 | 15 |
50 | 366 | 350 | 16 |
avg | 351.5686275 | 350.745098 | 0.823529412 |
- void reuseOrNot() {
- SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
- config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer reuseSpeechSynthesizer = new SpeechSynthesizer(config1, null);
-
- final int TEXT_LENGTH = 30;
- final int LOOP_TIMES = 50;
-
- // ---- Init Link ---- //
- {
- var text1 = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- reuseSpeechSynthesizer.SpeakText(text1);
- System.out.println("Init Link use times: " + (System.currentTimeMillis() - s));
- }
-
- var reuseTimes = new ArrayList<Long>();
- var notReuseTimes = new ArrayList<Long>();
-
- for (int i = 0; i <= LOOP_TIMES; i++) {
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- reuseSpeechSynthesizer.SpeakText(text);
- reuseTimes.add(System.currentTimeMillis() - s);
- }
- {
- long s = System.currentTimeMillis();
-
- SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
- config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer notPreInitSpeechSynthesizer = new SpeechSynthesizer(config2, null);
-
- var text = getRandomChinese(TEXT_LENGTH);
- notPreInitSpeechSynthesizer.SpeakText(text);
- notReuseTimes.add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.\tReuse\tNotReuse\tDifference\n");
- for (int i = 0; i < reuseTimes.size(); i++) {
- report.append(i).append("\t");
- report.append(reuseTimes.get(i)).append("\t");
- report.append(notReuseTimes.get(i)).append("\t");
- report.append(reuseTimes.get(i) - notReuseTimes.get(i)).append("\n");
- }
- double reuseTimeAvg = reuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- double notReuseTimeAvg = notReuseTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append("avg").append("\t");
- report.append(reuseTimeAvg).append("\t");
- report.append(notReuseTimeAvg).append("\t");
- report.append(reuseTimeAvg - notReuseTimeAvg).append("\n");
- System.out.println(report);
- }
The same content is synthesized repeatedly.
No difference in synthesis latency.
NO. | SameText (ms) | NotSameText (ms) | Difference (ms) |
0 | 363 | 365 | -2 |
1 | 350 | 346 | 4 |
2 | 349 | 350 | -1 |
3 | 333 | 349 | -16 |
4 | 349 | 348 | 1 |
5 | 350 | 352 | -2 |
6 | 335 | 350 | -15 |
7 | 335 | 347 | -12 |
8 | 350 | 365 | -15 |
9 | 348 | 348 | 0 |
10 | 350 | 347 | 3 |
11 | 350 | 350 | 0 |
12 | 349 | 352 | -3 |
13 | 350 | 350 | 0 |
14 | 334 | 348 | -14 |
15 | 348 | 350 | -2 |
16 | 352 | 349 | 3 |
17 | 335 | 351 | -16 |
18 | 351 | 347 | 4 |
19 | 348 | 353 | -5 |
20 | 349 | 347 | 2 |
21 | 335 | 347 | -12 |
22 | 349 | 347 | 2 |
23 | 331 | 351 | -20 |
24 | 350 | 352 | -2 |
25 | 349 | 334 | 15 |
26 | 350 | 363 | -13 |
27 | 349 | 349 | 0 |
28 | 350 | 347 | 3 |
29 | 350 | 352 | -2 |
30 | 348 | 351 | -3 |
31 | 351 | 350 | 1 |
32 | 350 | 320 | 30 |
33 | 348 | 349 | -1 |
34 | 380 | 351 | 29 |
35 | 347 | 353 | -6 |
36 | 347 | 340 | 7 |
37 | 349 | 347 | 2 |
38 | 349 | 351 | -2 |
39 | 333 | 350 | -17 |
40 | 366 | 353 | 13 |
41 | 363 | 347 | 16 |
42 | 353 | 348 | 5 |
43 | 350 | 353 | -3 |
44 | 351 | 349 | 2 |
45 | 350 | 347 | 3 |
46 | 349 | 351 | -2 |
47 | 348 | 348 | 0 |
48 | 337 | 350 | -13 |
49 | 332 | 352 | -20 |
50 | 352 | 349 | 3 |
avg | 347.9215686 | 349.3137255 | -1.392156863 |
- void sameText() throws InterruptedException {
- SpeechConfig config1 = SpeechConfig.fromSubscription("{key}", "{region}");
- config1.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config1.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer sameSpeechSynthesizer = new SpeechSynthesizer(config1, null);
-
- SpeechConfig config2 = SpeechConfig.fromSubscription("{key}", "{region}");
- config2.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
- config2.setSpeechSynthesisVoiceName("zh-CN-XiaochenMultilingualNeural");
- SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(config2, null);
-
- final int TEXT_LENGTH = 30;
- final int LOOP_TIMES = 50;
-
- var sameTextTimes = new ArrayList<Long>();
- var notSameTextTimes = new ArrayList<Long>();
-
- var sameText = getRandomChinese(TEXT_LENGTH);
- for (int i = 0; i <= LOOP_TIMES; i++) {
- {
- long s = System.currentTimeMillis();
- sameSpeechSynthesizer.SpeakText(sameText);
- sameTextTimes.add(System.currentTimeMillis() - s);
- }
-
- {
- var text = getRandomChinese(TEXT_LENGTH);
- long s = System.currentTimeMillis();
- speechSynthesizer.SpeakText(text);
- notSameTextTimes.add(System.currentTimeMillis() - s);
- }
- }
-
- // ---- Build the report ---- //
- StringBuilder report = new StringBuilder();
- report.append("NO.\tSameText\tNotSameText\tDifference\n");
- for (int i = 0; i < sameTextTimes.size(); i++) {
- report.append(i).append("\t");
- report.append(sameTextTimes.get(i)).append("\t");
- report.append(notSameTextTimes.get(i)).append("\t");
- report.append(sameTextTimes.get(i) - notSameTextTimes.get(i)).append("\n");
- }
- double avg1 = sameTextTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- double avg2 = notSameTextTimes.stream().mapToLong(Long::longValue).average().orElse(0);
- report.append("avg").append("\t");
- report.append(avg1).append("\t");
- report.append(avg2).append("\t");
- report.append(avg1 - avg2).append("\n");
- System.out.println(report);
- }
- public static String getRandomChinese(int length) {
- StringBuilder sb = new StringBuilder();
- Random random = new Random();
-
- for (int i = 0; i < length; i++) {
- int codePoint = 0x4e00 + random.nextInt(0x9fa5 - 0x4e00 + 1);
- sb.append((char) codePoint);
- }
-
- return sb.toString();
- }
- private final static String SSML_TEMPLATE = """
- <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="{lang}">
- <voice name="{voiceName}">
- {text}
- </voice>
- </speak>
- """;
-
- public static String buildSsml(String text, String voiceName, String lang) {
- // Literal replace() is used: replaceAll() would interpret '$' and '\' in the replacement text
- return SSML_TEMPLATE
- // If the language is set inside a <lang xml:lang="{lang}"> tag, multilingual speech is no longer recognized
- .replace("{lang}", lang)
- .replace("{voiceName}", voiceName)
- .replace("{text}", text);
- }
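The random Chinese characters used in these tests never contain XML special characters, so buildSsml can insert the text as-is. For arbitrary input that would likely not hold, since characters such as `&` or `<` make the SSML document malformed. A minimal sketch of an escaping helper that could be applied to the text before building the SSML (`SsmlEscape` and `escapeXml` are hypothetical names, not part of the SDK):

```java
class SsmlEscape {
    // Replaces the five XML special characters with their predefined entities.
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("Tom & Jerry <live>"));
    }
}
```

A call such as `buildSsml(SsmlEscape.escapeXml(text), voiceName, lang)` would then keep the generated document well-formed.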
Pre-connecting the SpeechSynthesizer: How to lower speech synthesis latency using Speech SDK - Azure AI services | Microsoft Learn
Reusing the SpeechSynthesizer: How to lower speech synthesis latency using Speech SDK - Azure AI services | Microsoft Learn