当前位置:   article > 正文

librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram_librosa.feature.melspectrogram

librosa.feature.melspectrogram

窗口的长度与 n_fft 需要匹配大小长度;

1. Mel 语谱图的函数定义

librosa.feature.melspectrogram()函数在spectral.py 中,实现过程为:

def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
                   power=2.0, **kwargs):

    S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)

    # Build a Mel filter
    mel_basis = filters.mel(sr, n_fft, **kwargs)

    return np.dot(mel_basis, S)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

可以看出 Mel_ 语谱图的计算主要有两个函数构成

  1. 计算出信号的语谱图(功率谱形式构成的), 由 _spectrogram() 函数实现
  2. 构造Mel 滤波器, 由filters.mel 函数实现;
  3. 将Mel 滤波器组与语谱图做矩阵乘法, 得到 mel 语谱图;

1.1 _spectrogram() 函数实现

def _spectrogram(y=None, S=None, n_fft=2048, hop_length=512, power=1):
 
    if S is not None:
        # Infer n_fft from spectrogram shape
        n_fft = 2 * (S.shape[0] - 1)
    else:
        # Otherwise, compute a magnitude spectrogram from input
        S = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**power

    return S, n_fft
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10

其中的核心代码部分为stft()函数,短时傅里叶变化stft函数的实现过程为:

def stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann',
         center=True, dtype=np.complex64, pad_mode='reflect'):

    # By default, use the entire frame
    if win_length is None:
        win_length = n_fft

    # Set the default hop, if it's not already specified
    if hop_length is None:
        hop_length = int(win_length // 4)

    fft_window = get_window(window, win_length, fftbins=True)

    # Pad the window out to n_fft size
    fft_window = util.pad_center(fft_window, n_fft)

    # Reshape so that the window can be broadcast
    fft_window = fft_window.reshape((-1, 1))

    # Check audio is valid
    util.valid_audio(y)

    # Pad the time series so that frames are centered
    if center:
        y = np.pad(y, int(n_fft // 2), mode=pad_mode)

    # Window the time series.
    y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length)

    # Pre-allocate the STFT matrix
    stft_matrix = np.empty((int(1 + n_fft // 2), y_frames.shape[1]),
                           dtype=dtype,
                           order='F')

    # how many columns can we fit within MAX_MEM_BLOCK?
    n_columns = int(util.MAX_MEM_BLOCK / (stft_matrix.shape[0] *
                                          stft_matrix.itemsize))

    for bl_s in range(0, stft_matrix.shape[1], n_columns):
        bl_t = min(bl_s + n_columns, stft_matrix.shape[1])

        stft_matrix[:, bl_s:bl_t] = fft.fft(fft_window *
                                            y_frames[:, bl_s:bl_t],
                                            axis=0)[:stft_matrix.shape[0]]

    return stft_matrix
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46

1.2 filters.mel 函数实现

输出是 [ n_mels, 1 + (n_fft / 2) ]

这其中需要的注意的是,不同的librosa版本的mel()函数的归一化参数传递的也是不一样的,比如python3.6中的norm可以传递1 或是 None,但是python3.7中就需要传递norm=‘slaney’.

产生一个线性变换矩阵, 将FFT 频率下的bin 的 转化到 Mel 尺度下的bin

@cache(level=10)
def mel(
    sr,
    n_fft,
    n_mels=128,
    fmin=0.0,
    fmax=None,
    htk=False,
    norm="slaney",
    dtype=np.float32,
):
    """Create a Mel filter-bank.

    This produces a linear transformation matrix to project
    FFT bins onto Mel-frequency bins.

    Parameters
    ----------
    sr        : number > 0 [scalar]
        sampling rate of the incoming signal

    n_fft     : int > 0 [scalar]
        number of FFT components

    n_mels    : int > 0 [scalar]
        number of Mel bands to generate

    fmin      : float >= 0 [scalar]
        lowest frequency (in Hz)

    fmax      : float >= 0 [scalar]
        highest frequency (in Hz).
        If `None`, use ``fmax = sr / 2.0``

    htk       : bool [scalar]
        use HTK formula instead of Slaney

    norm : {None, 'slaney', or number} [scalar]
        If 'slaney', divide the triangular mel weights by the width of the mel band
        (area normalization).

        If numeric, use `librosa.util.normalize` to normalize each filter by to unit l_p norm.
        See `librosa.util.normalize` for a full description of supported norm values
        (including `+-np.inf`).

        Otherwise, leave all the triangles aiming for a peak value of 1.0

    dtype : np.dtype
        The data type of the output basis.
        By default, uses 32-bit (single-precision) floating point.

    Returns
    -------
    M         : np.ndarray [shape=(n_mels, 1 + n_fft/2)]
        Mel transform matrix

    See also
    --------
    librosa.util.normalize

    Notes
    -----
    This function caches at level 10.

    Examples
    --------
    >>> melfb = librosa.filters.mel(22050, 2048)
    >>> melfb
    array([[ 0.   ,  0.016, ...,  0.   ,  0.   ],
           [ 0.   ,  0.   , ...,  0.   ,  0.   ],
           ...,
           [ 0.   ,  0.   , ...,  0.   ,  0.   ],
           [ 0.   ,  0.   , ...,  0.   ,  0.   ]])


    Clip the maximum frequency to 8KHz

    >>> librosa.filters.mel(22050, 2048, fmax=8000)
    array([[ 0.  ,  0.02, ...,  0.  ,  0.  ],
           [ 0.  ,  0.  , ...,  0.  ,  0.  ],
           ...,
           [ 0.  ,  0.  , ...,  0.  ,  0.  ],
           [ 0.  ,  0.  , ...,  0.  ,  0.  ]])


    >>> import matplotlib.pyplot as plt
    >>> fig, ax = plt.subplots()
    >>> img = librosa.display.specshow(melfb, x_axis='linear', ax=ax)
    >>> ax.set(ylabel='Mel filter', title='Mel filter bank')
    >>> fig.colorbar(img, ax=ax)
    """

    if fmax is None:
        fmax = float(sr) / 2

    # Initialize the weights
    n_mels = int(n_mels)
    weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype)

    # Center freqs of each FFT bin
    fftfreqs = fft_frequencies(sr=sr, n_fft=n_fft)

    # 'Center freqs' of mel bands - uniformly spaced between limits
    mel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk)

    fdiff = np.diff(mel_f)
    ramps = np.subtract.outer(mel_f, fftfreqs)

    for i in range(n_mels):
        # lower and upper slopes for all bins
        lower = -ramps[i] / fdiff[i]
        upper = ramps[i + 2] / fdiff[i + 1]

        # .. then intersect them with each other and zero
        weights[i] = np.maximum(0, np.minimum(lower, upper))

    if norm == "slaney":
        # Slaney-style mel is scaled to be approx constant energy per channel
        enorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels])
        weights *= enorm[:, np.newaxis]
    else:
        weights = util.normalize(weights, norm=norm, axis=-1)

    # Only check weights if f_mel[0] is positive
    if not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)):
        # This means we have an empty channel somewhere
        warnings.warn(
            "Empty filters detected in mel frequency basis. "
            "Some channels will produce empty responses. "
            "Try increasing your sampling rate (and fmax) or "
            "reducing n_mels."
        )

    return weights

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135

2. 实例分析

假设 调用形式如下 :

librosa.feature.melspectrogram(y=current_window ,sr=sample_rate, n_mels=n_mels, fmin=f_min, fmax=f_max, n_fft=nfft, hop_length=hop)



def melspectrogram(
    y=None,
    sr=22050,
    S=None,
    n_fft=2048,
    hop_length=512,
    win_length=None,
    window="hann",
    center=True,
    pad_mode="reflect",
    power=2.0,
    **kwargs,
):

    S, n_fft = _spectrogram(
        y=y,
        S=S,
        n_fft=n_fft,
        hop_length=hop_length,
        power=power,
        win_length=win_length,
        window=window,
        center=center,
        pad_mode=pad_mode,
    )

    # Build a Mel filter
    mel_basis = filters.mel(sr, n_fft, **kwargs)

    return np.dot(mel_basis, S)

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33

输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4

返回:
mel 语谱图的输出大小为 [n_mels, n_frames ],

Mel 语谱图的输出结果是, Mel 滤波器组 和 功率谱格式的语谱图 两者共同作用的结果。

[ n_mels, 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft ] ×[ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]

= [n_mels, n_frames ]

注意:

Mel 语谱图函数的输入, 其中的S 这个参数量, 代表了输入中是否提供了语谱图的输入。

  1. S 如果提供了语谱图输入, 此时 S 直接使用输入的语谱图;
  2. S 如果没有提供语谱图的输入, 需要根据输入的一维数据y, 先经过STFT 得到频谱, 然后在取模值的平方, 得到频谱的功率谱形式, 此时的功率谱便作为语谱图。

最终 mel 语谱图的输出大小为 [n_mels, frames ], 由 Mel 滤波器组和短时傅里叶变换STFT 两个 输出矩阵,相乘共同作用到的结果。

该函数中调用了两个函数实现:

2.1 STFT 的输出帧数

根据以上参数

输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4

调用 STFT ,得到的输出 [ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]

  1. 行数: 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft = 1+ (800/2) = 401

  2. 列数:等于帧数, 帧数 n_frames 求法:
    n f r a m e s = a u d i o l e n g t h / / h o p l e n g t h + 1 , n_{frames} = audio_{length} // hop_{length} + 1, nframes=audiolength//hoplength+1,
    带入上式中的, 可以求得:
    n_frams = (32000 // 200) + 1 = 161,

故此时的STFT 输出 [ 401, 161]

2.2 filter.mels(sr, n_fft, **kwargs)

调用 filter.mels(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm=1 ) 函数, 生成一个 Mel 滤波器组;

输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4

其中有几个重要的参数,需要关注以下

  1. n f f t n_{fft} nfft 的点数选取, n f f t n_{fft} nfft 的点数越多, CPU需要的运算量也增加,可以分别的频率也精细;

该参数与与频率分辨率有关:
sr 采样率为 16 KHZ;
假设将音频的频率分辨率设置为 10 HZ,

频率分辨率 bin = F s N f f t \frac{F_{s}}{N_{fft}} NfftFs = 采 样 率 N f f t 的 点 数 \frac{采样率}{N_{fft}的点数} Nfft

N f f t N_{fft} Nfft = F s 10 \frac{F_{s}}{10} 10Fs = 16000 10 \frac{16000}{10} 1016000 =1600 点;

2.2.1 Mel 频率

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/羊村懒王/article/detail/288704
推荐阅读
相关标签
  

闽ICP备14008679号