赞
踩
窗口的长度与 n_fft 需要匹配大小长度;
librosa.feature.melspectrogram()
函数在spectral.py
中,实现过程为:
def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
power=2.0, **kwargs):
S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)
# Build a Mel filter
mel_basis = filters.mel(sr, n_fft, **kwargs)
return np.dot(mel_basis, S)
可以看出 Mel_ 语谱图的计算主要有两个函数构成
_spectrogram()
函数实现filters.mel
函数实现;_spectrogram()
函数实现def _spectrogram(y=None, S=None, n_fft=2048, hop_length=512, power=1):
if S is not None:
# Infer n_fft from spectrogram shape
n_fft = 2 * (S.shape[0] - 1)
else:
# Otherwise, compute a magnitude spectrogram from input
S = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**power
return S, n_fft
其中的核心代码部分为stft()函数,短时傅里叶变化stft函数的实现过程为:
def stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=np.complex64, pad_mode='reflect'): # By default, use the entire frame if win_length is None: win_length = n_fft # Set the default hop, if it's not already specified if hop_length is None: hop_length = int(win_length // 4) fft_window = get_window(window, win_length, fftbins=True) # Pad the window out to n_fft size fft_window = util.pad_center(fft_window, n_fft) # Reshape so that the window can be broadcast fft_window = fft_window.reshape((-1, 1)) # Check audio is valid util.valid_audio(y) # Pad the time series so that frames are centered if center: y = np.pad(y, int(n_fft // 2), mode=pad_mode) # Window the time series. y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length) # Pre-allocate the STFT matrix stft_matrix = np.empty((int(1 + n_fft // 2), y_frames.shape[1]), dtype=dtype, order='F') # how many columns can we fit within MAX_MEM_BLOCK? n_columns = int(util.MAX_MEM_BLOCK / (stft_matrix.shape[0] * stft_matrix.itemsize)) for bl_s in range(0, stft_matrix.shape[1], n_columns): bl_t = min(bl_s + n_columns, stft_matrix.shape[1]) stft_matrix[:, bl_s:bl_t] = fft.fft(fft_window * y_frames[:, bl_s:bl_t], axis=0)[:stft_matrix.shape[0]] return stft_matrix
filters.mel
函数实现输出是 [ n_mels, 1 + (n_fft / 2) ]
这其中需要的注意的是,不同的librosa版本的mel()函数的归一化参数传递的也是不一样的,比如python3.6中的norm可以传递1 或是 None,但是python3.7中就需要传递norm=‘slaney’.
产生一个线性变换矩阵, 将FFT 频率下的bin 的 转化到 Mel 尺度下的bin
@cache(level=10) def mel( sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm="slaney", dtype=np.float32, ): """Create a Mel filter-bank. This produces a linear transformation matrix to project FFT bins onto Mel-frequency bins. Parameters ---------- sr : number > 0 [scalar] sampling rate of the incoming signal n_fft : int > 0 [scalar] number of FFT components n_mels : int > 0 [scalar] number of Mel bands to generate fmin : float >= 0 [scalar] lowest frequency (in Hz) fmax : float >= 0 [scalar] highest frequency (in Hz). If `None`, use ``fmax = sr / 2.0`` htk : bool [scalar] use HTK formula instead of Slaney norm : {None, 'slaney', or number} [scalar] If 'slaney', divide the triangular mel weights by the width of the mel band (area normalization). If numeric, use `librosa.util.normalize` to normalize each filter by to unit l_p norm. See `librosa.util.normalize` for a full description of supported norm values (including `+-np.inf`). Otherwise, leave all the triangles aiming for a peak value of 1.0 dtype : np.dtype The data type of the output basis. By default, uses 32-bit (single-precision) floating point. Returns ------- M : np.ndarray [shape=(n_mels, 1 + n_fft/2)] Mel transform matrix See also -------- librosa.util.normalize Notes ----- This function caches at level 10. Examples -------- >>> melfb = librosa.filters.mel(22050, 2048) >>> melfb array([[ 0. , 0.016, ..., 0. , 0. ], [ 0. , 0. , ..., 0. , 0. ], ..., [ 0. , 0. , ..., 0. , 0. ], [ 0. , 0. , ..., 0. , 0. ]]) Clip the maximum frequency to 8KHz >>> librosa.filters.mel(22050, 2048, fmax=8000) array([[ 0. , 0.02, ..., 0. , 0. ], [ 0. , 0. , ..., 0. , 0. ], ..., [ 0. , 0. , ..., 0. , 0. ], [ 0. , 0. , ..., 0. , 0. ]]) >>> import matplotlib.pyplot as plt >>> fig, ax = plt.subplots() >>> img = librosa.display.specshow(melfb, x_axis='linear', ax=ax) >>> ax.set(ylabel='Mel filter', title='Mel filter bank') >>> fig.colorbar(img, ax=ax) """ if fmax is None: fmax = float(sr) / 2 # Initialize the weights n_mels = int(n_mels) weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype) # Center freqs of each FFT bin fftfreqs = fft_frequencies(sr=sr, n_fft=n_fft) # 'Center freqs' of mel bands - uniformly spaced between limits mel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk) fdiff = np.diff(mel_f) ramps = np.subtract.outer(mel_f, fftfreqs) for i in range(n_mels): # lower and upper slopes for all bins lower = -ramps[i] / fdiff[i] upper = ramps[i + 2] / fdiff[i + 1] # .. then intersect them with each other and zero weights[i] = np.maximum(0, np.minimum(lower, upper)) if norm == "slaney": # Slaney-style mel is scaled to be approx constant energy per channel enorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels]) weights *= enorm[:, np.newaxis] else: weights = util.normalize(weights, norm=norm, axis=-1) # Only check weights if f_mel[0] is positive if not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)): # This means we have an empty channel somewhere warnings.warn( "Empty filters detected in mel frequency basis. " "Some channels will produce empty responses. " "Try increasing your sampling rate (and fmax) or " "reducing n_mels." ) return weights
假设 调用形式如下 :
librosa.feature.melspectrogram(y=current_window ,sr=sample_rate, n_mels=n_mels, fmin=f_min, fmax=f_max, n_fft=nfft, hop_length=hop)
def melspectrogram( y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window="hann", center=True, pad_mode="reflect", power=2.0, **kwargs, ): S, n_fft = _spectrogram( y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power, win_length=win_length, window=window, center=center, pad_mode=pad_mode, ) # Build a Mel filter mel_basis = filters.mel(sr, n_fft, **kwargs) return np.dot(mel_basis, S)
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
返回:
mel 语谱图的输出大小为 [n_mels, n_frames ],
Mel 语谱图的输出结果是, Mel 滤波器组 和 功率谱格式的语谱图 两者共同作用的结果。
[ n_mels, 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft ] ×[ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]
= [n_mels, n_frames ]
注意:
Mel 语谱图函数的输入, 其中的
S
这个参数量, 代表了输入中是否提供了语谱图的输入。
S
如果提供了语谱图输入, 此时S
直接使用输入的语谱图;S
如果没有提供语谱图的输入, 需要根据输入的一维数据y
, 先经过STFT 得到频谱, 然后在取模值的平方, 得到频谱的功率谱形式, 此时的功率谱便作为语谱图。
最终 mel 语谱图的输出大小为 [n_mels, frames ], 由 Mel 滤波器组和短时傅里叶变换STFT 两个 输出矩阵,相乘共同作用到的结果。
该函数中调用了两个函数实现:
根据以上参数
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
调用 STFT ,得到的输出 [ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft, n_frames ]
行数: 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft = 1+ (800/2) = 401
列数:等于帧数, 帧数 n_frames
求法:
n
f
r
a
m
e
s
=
a
u
d
i
o
l
e
n
g
t
h
/
/
h
o
p
l
e
n
g
t
h
+
1
,
n_{frames} = audio_{length} // hop_{length} + 1,
nframes=audiolength//hoplength+1,
带入上式中的, 可以求得:
n_frams = (32000 // 200) + 1 = 161,
故此时的STFT 输出 [ 401, 161]
调用 filter.mels(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm=1 )
函数, 生成一个 Mel 滤波器组;
输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4
其中有几个重要的参数,需要关注以下
该参数与与频率分辨率有关:
sr
采样率为 16 KHZ;
假设将音频的频率分辨率设置为 10 HZ,
频率分辨率 bin = F s N f f t \frac{F_{s}}{N_{fft}} NfftFs = 采 样 率 N f f t 的 点 数 \frac{采样率}{N_{fft}的点数} Nfft的点数采样率
N f f t N_{fft} Nfft = F s 10 \frac{F_{s}}{10} 10Fs = 16000 10 \frac{16000}{10} 1016000 =1600 点;
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。