L i *ddZddlZddlZddlZddlZddlZddlmZddlmZddl m Z m Z m Z m Z e rddlZddlZddlZddlmZddlmZmZmZmZmZmZerddlZerddlZddlZer+ej>ej@j'd Z!e ejDd eejDed fZ#dTd e e$ejDfd ejDfd Z%dUd e e$ejDfd ejDfdZ&dTd e e$ejDfd ejDfdZ' dVd e$de$de e(de)de e(d e e$e*e$e fejdff dZ+dZ,dZ-d e e.e#e#fd e#fdZ/dWde e0ejDfde$d e e0ejDffdZ1dWde e0ejDfde$d e e0ejDffdZ2dXde e0ejDfde0d e(fd!Z3d"ejDd#ejDd ejDfd$Z4 dYd&e(d'e(de(de0d(e e0d)e e5e0e0fd*e)fd+Z6 dZd&e(d,e(d-e0d.e0de(d/e e$de$d0e)d ejDfd1Z7d2e(d e(fd3Z8 d[d2e(d4e$d5e)d6e e(d7e)d ejDf d8Z9dd9d%d:d%dddd;dd9d;ddejtfde(d?e e(d(e e0d7e)d@e$dAe)dBe0dCe e0dDe ejDdEe0dFe e$dGe0dHe0dIe e0dJe)dKejvd ejDf(dLZe(d?e e(d(e e0d7e)d@e$dAe)dBe0dCe e0dDe ejDdEe0dFe e$dGe0dHe0dIe e0dJe)dKejvd e.ejDf(dNZ= d\dOejDdGe0dHe0dIe e0d ejDf dPZ> d\dOejDdGe0dHe0dIe e0d ejDf dQZ? d]dOejDdGe0dHe0dIe e0d ejDf dRZ@ d]dOejDdGe0dHe0dIe e0d ejDf dSZAy)^z Audio processing functions to extract features from audio waveforms. This code is pure numpy to support all frameworks and remove unnecessary dependencies. N)Sequence)BytesIO) TYPE_CHECKINGAnyOptionalUnion)version)is_librosa_availableis_numpy_arrayis_soundfile_availableis_torch_tensoris_torchcodec_availablerequires_backends torchcodecz torch.Tensoraudioreturnct|trEtr+tt j dk\rt ||}|St|||}|St|tjs td|S)a7 Loads `audio` to an np.ndarray object. Args: audio (`str` or `np.ndarray`): The audio to be loaded to the numpy array format. sampling_rate (`int`, *optional*, defaults to 16000): The sampling rate to be used when loading the audio. It should be same as the sampling rate the model you will be using further was trained with. timeout (`float`, *optional*): The timeout value in seconds for the URL request. Returns: `np.ndarray`: A numpy array representing the audio. z0.3.0) sampling_rate)rtimeoutzfIncorrect format used for `audio`. Should be an url linking to an audio, a local path, or numpy array.) isinstancestrrTORCHCODEC_VERSIONr parseload_audio_torchcodecload_audio_librosanpndarray TypeErrorrrrs ^/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/transformers/audio_utils.py load_audior"<sy % # $);w}}W?U)U)%}ME L 'uMSZ[E L rzz * t   Lcttdgddlm}|||d}|j j dj }|S)a Loads `audio` to an np.ndarray object using `torchcodec`. Args: audio (`str` or `np.ndarray`): The audio to be loaded to the numpy array format. sampling_rate (`int`, *optional*, defaults to 16000): The sampling rate to be used when loading the audio. It should be same as the sampling rate the model you will be using further was trained with. Returns: `np.ndarray`: A numpy array representing the audio. rr) AudioDecoderr ) sample_rate num_channels)rrtorchcodec.decodersr%get_all_samplesdatanumpy)rrr%decoders r!rrK+l^<05m!LG  # # % * *1 - 3 3 5E Lr#cfttdg|jds|jdrDtjt t j||j|d}|Stjj|rtj||d}|S)aG Loads `audio` to an np.ndarray object using `librosa`. Args: audio (`str` or `np.ndarray`): The audio to be loaded to the numpy array format. sampling_rate (`int`, *optional*, defaults to 16000): The sampling rate to be used when loading the audio. It should be same as the sampling rate the model you will be using further was trained with. timeout (`float`, *optional*): The timeout value in seconds for the URL request. Returns: `np.ndarray`: A numpy array representing the audio. librosahttp://https://r)srr) rr startswithr.loadrrequestsgetcontentospathisfiler s r!rrss (9+6  "e&6&6z&B WX\\%%I%Q%QRWdefgh L   U}5a8 Lr#F return_formatr force_monorc ttdg|dvrtd|d d}|jdr4t j ||}|j |j}nStjj|r&t|d5}|j}dddntd |tj|5}tj |5}|jd } |j"} |j$} ||| k7rt'j(| | |d } n| }dddddd|r! j*dk7r| j-d} tj} tj.|  | j1| j3d|dk(r| S|dk(r2t5j6| jj9dS|dk(rCt5j6| jj9d| j;dSy#1swYxYw#1swYxYw#1swYxYw#t<$r} td| d} ~ wwxYw)a Load audio from either a local file path or URL and return in specified format. Args: audio (`str`): Either a local file path or a URL to an audio file return_format (`str`): Format to return the audio in: - "base64": Base64 encoded string - "dict": Dictionary with data and format - "buffer": BytesIO object timeout (`int`, *optional*): Timeout for URL requests in seconds force_mono (`bool`): Whether to convert stereo audio to mono sampling_rate (`int`, *optional*): If provided, the audio will be resampled to the specified sampling rate. Returns: `Union[str, Dict[str, Any], io.BytesIO, None]`: - `str`: Base64 encoded audio data (if return_format="base64") - `dict`: Dictionary with 'data' (base64 encoded audio data) and 'format' keys (if return_format="dict") - `io.BytesIO`: BytesIO object containing audio data (if return_format="buffer") r.)base64dictbufferzInvalid return_format: z'. Must be 'base64', 'dict', or 'buffer'N)r/r0r1rbzFile not found: float32dtypeHQ)qualityr axis)formatrr@r>zutf-8r?)r*rIzError loading audio: )r load_audio_as ValueErrorr3r5r6raise_for_statusr7r8r9r:openreadiorsf SoundFile sampleraterIsoxrresamplendimmeanwriteupperseekr> b64encodedecodelower Exception)rr;rr<r audio_bytesresponse audio_filef audio_array original_sr audio_formatr@es r!rJrJsJ6mi[1882=/Ahijj,6   3 4||E7;H  % % '"**K WW^^E "eT" 0j(oo/  0 0/w78 8ZZ $ 0 j) 0Qff9f5 ll  xx  ,+1M"&-- [-ae"fK$/M 0 0 +**a/%***2K mL>wG&,,. %; 0 0 0 0 0 0: 604556soA2I&H?),I&I+AI 9IA=I&?6I&6AI&?I I& I II#I&& J/I==Jc2t|xs t|SN)r rrs r!is_valid_audioris % :OE$::r#c.|xrtd|DS)Nc32K|]}t|ywrg)ri).0audio_is r! z)is_valid_list_of_audio..sFW0Fs)allrhs r!is_valid_list_of_audiorps  FSFFFFr#czt|ttfr t|r|St |r|gSt d)z Ensure that the output is a list of audio. Args: audio (`Union[list[AudioInput], AudioInput]`): The input audio. Returns: list: A list of audio. z=Invalid input type. Must be a single audio or a list of audio)rlisttuplerprirKrhs r!make_list_of_audiorts<%$',B5,I ew T UUr#freq mel_scalec|dvr td|dk(rdtjd|dz zzS|dk(rdtjd|dz zzSd }d }d tjd z }d |zdz }t |tj r+||k\}|tj|||z |zz||<|S||k\r|tj||z |zz}|S)a Convert frequency from hertz to mels. Args: freq (`float` or `np.ndarray`): The frequency, or multiple frequencies, in hertz (Hz). mel_scale (`str`, *optional*, defaults to `"htk"`): The mel frequency scale to use, `"htk"`, `"kaldi"` or `"slaney"`. Returns: `float` or `np.ndarray`: The frequencies on the mel scale. slaneyhtkkaldi6mel_scale should be one of "htk", "slaney" or "kaldi".rzF@?@r{@@@.@;@皙@@i@)rKrlog10logrr)rurv min_log_hertz min_log_mellogstepmels log_regions r! hertz_to_melrs22QRREu !5666 g sdUl3444MKRVVC[ G : D$ #]* &Z0@=0P)QT[)[[Z K  RVVD=$89GCC Kr#rc|dvr td|dk(rdtjd|dz dz zS|dk(rdtj|d z dz zSd }d }tjd d z }d|zdz }t |tj r+||k\}|tj||||z zz||<|S||k\r|tj|||z zz}|S)af Convert frequency from mels to hertz. Args: mels (`float` or `np.ndarray`): The frequency, or multiple frequencies, in mels. mel_scale (`str`, *optional*, `"htk"`): The mel frequency scale to use, `"htk"`, `"kaldi"` or `"slaney"`. Returns: `float` or `np.ndarray`: The frequencies in hertz. rxr|rzr r}r~r{rrrrrrr)rKrpowerexprrr)rrvrrrrurs r! mel_to_hertzrs22QRRETF]3c9:: g tf}-344MKffSkD G 4<# D$ #[( (266'T*=MP[=[2\+]]Z K  rvvg 1C&DEE Kr#tuningbins_per_octavecddd||z zz}tj|t|dz z }|S)a Convert frequency from hertz to fractional octave numbers. Adapted from *librosa*. Args: freq (`float` or `np.ndarray`): The frequency, or multiple frequencies, in hertz (Hz). tuning (`float`, defaults to `0.`): Tuning deviation from the Stuttgart pitch (A440) in (fractional) bins per octave. bins_per_octave (`int`, defaults to `12`): Number of bins per octave. Returns: `float` or `np.ndarray`: The frequencies on the octave scale. g{@@)rlog2float)rurrstuttgart_pitchoctaves r!hertz_to_octaverBs: cf&>??O WWTU?3b89 :F Mr# fft_freqs filter_freqscFtj|}tj|dtj|dz }|ddddf |ddz }|ddddf|ddz }tjtjdtj ||S)a Creates a triangular filter bank. Adapted from *torchaudio* and *librosa*. Args: fft_freqs (`np.ndarray` of shape `(num_frequency_bins,)`): Discrete frequencies of the FFT bins in Hz. filter_freqs (`np.ndarray` of shape `(num_mel_filters,)`): Center frequencies of the triangular filters to create, in Hz. Returns: `np.ndarray` of shape `(num_frequency_bins, num_mel_filters)` rr N)rdiff expand_dimsmaximumzerosminimum)rr filter_diffslopes down_slopes up_slopess r!_create_triangular_filter_bankrWs'','K ^^L! ,r~~i/K KF!SbS&>/K$44Kq!"u AB/I ::bhhqk2::k9#E FFr#Tnum_frequency_bins num_chromarweighting_parametersstart_at_c_chromac tjd||ddd}|t|||z}tj|dd|zz g|f}tjtj|dd|ddz d dgf} tj j |tjd|d j} tjt|d z } tj| | zd |zz|| z } tjdd | ztj| |dfz d zz} |$| tj| |zddd |z zz } |B|\} } | tjtjd||z | z | z d zz|dfz} |rtj| d|dzzd} tj | dddt#d|d z zfS)a Creates a chroma filter bank, i.e a linear transformation to project spectrogram bins onto chroma bins. Adapted from *librosa*. Args: num_frequency_bins (`int`): Number of frequencies used to compute the spectrogram (should be the same as in `stft`). num_chroma (`int`): Number of chroma bins (i.e pitch classes). sampling_rate (`float`): Sample rate of the audio waveform. tuning (`float`): Tuning deviation from A440 in fractions of a chroma bin. power (`float`, *optional*, defaults to 2.0): If 12.0, normalizes each column with their L2 norm. If 1.0, normalizes each column with their L1 norm. weighting_parameters (`tuple[float, float]`, *optional*, defaults to `(5., 2.)`): If specified, apply a Gaussian weighting parameterized by the first element of the tuple being the center and the second element being the Gaussian half-width. start_at_c_chroma (`bool`, *optional*, defaults to `True`): If True, the filter bank will start at the 'C' pitch class. Otherwise, it will start at 'A'. Returns: `np.ndarray` of shape `(num_frequency_bins, num_chroma)` rF)endpointr N)rrg?rr~drCrrgTrHkeepdims rG)rlinspacer concatenatersubtractouterarangeTroundr remainderrtilesumrollascontiguousarrayint)rrrrrrr frequencies freq_bins bins_widthchroma_filters num_chroma2center half_widths r!chroma_filter_bankrmsD++a0BUSTUTVWK_[YcddI1j0@!@ A9MNIIabMIcrN,JC!PSTRU VWJ[[&&y"))AzQT2UVXXN((5,q01K \\.;">j"PR\]`kkNVVDA$6jZ[_9]$]bc#ccdN '"&&1FQY]*^cfincn*oo'1 "'' FF4Y3f< JqPQ R O   zR7G1HqQ   q2SC= 2zRequire min_frequency: z <= max_frequency: )rvr rrrGrzNAt least one mel filter has all zero values. The value for `num_mel_filters` (z?) may be set too high. Or, the value for `num_frequency_bins` (z) may be set too low.) rKrrrrrrrmaxanywarningswarn)rrrrrrrvrmel_minmel_max mel_freqsr fft_bin_widthr mel_filtersenorms r!mel_filter_bankrsj DH,?@@A78J7K5QRR}$2=/ATUbTcdee=I>G=I>G GWo.ABI Y?L!%*|dk(r*tjtj|d}nt d|d|r|d d }||S||kDrt d |d |d tj |}|r||z dznd}|||||z|S)a` Returns an array containing the specified window. This window is intended to be used with `stft`. The following window types are supported: - `"boxcar"`: a rectangular window - `"hamming"`: the Hamming window - `"hann"`: the Hann window - `"povey"`: the Povey window Args: window_length (`int`): The length of the window in samples. name (`str`, *optional*, defaults to `"hann"`): The name of the window function. periodic (`bool`, *optional*, defaults to `True`): Whether the window is periodic or symmetric. frame_length (`int`, *optional*): The length of the analysis frames in samples. Provide a value for `frame_length` if the window is smaller than the frame length, so that it will be zero-padded. center (`bool`, *optional*, defaults to `True`): Whether to center the window inside the FFT buffer. Only used when `frame_length` is provided. Returns: `np.ndarray` of shape `(window_length,)` or `(frame_length,)` containing the window. r boxcar)hamminghamming_window)hann hann_windowpoveyg333333?zUnknown window function ''NrLength of the window (z') may not be larger than frame_length ()rr)ronesrhanningrrKr) rrrrrlengthwindow padded_windowoffsets r!window_functionr#s B#+]Q  F x . .F# ( (F# "**V,d34TF!<== |#$]O3Z[gZhhi j  HH\*M4:l]*q 0F5;M&6M12 r#r~reflect绽|=waveformr hop_length fft_lengthpad_modeonesideddither preemphasisr mel_floorlog_mel reference min_valuedb_rangeremove_dc_offsetrDc t|}||}||kDrtd|d|d||k7rtd|d|d|dkr td|jd k7rtd |jt j |r td | | td |r5t |d zt |d zfg}t j|||}|jtj}|jtj}t d t j|j|z |z z}|r|d zd zn|}t j||ftj}|rtjjntjj}t j |}d}t#|D]}||||z|d|| dk7r.|d|xxx| tj$j'|zz ccc|r|d||d|j)z |d|| '|d |xxx| |d|d z zzccc|dxxd | z zcc<|d|xxx|zccc||||<||z }|(t j*|tj|z}|j,}| 4t j.| t j0| j,|}|| | dk(rt j2|}ng| dk(rt j4|}nL| dk(r9|dk(rt7||||}n3|dk(rt9||||}ntd| d|td| t j:||}|S)av Calculates a spectrogram over one waveform using the Short-Time Fourier Transform. This function can create the following kinds of spectrograms: - amplitude spectrogram (`power = 1.0`) - power spectrogram (`power = 2.0`) - complex-valued spectrogram (`power = None`) - log spectrogram (use `log_mel` argument) - mel spectrogram (provide `mel_filters`) - log-mel spectrogram (provide `mel_filters` and `log_mel`) How this works: 1. The input waveform is split into frames of size `frame_length` that are partially overlapping by `frame_length - hop_length` samples. 2. Each frame is multiplied by the window and placed into a buffer of size `fft_length`. 3. The DFT is taken of each windowed frame. 4. The results are stacked into a spectrogram. We make a distinction between the following "blocks" of sample data, each of which may have a different lengths: - The analysis frame. This is the size of the time slices that the input waveform is split into. - The window. Each analysis frame is multiplied by the window to avoid spectral leakage. - The FFT input buffer. The length of this determines how many frequency bins are in the spectrogram. In this implementation, the window is assumed to be zero-padded to have the same size as the analysis frame. A padded window can be obtained from `window_function()`. The FFT input buffer may be larger than the analysis frame, typically the next power of two. Note: This function is not optimized for speed yet. It should be mostly compatible with `librosa.stft` and `torchaudio.functional.transforms.Spectrogram`, although it is more flexible due to the different ways spectrograms can be constructed. Args: waveform (`np.ndarray` of shape `(length,)`): The input waveform. This must be a single real-valued, mono waveform. window (`np.ndarray` of shape `(frame_length,)`): The windowing function to apply, including zero-padding if necessary. The actual window length may be shorter than `frame_length`, but we're assuming the array has already been zero-padded. frame_length (`int`): The length of the analysis frames in samples. With librosa this is always equal to `fft_length` but we also allow smaller sizes. hop_length (`int`): The stride between successive analysis frames in samples. fft_length (`int`, *optional*): The size of the FFT buffer in samples. This determines how many frequency bins the spectrogram will have. For optimal speed, this should be a power of two. If `None`, uses `frame_length`. power (`float`, *optional*, defaults to 1.0): If 1.0, returns the amplitude spectrogram. If 2.0, returns the power spectrogram. If `None`, returns complex numbers. center (`bool`, *optional*, defaults to `True`): Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`, frame `t` will start at time `t * hop_length`. pad_mode (`str`, *optional*, defaults to `"reflect"`): Padding mode used when `center` is `True`. Possible values are: `"constant"` (pad with zeros), `"edge"` (pad with edge values), `"reflect"` (pads with mirrored values). onesided (`bool`, *optional*, defaults to `True`): If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1` frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins. dither (`float`, *optional*, defaults to 0.0): Adds dithering. In other words, adds a small Gaussian noise to each frame. E.g. use 4.0 to add dithering with a normal distribution centered around 0.0 with standard deviation 4.0, 0.0 means no dithering. Dithering has similar effect as `mel_floor`. It reduces the high log_mel_fbank values for signals with hard-zero sections, when VAD cutoff is present in the signal. preemphasis (`float`, *optional*) Coefficient for a low-pass filter that applies pre-emphasis before the DFT. mel_filters (`np.ndarray` of shape `(num_freq_bins, num_mel_filters)`, *optional*): The mel filter bank. If supplied, applies a this filter bank to create a mel spectrogram. mel_floor (`float`, *optional*, defaults to 1e-10): Minimum value of mel frequency banks. log_mel (`str`, *optional*): How to convert the spectrogram to log scale. Possible options are: `None` (don't convert), `"log"` (take the natural logarithm) `"log10"` (take the base-10 logarithm), `"dB"` (convert to decibels). Can only be used when `power` is not `None`. reference (`float`, *optional*, defaults to 1.0): Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set the loudest part to 0 dB. Must be greater than zero. min_value (`float`, *optional*, defaults to `1e-10`): The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking `log(0)`. For a power spectrogram, the default of `1e-10` corresponds to a minimum of -100 dB. For an amplitude spectrogram, the value `1e-5` corresponds to -100 dB. Must be greater than zero. db_range (`float`, *optional*): Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero. remove_dc_offset (`bool`, *optional*): Subtract mean from waveform on each frame, applied before pre-emphasis. This should be set to `true` in order to get the same results as `torchaudio.compliance.kaldi.fbank` when computing mel filters. dtype (`np.dtype`, *optional*, defaults to `np.float32`): Data type of the spectrogram tensor. If `power` is None, this argument is ignored and the dtype will be `np.complex64`. Returns: `nd.array` containing a spectrogram of shape `(num_frequency_bins, length)` for a regular spectrogram or shape `(num_mel_filters, length)` for a mel spectrogram. Nframe_length (%) may not be larger than fft_length (rr) must equal frame_length (r$hop_length must be greater than zeror 6Input waveform must have only one dimension, shape is :Complex-valued input waveforms are not currently supportedzYou have provided `mel_filters` but `power` is `None`. Mel spectrogram computation is not yet supported for complex-valued spectrogram.Specify `power` to fix this issue.rmoderCrrrdBr~rCannot use log_mel option ' ' with power Unknown log_mel option: )lenrKrUshaper iscomplexobjrpadastypefloat64floorsizeempty complex64fftrfftrrangerandomrandnrVabsrrdotrramplitude_to_db power_to_dbasarray)rrrrrrrrrrrrrrrrrrrDrpadding num_framesr spectrogramfft_funcr@timestep frame_idxs r!r*r*cslKM! j >,7\]g\hhijkk $1-@[\h[iijkllQ?@@}}QRZR`R`Qabcc x UVV }0 1   )*C 0A,BCD66(G(;rzz*H ]]2:: &FQ8==<#?:"MNNOJ2:*/Q. ((J(:;2<<PK'rvv{{BFFJJH XXj !FH:& (H|4K L}  S= =L !Vbiiool.K%K K ! $*=L$9F=Lwi}UZT[!\]]7yAB Bjje4 r# waveform_listc  t|}||}||kDrtd|d|d||k7rtd|d|d|dkr td|D]I}|jd k7rtd |jt j |s@td |rBt |d zt |d zfg}|Dcgc]}t j||| }}|Dcgc] }t|}}t|}t j|Dcgc])}t j|d|t|z fdd+c}|}|jtj}|jtj}t d t j|jd |z |z z}|Dcgc])}t d t j||z |z z+}}|jd}|r|d zd zn|}t j|||ftj}|rtjj ntjj}t j"||f} t%|D]}!|!|z}"|dd|"|"|zf| ddd|f<| dk7rC| ddd|fxx| t j&j(| ddd|fjzz cc<|r-| ddd|fxx| ddd|fj+d dzcc<| 6| ddd |fxx| | ddd|d z fzzcc<| dddfxxd | z zcc<| ddd|fxx|zcc<|| |dd|!f<|(t j,|tj|z}| ,7\]g\hhijkk $1-@[\h[iijkllQ?@@"[ ==A UV^VdVdUefg g ??8 $YZ Z [  )*C 0A,BCD*   FF   '4!"H !! ./JHH*  FF8ac(m!;<:_` a  288D ]]2:: &FQ#8#>#>q#AL#PT^"^__`J]vwSYs1rxx,)>*(LMMNwOw'--a0K2:*/Q. ((K5GHPRP\P\]K'rvv{{BFFJJH XX{J/ 0F:&5 z)#8HxR^G^<^9^#_q-<- S= 1m|m# $&M\MIYBZB`B`1a(a a $  1m|m# $q-<-/?(@(E(E1W[(E(\ \ $  " 1a n$ %vaASS7T)T T % 1a4LA O +Lq-<- F* $,V$4 AyL!!5& ff[ ;uD k;==aSzJjjF3  W0 e &&-K  ((;/K _|3KIW_` #/ Y S[\  #>wi}UZT[!\]]7yAB Bjje4 KPQTUdQeKfga A';);';Q$>?AAgg m ! xnhs?R>$S.S.S  Sr*c`|dkr td|dkr tdt||}tj||d}dtj|tj|z z}|9|dkr tdtj||j|z d}|S)a Converts a power spectrogram to the decibel scale. This computes `10 * log10(spectrogram / reference)`, using basic logarithm properties for numerical stability. The motivation behind applying the log function on the (mel) spectrogram is that humans do not hear loudness on a linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes the (mel) spectrogram features match more closely what humans actually hear. Based on the implementation of `librosa.power_to_db`. Args: spectrogram (`np.ndarray`): The input power (mel) spectrogram. Note that a power spectrogram has the amplitudes squared! reference (`float`, *optional*, defaults to 1.0): Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set the loudest part to 0 dB. Must be greater than zero. min_value (`float`, *optional*, defaults to `1e-10`): The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking `log(0)`. The default of `1e-10` corresponds to a minimum of -100 dB. Must be greater than zero. db_range (`float`, *optional*): Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero. Returns: `np.ndarray`: the spectrogram in decibels r#reference must be greater than zero#min_value must be greater than zeroNa_mina_max$@"db_range must be greater than zerorKrrcliprr*rrrs r!r&r& sBC>??C>??Iy)I''+YdCK"((;/"((92EEFK s?AB Bggk1BX1MUYZ r#cj|dkr td|dkr tdt||}tj||d}dtj|tj|z z}|>|dkr td|jdd }tj|||z d}|S) aj Converts a batch of power spectrograms to the decibel scale. This computes `10 * log10(spectrogram / reference)`, using basic logarithm properties for numerical stability. This function supports batch processing, where each item in the batch is an individual power (mel) spectrogram. Args: spectrogram (`np.ndarray`): The input batch of power (mel) spectrograms. Expected shape is (batch_size, *spectrogram_shape). Note that a power spectrogram has the amplitudes squared! reference (`float`, *optional*, defaults to 1.0): Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set the loudest part to 0 dB. Must be greater than zero. min_value (`float`, *optional*, defaults to `1e-10`): The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking `log(0)`. The default of `1e-10` corresponds to a minimum of -100 dB. Must be greater than zero. db_range (`float`, *optional*): Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero. Returns: `np.ndarray`: the batch of spectrograms in decibels rrArBNrCrFrGr rTrrHr*rrr max_valuess r!r6r6<s:C>??C>??Iy)I''+YdCK"((;/"((92EEFK s?AB B __&4_@ ggkh1FdS r#c`|dkr td|dkr tdt||}tj||d}dtj|tj|z z}|9|dkr tdtj||j|z d}|S)a6 Converts an amplitude spectrogram to the decibel scale. This computes `20 * log10(spectrogram / reference)`, using basic logarithm properties for numerical stability. The motivation behind applying the log function on the (mel) spectrogram is that humans do not hear loudness on a linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes the (mel) spectrogram features match more closely what humans actually hear. Args: spectrogram (`np.ndarray`): The input amplitude (mel) spectrogram. reference (`float`, *optional*, defaults to 1.0): Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set the loudest part to 0 dB. Must be greater than zero. min_value (`float`, *optional*, defaults to `1e-5`): The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking `log(0)`. The default of `1e-5` corresponds to a minimum of -100 dB. Must be greater than zero. db_range (`float`, *optional*): Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero. Returns: `np.ndarray`: the spectrogram in decibels rrArBNrC4@rGrHrJs r!r%r%ms>C>??C>??Iy)I''+YdCK"((;/"((92EEFK s?AB Bggk1BX1MUYZ r#cj|dkr td|dkr tdt||}tj||d}dtj|tj|z z}|>|dkr td|jdd }tj|||z d}|S) a- Converts a batch of amplitude spectrograms to the decibel scale. This computes `20 * log10(spectrogram / reference)`, using basic logarithm properties for numerical stability. The function supports batch processing, where each item in the batch is an individual amplitude (mel) spectrogram. Args: spectrogram (`np.ndarray`): The input batch of amplitude (mel) spectrograms. Expected shape is (batch_size, *spectrogram_shape). reference (`float`, *optional*, defaults to 1.0): Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set the loudest part to 0 dB. Must be greater than zero. min_value (`float`, *optional*, defaults to `1e-5`): The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking `log(0)`. The default of `1e-5` corresponds to a minimum of -100 dB. Must be greater than zero. db_range (`float`, *optional*): Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero. Returns: `np.ndarray`: the batch of spectrograms in decibels rrArBNrCrPrGrLTrrHrMs r!r5r5s2C>??C>??Iy)I''+YdCK"((;/"((92EEFK s?AB B __&4_@ ggkh1FdS r#)>N)rR)NFN)rz)rr)rr)g@rT)NrzF)rTNT)r~rN)r~gh㈵>N)B__doc__r> importlibrOr8rcollections.abcrrtypingrrrrtorchr+rr5 packagingr utilsr r r rrr soundfilerPr.rSrmetadatarr AudioInputrr"rrrboolr?rJrirprrrtrrrrrrsrrrrrBrDr*r?r&r6r%r5r#r!r_s  $66&y'9'9'A'A,'OP 2::~x /CXnE]] ^ eCO,TVT^T^>sBJJ!7QSQ[Q[0eCO4\^\f\f:"#' L6 L6L6c]L6 L6 C= L6  3S#X D 01 L6^;GV j!:- .VV,!uUBJJ./!C!ERWY[YcYcRcLd!H!uUBJJ./!C!ERWY[YcYcRcLd!H%rzz 12EZ]*GbjjG GWYWaWaG4 :D"HVHVHVHV  HV E? HV #5#67 HVHVb',[[[[ [  [ 3- [[!%[ZZ[| 5c 5c 5"& << <<3- <  < ZZ <J!% #'(,! $"jj'PjjP JJPP P  P E? P PPP P%P"**%PPc]PP !P"uo#P$%P& 88'P(ZZ)Pp!% #'(,! $"jj'P #P JJPP P  P E? P PPP P%P"**%PPc]PP !P"uo#P$%P& 88'P( "**)Pj $ 0000uo 0 ZZ 0j $ ....uo . ZZ .f $ ....uo . ZZ .dko**(-*@E*X`afXg*ZZ*r#