
Temporal position of extracted feature frames #129

Open
mart-cerny opened this issue Jan 5, 2019 · 2 comments

Comments

@mart-cerny

Dear Mr. Kobayashi and Mr. Toda,
I found the following issue while rewriting sprocket in C++:
In the feature extractor, the hop size is given in milliseconds, while in the synthesizer it is given as a whole number of samples. This results in imperfect alignment of the analysis and synthesis frames.

For example, for my file of 1047375 samples, pyworld extracts 9501 frames with a 5 ms frame shift, while MLSADF from pysptk would split the file into 9512 frames. This mismatch can smear transients, especially in very long audio files.
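The two frame counts above can be reproduced under assumed settings (fs = 22050 Hz and an FFT length of 1024 are my guesses; neither is stated in the issue): pyworld keeps the 5 ms hop as a fractional 110.25 samples, while pysptk-style framing truncates it to 110 whole samples first. A minimal sketch:

```python
fs = 22050        # assumed sample rate (not stated in the issue)
shiftms = 5.0     # frame shift in milliseconds
fftl = 1024       # assumed FFT/frame length
n = 1047375       # file length in samples, from the report

# pyworld keeps the hop in milliseconds, i.e. a fractional 110.25 samples
frames_pyworld = int(n / (fs * shiftms / 1000.0)) + 1   # -> 9501

# pysptk-style framing truncates the hop to 110 whole samples first
shiftl = int(fs / 1000 * shiftms)                       # -> 110
frames_pysptk = int((n - fftl) / shiftl)                # -> 9512

# after ~9500 hops the two frame grids have drifted apart by
drift = round(9500 * (fs * shiftms / 1000.0 - shiftl))  # -> 2375 samples (~108 ms)
```

Under these assumptions the quarter-sample error per hop accumulates to more than a hundred milliseconds over the whole file, which is consistent with the smeared transients described above.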

The workaround I found: since I only use DIFF_VC, I avoid pyworld entirely, split the file into frames and extract the mel-cepstrum myself:

In feature_extractor.py:

    shiftl = int(self.fs / 1000 * self.shiftms)   # hop size as a whole number of samples
    frame_count = int((len(x) - self.fftl) / shiftl)
    _mcep = np.zeros([frame_count, dim + 1], dtype=np.float64)  # np.float is deprecated
    window_function = np.hanning(self.fftl)
    for i in range(frame_count):
        frame_pos = i * shiftl                    # exact integer sample position of frame i
        frame = x[frame_pos:frame_pos + self.fftl]
        if len(frame) == self.fftl:
            _mcep[i] = pysptk.mcep(frame * window_function, dim, alpha)

When using both VC and DIFF_VC, I recommend saving the time_axis returned by pyworld, converting it to sample positions, and using those in the synthesizer; the high-level synthesis interface can no longer be used in that case.
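A minimal sketch of that conversion, assuming the usual pyworld convention of a time_axis in seconds with one entry per analysis frame (here reconstructed from the frame shift so the snippet stands alone; fs = 22050 Hz is an assumption):

```python
import numpy as np

fs = 22050       # assumed sample rate
shiftms = 5.0    # frame shift in milliseconds
n_frames = 9501  # frame count from the report

# stand-in for the time_axis returned by pyworld:
# analysis times in seconds, one per frame
time_axis = np.arange(n_frames) * shiftms / 1000.0

# nearest sample index of each analysis frame; stepping the synthesis
# filter to these positions keeps it on the true (fractional) hop
# instead of a truncated integer hop
sample_positions = np.round(time_axis * fs).astype(np.int64)
```

With these numbers the last position lands on sample 9500 * 110.25 = 1047375, exactly the file length from the report, whereas 9500 integer hops of 110 samples would stop 2375 samples short.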

Hope this information is interesting to you. Thank you for your amazing work,

Best regards, Mart

@unilight

@mart-cerny This is an interesting insight... So does that mean you are training the GMM on mcep extracted with pysptk? How does the converted speech sound?

@mart-cerny (Author)

@unilight Yes, the feature_extractor.py snippet shown above is the only place in the whole code that creates mceps. I got rid of the pyworld dependency completely, which is complicated by the fact that the dynamic time warping uses the n_pow feature extracted by pyworld.
Perceptually, I believe the conversion results are indistinguishable.
