Audio

Audio processing operations. Separate, enhance, and diarize audio files.

Separate audio into stems

Extract audio stems (vocals, music, effects) from an audio file.

Separation Types

Type                    Output Stems            Description
dialogue-me             speech, me              Speech vs music+effects
dialogue-music-effect   speech, music, effect   3-stem separation
speech-nonlingual       speech, nonlingual      Lingual vs non-lingual

Workflow

  1. Upload audio file via POST /v1/upload

  2. Call this endpoint with file_id

  3. Poll GET /v1/jobs/{job_id} for status

  4. Download results from signed URLs when completed
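The polling step of this workflow can be sketched as follows. Only the `GET /v1/jobs/{job_id}` path and the `completed` state come from the workflow above; the `status` field name, the `failed` value, and the helper name `wait_for_job` are assumptions made for illustration. The HTTP call is injected as a function so the loop can be tried without network access.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    job_id: str,
    fetch_job: Callable[[str], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job until it reaches a terminal state.

    fetch_job is expected to perform GET /v1/jobs/{job_id} and return the
    decoded JSON body. The "status" key and the "failed" value are
    assumptions; "completed" is the state named in the workflow.
    """
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # hypothetical failure state
            raise RuntimeError(f"job {job_id} failed: {job}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

A fixed interval keeps the sketch short; a production client would probably add backoff and handle HTTP errors from `fetch_job`.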

Credits

  • Cost: 1 credit per second of audio

  • Billing: Credits reserved on job creation, confirmed on completion
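As a rough illustration of the pricing above, the reserved amount can be estimated from the clip duration. The rate of 1 credit per second is from this page; rounding partial seconds up and the function name `estimate_credits` are assumptions, since the rounding rule is not stated.

```python
import math

CREDITS_PER_SECOND = 1  # rate stated in the Credits section


def estimate_credits(duration_seconds: float) -> int:
    """Estimate the credits reserved for a clip of the given length.

    Rounding partial seconds up is an assumption; the pricing only
    states 1 credit per second of audio.
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(duration_seconds) * CREDITS_PER_SECOND
```

Under these assumptions, a 90-second file would reserve 90 credits.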

Authorizations
Authorization · string · Required

How to Authenticate

  1. Get your token from the Timbr Dashboard
  2. Click the Authorize button above
  3. Enter only the token (e.g., TBR_abc123def456)
  4. Click Authorize, then Close

Important

  • ✅ Correct: TBR_8b3364b5a772328a
  • ❌ Wrong: Bearer TBR_8b3364b5a772328a

The 'Bearer ' prefix is added automatically by Swagger UI.
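Outside Swagger UI, a client presumably has to add the prefix itself. The sketch below assumes the API expects `Authorization: Bearer <token>` on the wire, which is what Swagger UI produces from the bare token; the helper name `build_auth_header` is hypothetical.

```python
from typing import Dict


def build_auth_header(token: str) -> Dict[str, str]:
    """Return an Authorization header for a raw HTTP client.

    Assumes the API expects "Bearer <token>", matching what Swagger UI
    sends when given the bare token.
    """
    token = token.strip()
    # Tolerate a token pasted with the prefix so it is never doubled.
    if token.lower().startswith("bearer "):
        token = token[len("bearer "):].strip()
    if not token:
        raise ValueError("empty token")
    return {"Authorization": f"Bearer {token}"}
```

Stripping an existing prefix guards against the doubled `Bearer Bearer ...` header that the warning above describes.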

Body

Request to separate audio into stems

Separation runs on RunPod GPU workers using ASS v2 models. The file must be uploaded to GCS (Bronze layer) before separation.

file_id · string · min: 1 · Required

File ID of the uploaded audio (from /v1/audio/upload)

separation_type · all of · Optional

Type of separation to perform

Default: dialogue-me
string · enum · Optional

Available separation types matching RunPod config.yaml

Each type corresponds to a different model and produces different stems:

  • dialogue-me: 2-stem (dialogue, me)
  • dialogue-music-effect: 3-stem (dialogue, music, effect)
  • speech-nonlingual: 2-stem (speech, nonlingual)
Possible values: dialogue-me, dialogue-music-effect, speech-nonlingual
sample_rate · integer · min: 8000 · max: 192000 · Optional

Output sample rate in Hz

Default: 48000
bit_depth · integer · min: 8 · max: 32 · Optional

Output bit depth

Default: 16
output_format · all of · Optional

Output audio format

Default: wav
string · enum · Optional

Supported output audio formats

Possible values:
Responses
200

Job created successfully

application/json
POST /v1/audio/separate
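A request body for this endpoint could be assembled and checked client-side as sketched below. Field names, defaults, and ranges are taken from the Body section above; the function name `build_separate_request` is hypothetical, and `output_format` is passed through unchecked because its allowed values are not listed here.

```python
from typing import Any, Dict

SEPARATION_TYPES = {"dialogue-me", "dialogue-music-effect", "speech-nonlingual"}


def build_separate_request(
    file_id: str,
    separation_type: str = "dialogue-me",
    sample_rate: int = 48000,
    bit_depth: int = 16,
    output_format: str = "wav",
) -> Dict[str, Any]:
    """Validate parameters and return the JSON body for /v1/audio/separate."""
    if not file_id:
        raise ValueError("file_id must be a non-empty string")
    if separation_type not in SEPARATION_TYPES:
        raise ValueError(f"unknown separation_type: {separation_type!r}")
    if not 8000 <= sample_rate <= 192000:
        raise ValueError("sample_rate must be between 8000 and 192000 Hz")
    if not 8 <= bit_depth <= 32:
        raise ValueError("bit_depth must be between 8 and 32")
    # Values other than "wav" are not documented in this reference,
    # so output_format is not validated here.
    return {
        "file_id": file_id,
        "separation_type": separation_type,
        "sample_rate": sample_rate,
        "bit_depth": bit_depth,
        "output_format": output_format,
    }
```

Checking ranges before sending avoids creating a request the server would reject, although the server remains the authority on what is valid.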

Enhance audio quality

Improve audio quality with noise reduction, EQ adjustment, and normalization.

Enhancement Options

Option            Type      Description
denoise           boolean   Reduce background noise (default: true)
normalize         boolean   Normalize audio levels (default: true)
eq_preset         string    EQ preset: balanced, vocal, bass_boost
noise_reduction   float     Noise reduction strength: 0.0-1.0

Workflow

  1. Upload audio file via POST /v1/upload

  2. Call this endpoint with file_id and enhancement options

  3. Poll GET /v1/jobs/{job_id} for status

  4. Download enhanced audio from signed URL when completed

Credits

  • Cost: 1 credit per second of audio

  • Billing: Credits reserved on job creation, confirmed on completion

Authorizations
Authorization · string · Required

Body

Request to enhance audio quality

file_id · string · min: 1 · Required

File ID to enhance

denoise · boolean · Optional

Apply noise reduction

Default: true
noise_reduction · number · max: 1 · Optional

Noise reduction strength (0.0-1.0)

Default: 0.5
eq_preset · all of · Optional

EQ preset to apply

Default: balanced
string · enum · Optional

Available EQ presets for audio enhancement

Possible values: balanced, vocal, bass_boost
compression · boolean · Optional

Apply dynamic range compression

Default: false
normalize · boolean · Optional

Normalize audio levels

Default: true
Responses
200

Job created successfully

application/json
POST /v1/audio/enhance
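The enhancement options can be collected into a request body as in the sketch below. Field names, defaults, the 0.0-1.0 strength range, and the three EQ presets come from this section; the function name `build_enhance_request` is hypothetical.

```python
from typing import Any, Dict

EQ_PRESETS = {"balanced", "vocal", "bass_boost"}


def build_enhance_request(
    file_id: str,
    denoise: bool = True,
    noise_reduction: float = 0.5,
    eq_preset: str = "balanced",
    compression: bool = False,
    normalize: bool = True,
) -> Dict[str, Any]:
    """Validate parameters and return the JSON body for /v1/audio/enhance."""
    if not file_id:
        raise ValueError("file_id must be a non-empty string")
    if not 0.0 <= noise_reduction <= 1.0:
        raise ValueError("noise_reduction must be between 0.0 and 1.0")
    if eq_preset not in EQ_PRESETS:
        raise ValueError(f"unknown eq_preset: {eq_preset!r}")
    return {
        "file_id": file_id,
        "denoise": denoise,
        "noise_reduction": noise_reduction,
        "eq_preset": eq_preset,
        "compression": compression,
        "normalize": normalize,
    }
```

For example, `build_enhance_request("file_123", eq_preset="vocal", noise_reduction=0.8)` yields a body that keeps the other defaults listed above.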

Diarize audio with speaker identification

Perform speaker diarization with optional transcription.

Features

  • Speaker Identification: Detect and label different speakers

  • Transcription: Optional speech-to-text with Whisper

  • Language Support: Multiple languages supported

Parameters

Parameter          Type      Description
file_id            string    File ID from upload
num_speakers       integer   Expected number of speakers (optional)
language           string    Language code (e.g., en, ko)
transcribe_model   string    Transcription model: whisper

Output Formats

  • timbr: Standard SRT format with speaker names

  • sentence_split: Sentence-segmented output

Workflow

  1. Upload audio file via POST /v1/upload

  2. Call this endpoint with file_id and options

  3. Poll GET /v1/jobs/{job_id} for status

  4. Download results from signed URLs when completed

Credits

  • Cost: 1 credit per second of audio

  • Billing: Credits reserved on job creation, confirmed on completion

Authorizations
Authorization · string · Required

Body

Request for audio diarization (speaker identification + transcription)

The diarization pipeline:

  1. VAD: Voice Activity Detection to find speech segments
  2. STT: Speech-to-Text transcription
  3. Speaker Embedding: Extract speaker features
  4. Clustering: Group segments by speaker
  5. SRT Generation: Create subtitles with speaker labels
file_id · string · min: 1 · Required

File ID of the uploaded audio (from /v1/audio/upload)

num_speakers · any of · Optional

Expected number of speakers (auto-detected if null)

integer · min: 1 · max: 20 · Optional
or
null · Optional
language · string · Optional

Language code (e.g., 'en', 'ko', 'ja') or 'auto' for detection

Default: auto
transcribe_model · all of · Optional

Transcription model to use

Default: whisper
string · enum · Optional

Transcription model options

Possible values:
vad_type · all of · Optional

Voice Activity Detection type

Default: ten-vad
string · enum · Optional

Voice Activity Detection type

Possible values:
vad_threshold · number · max: 1 · Optional

VAD sensitivity threshold

Default: 0.5
vad_min_speech_duration · number · max: 5 · Optional

Minimum speech duration in seconds

Default: 0.1
vad_min_silence_duration · number · max: 5 · Optional

Minimum silence duration in seconds

Default: 0.1
detect_gender · boolean · Optional

Detect speaker gender (KBO baseball use case)

Default: false
input_srt_file_id · any of · Optional

Optional pre-existing SRT file for alignment

string · Optional
or
null · Optional
Responses
200

Job created successfully

application/json
POST /v1/audio/diarize
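A diarization request body might be built as sketched below. Field names, defaults, and upper bounds are from the Body section above; the lower bound of 0 on the VAD parameters is an assumption (only maxima are listed), and the function name `build_diarize_request` is hypothetical. `transcribe_model` and `vad_type` are fixed to their documented defaults because their other allowed values are not listed here.

```python
from typing import Any, Dict, Optional


def build_diarize_request(
    file_id: str,
    num_speakers: Optional[int] = None,
    language: str = "auto",
    vad_threshold: float = 0.5,
    vad_min_speech_duration: float = 0.1,
    vad_min_silence_duration: float = 0.1,
    detect_gender: bool = False,
    input_srt_file_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate parameters and return the JSON body for /v1/audio/diarize."""
    if not file_id:
        raise ValueError("file_id must be a non-empty string")
    if num_speakers is not None and not 1 <= num_speakers <= 20:
        raise ValueError("num_speakers must be between 1 and 20, or None")
    # Lower bounds of 0 are assumed; the reference lists only maxima.
    if not 0.0 <= vad_threshold <= 1.0:
        raise ValueError("vad_threshold must be at most 1")
    for name, value in (
        ("vad_min_speech_duration", vad_min_speech_duration),
        ("vad_min_silence_duration", vad_min_silence_duration),
    ):
        if not 0.0 <= value <= 5.0:
            raise ValueError(f"{name} must be at most 5 seconds")
    return {
        "file_id": file_id,
        "num_speakers": num_speakers,  # None becomes JSON null (auto-detect)
        "language": language,
        "transcribe_model": "whisper",  # documented default
        "vad_type": "ten-vad",  # documented default
        "vad_threshold": vad_threshold,
        "vad_min_speech_duration": vad_min_speech_duration,
        "vad_min_silence_duration": vad_min_silence_duration,
        "detect_gender": detect_gender,
        "input_srt_file_id": input_srt_file_id,
    }
```

Leaving `num_speakers` as `None` corresponds to the auto-detect behaviour described for that field.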
