Information technology - Coding of audio-visual objects - Part 3: Audio

General Information

Status: Withdrawn
Publication Date: 15-Dec-1999
Withdrawal Date: 15-Dec-1999
Current Stage: 9599 - Withdrawal of International Standard
Start Date: 20-Dec-2001
Completion Date: 30-Oct-2025

Standard
ISO/IEC 14496-3:1999 - Information technology -- Coding of audio-visual objects
English language
644 pages

Frequently Asked Questions

ISO/IEC 14496-3:1999 is a standard published jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Its full title is "Information technology - Coding of audio-visual objects - Part 3: Audio".

ISO/IEC 14496-3:1999 is classified under the following ICS (International Classification for Standards) categories: 35.040 - Information coding; 35.040.40 - Coding of audio, video, multimedia and hypermedia information. The ICS classification helps identify the subject area and facilitates finding related standards.

ISO/IEC 14496-3:1999 has the following relationships with other standards: it is linked to the amendment ISO/IEC 14496-3:1999/Amd 1:2000 and to the technical corrigendum ISO/IEC 14496-3:1999/Cor 1:2001. Understanding these relationships helps ensure you are using the most current and applicable version of the standard.

Standards Content (Sample)


Contents for Subpart 6
6.1 Scope
6.2 Definitions
6.3 Symbols and abbreviations
6.4 MPEG-4 audio text-to-speech bitstream syntax
6.4.1 MPEG-4 audio TTSSpecificConfig
6.4.2 MPEG-4 audio text-to-speech payload
6.5 MPEG-4 audio text-to-speech bitstream semantics
6.5.1 MPEG-4 audio TTSSpecificConfig
6.5.2 MPEG-4 audio text-to-speech payload
6.6 MPEG-4 audio text-to-speech decoding process
6.6.1 Interface between DEMUX and syntactic decoder
6.6.2 Interface between syntactic decoder and speech synthesizer
6.6.3 Interface from speech synthesizer to compositor
6.6.4 Interface from compositor to speech synthesizer
6.6.5 Interface between speech synthesizer and phoneme/bookmark-to-FAP converter
Annex 6.A (informative) Applications of MPEG-4 audio text-to-speech decoder
Subpart 6 : TTSI
6.1 Scope
This subpart of ISO/IEC 14496-3 specifies the coded representation of MPEG-4 Audio Text-to-Speech (M-TTS)
and its decoder for high quality synthesized speech and for enabling various applications. The exact synthesis
method is not a standardization issue partly because there are already various speech synthesis techniques.
This subpart of ISO/IEC 14496-3 is intended for application to M-TTS functionalities such as those for facial
animation (FA) and moving picture (MP) interoperability with a coded bitstream. The M-TTS functionalities include a
capability of utilizing prosodic information extracted from natural speech. They also include the applications to the
speaking device for FA tools and a dubbing device for moving pictures by utilizing lip shape and input text
information.
Text-to-speech (TTS) synthesis has become a common interface tool and is beginning to play an important role in
various multimedia application areas. For instance, with TTS synthesis functionality, multimedia content with
narration can be composed without recording natural speech. Moreover, TTS synthesis combined with facial
animation (FA) or moving picture (MP) functionalities can make the content much richer: TTS technology can be
used as a speech output device for FA tools and for MP dubbing with lip shape information. MPEG-4 defines only
the common interfaces for the TTS synthesizer and for FA/MP interoperability. The M-TTS functionalities can be
considered a superset of the conventional TTS framework. In addition to input text, this TTS synthesizer can use
prosodic information from natural speech and can therefore generate much higher quality synthetic speech. The
interface bitstream format tolerates missing data: if some parameters of the prosodic information are not
available, the missing parameters are generated by pre-established rules. The functionalities of M-TTS thus range
from conventional TTS synthesis to natural speech coding and its application areas, that is, from a simple TTS
synthesis function to functions for FA and MP.
6.2 Definitions
6.2.1 International Phonetic Alphabet; IPA : The worldwide agreed symbol set to represent various phonemes
appearing in human speech.
6.2.2 lip shape pattern : A number that specifies a particular pattern of the preclassified lip shape.
6.2.3 lip synchronization : A functionality that synchronizes speech with corresponding lip shapes.
6.2.4 MPEG-4 Audio Text-to-Speech Decoder : A device that produces synthesized speech by utilizing the
M-TTS bitstream while supporting all the M-TTS functionalities such as speech synthesis for FA and MP dubbing.
6.2.5 moving picture dubbing : A functionality that assigns synthetic speech to the corresponding moving picture
while utilizing lip shape pattern information for synchronization.
6.2.6 M-TTS sentence : This defines the information such as prosody, gender, and age for only the corresponding
sentence to be synthesized.
6.2.7 M-TTS sequence : This defines the control information which affects all M-TTS sentences that follow this
M-TTS sequence.
6.2.8 phoneme/bookmark-to-FAP converter : A device that converts phoneme and bookmark information to
FAPs.
6.2.9 text-to-speech synthesizer : A device producing synthesized speech according to the input sentence
character strings.
6.2.10 trick mode : A set of functions that enables stop, play, forward, and backward operations for users.
6.3 Symbols and abbreviations
F0      fundamental frequency (pitch frequency)
DEMUX   demultiplexer
FA      facial animation
FAP     facial animation parameter
ID      identifier
IPA     International Phonetic Alphabet
MP      moving picture
M-TTS   MPEG-4 Audio TTS
STOD    story teller on demand
TTS     text-to-speech
6.4 MPEG-4 audio text-to-speech bitstream syntax
6.4.1 MPEG-4 audio TTSSpecificConfig
TTSSpecificConfig() {
TTS_Sequence()
}
Table 6.4.1 – Syntax of TTS_Sequence
Syntax                  No. of bits   Mnemonic
TTS_Sequence() {
  TTS_Sequence_ID       5             uimsbf
  Language_Code         18            uimsbf
  Gender_Enable         1             bslbf
  Age_Enable            1             bslbf
  Speech_Rate_Enable    1             bslbf
  Prosody_Enable        1             bslbf
  Video_Enable          1             bslbf
  Lip_Shape_Enable      1             bslbf
  Trick_Mode_Enable     1             bslbf
}
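The fixed 30-bit layout of TTS_Sequence in Table 6.4.1 can be read with a small MSB-first bit reader. The following is a minimal sketch, not a conformant decoder: `BitReader`, `TTSSequence`, and `parse_tts_sequence` are hypothetical names introduced here, and the sketch assumes that fields are packed most significant bit first with no padding before the first field.

```python
from dataclasses import dataclass


class BitReader:
    """Hypothetical helper: reads unsigned integers MSB-first from bytes."""

    def __init__(self, data: bytes) -> None:
        self.value = int.from_bytes(data, "big")
        self.remaining = len(data) * 8  # bits not yet consumed

    def read(self, nbits: int) -> int:
        if nbits > self.remaining:
            raise EOFError("ran out of bits")
        self.remaining -= nbits
        return (self.value >> self.remaining) & ((1 << nbits) - 1)


@dataclass
class TTSSequence:
    sequence_id: int        # TTS_Sequence_ID, 5 bits
    language_code: int      # Language_Code, 18 bits (kept raw here)
    gender_enable: bool
    age_enable: bool
    speech_rate_enable: bool
    prosody_enable: bool
    video_enable: bool
    lip_shape_enable: bool
    trick_mode_enable: bool


def parse_tts_sequence(reader: BitReader) -> TTSSequence:
    """Read the fields in the order listed in Table 6.4.1."""
    sequence_id = reader.read(5)
    language_code = reader.read(18)
    # Seven one-bit flags follow, in table order.
    flags = [bool(reader.read(1)) for _ in range(7)]
    return TTSSequence(sequence_id, language_code, *flags)


# Usage: build a 30-bit header (padded to 4 bytes) and parse it back.
lang = (int.from_bytes(b"en", "big") << 2) | 0b01   # "en" plus dialect bits 01
header_bits = (3 << 25) | (lang << 7) | 0b1010001   # ID 3, then the 7 flags
header = (header_bits << 2).to_bytes(4, "big")      # 2 padding bits at the end
sequence = parse_tts_sequence(BitReader(header))
print(sequence)
```

Keeping Language_Code as a raw integer here is a design choice of the sketch; its two-character and dialect parts can be separated once the semantics in 6.5.1 are applied.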
6.4.2 MPEG-4 audio text-to-speech payload
AlPduPayload {
TTS_Sentence()
}
Table 6.4.2 – Syntax of TTS_Sentence
Syntax                                      No. of bits   Mnemonic
TTS_Sentence() {
  TTS_Sentence_ID                           10            uimsbf
  Silence                                   1             bslbf
  if (Silence) {
    Silence_Duration                        12            uimsbf
  }
  else {
    if (Gender_Enable) {
      Gender                                1             bslbf
    }
    if (Age_Enable) {
      Age                                   3             uimsbf
    }
    if (!Video_Enable && Speech_Rate_Enable) {
      Speech_Rate                           4             uimsbf
    }
    Length_of_Text                          12            uimsbf
    for (j=0; j<Length_of_Text; j++) {
      TTS_Text                              8             bslbf
    }
    if (Prosody_Enable) {
      Dur_Enable                            1             bslbf
      F0_Contour_Enable                     1             bslbf
      Energy_Contour_Enable                 1             bslbf
      Number_of_Phonemes                    10            uimsbf
      Phoneme_Symbols_Length                13            uimsbf
      for (j=0; j<Phoneme_Symbols_Length; j++) {
        Phoneme_Symbols                     8             bslbf
      }
      for (j=0; j<Number_of_Phonemes; j++) {
        if (Dur_Enable) {
          Dur_each_Phoneme                  12            uimsbf
        }
        if (F0_Contour_Enable) {
          Num_F0                            5             uimsbf
          for (k=0; k<Num_F0; k++) {
            F0_Contour_each_Phoneme         8             uimsbf
            F0_Contour_each_Phoneme_Time    12            uimsbf
          }
        }
        if (Energy_Contour_Enable) {
          Energy_Contour_each_Phoneme       8*3=24        uimsbf
        }
      }
    }
    if (Video_Enable) {
      Sentence_Duration                     16            uimsbf
      Position_in_Sentence                  16            uimsbf
      Offset                                10            uimsbf
    }
    if (Lip_Shape_Enable) {
      Number_of_Lip_Shape                   10            uimsbf
      for (j=0; j<Number_of_Lip_Shape; j++) {
        Lip_Shape_in_Sentence               16            uimsbf
        Lip_Shape                           8             uimsbf
      }
    }
  }
}
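The conditional structure of TTS_Sentence can be sketched as a parser driven by the enable flags of the governing TTS_Sequence. This is an illustration under stated assumptions rather than a reference decoder: `BitReader`, `pack_bits`, `parse_tts_sentence`, and the dictionary keys are hypothetical; the loop bounds (Length_of_Text, Phoneme_Symbols_Length, Number_of_Phonemes, Num_F0, Number_of_Lip_Shape) are inferred from the preceding length fields; and the 8*3=24-bit Energy_Contour_each_Phoneme field is assumed to hold three 8-bit values.

```python
class BitReader:
    """Hypothetical helper: reads unsigned integers MSB-first from bytes."""

    def __init__(self, data: bytes) -> None:
        self.value = int.from_bytes(data, "big")
        self.remaining = len(data) * 8

    def read(self, nbits: int) -> int:
        if nbits > self.remaining:
            raise EOFError("ran out of bits")
        self.remaining -= nbits
        return (self.value >> self.remaining) & ((1 << nbits) - 1)


def pack_bits(fields) -> bytes:
    """Hypothetical test helper: pack (value, width) pairs MSB-first."""
    value, total = 0, 0
    for field_value, width in fields:
        value = (value << width) | field_value
        total += width
    pad = -total % 8  # zero bits appended to reach a byte boundary
    return (value << pad).to_bytes((total + pad) // 8, "big")


def parse_tts_sentence(r: BitReader, cfg: dict) -> dict:
    """cfg maps the *_Enable names of TTS_Sequence to truthy/falsy values."""
    s = {"TTS_Sentence_ID": r.read(10), "Silence": r.read(1)}
    if s["Silence"]:
        s["Silence_Duration"] = r.read(12)
        return s  # nothing else is present in a silence sentence
    if cfg["Gender_Enable"]:
        s["Gender"] = r.read(1)
    if cfg["Age_Enable"]:
        s["Age"] = r.read(3)
    if not cfg["Video_Enable"] and cfg["Speech_Rate_Enable"]:
        s["Speech_Rate"] = r.read(4)
    length_of_text = r.read(12)
    s["TTS_Text"] = bytes(r.read(8) for _ in range(length_of_text))
    if cfg["Prosody_Enable"]:
        dur_enable, f0_enable, energy_enable = r.read(1), r.read(1), r.read(1)
        number_of_phonemes = r.read(10)
        symbols_length = r.read(13)
        s["Phoneme_Symbols"] = bytes(r.read(8) for _ in range(symbols_length))
        s["Phonemes"] = []
        for _ in range(number_of_phonemes):
            phoneme = {}
            if dur_enable:
                phoneme["Dur"] = r.read(12)
            if f0_enable:
                num_f0 = r.read(5)
                # Each entry is (F0 value, time position), read in that order.
                phoneme["F0"] = [(r.read(8), r.read(12)) for _ in range(num_f0)]
            if energy_enable:
                # Assumption: 24 bits are three consecutive 8-bit values.
                phoneme["Energy"] = [r.read(8) for _ in range(3)]
            s["Phonemes"].append(phoneme)
    if cfg["Video_Enable"]:
        s["Sentence_Duration"] = r.read(16)
        s["Position_in_Sentence"] = r.read(16)
        s["Offset"] = r.read(10)
    if cfg["Lip_Shape_Enable"]:
        count = r.read(10)
        # Each entry is (Lip_Shape_in_Sentence, Lip_Shape).
        s["Lip_Shapes"] = [(r.read(16), r.read(8)) for _ in range(count)]
    return s


# Usage: a non-silent sentence carrying gender, speech rate, and the text "hi".
cfg = {"Gender_Enable": 1, "Age_Enable": 0, "Speech_Rate_Enable": 1,
       "Prosody_Enable": 0, "Video_Enable": 0, "Lip_Shape_Enable": 0}
data = pack_bits([(97, 10), (0, 1), (1, 1), (8, 4), (2, 12),
                  (ord("h"), 8), (ord("i"), 8)])
sentence = parse_tts_sentence(BitReader(data), cfg)
print(sentence)
```

Returning early on Silence mirrors the if/else in the table: all remaining fields sit inside the else branch.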
6.5 MPEG-4 audio text-to-speech bitstream semantics
6.5.1 MPEG-4 audio TTSSpecificConfig
TTS_Sequence_ID This is a five-bit ID to uniquely identify each TTS object appearing in one scene. Each
speaker in a scene will have a distinct TTS_Sequence_ID.
Language_Code When this is "00" (00110000 00110000 in binary), the IPA is to be sent. For all other languages,
this is the ISO 639 Language Code. In addition to these 16 bits, two bits that represent dialects of each language
are added at the end (user defined).
Gender_Enable This is a one-bit flag which is set to ‘1’ when the gender information exists.
Age_Enable This is a one-bit flag which is set to ‘1’ when the age information exists.
Speech_Rate_Enable This is a one-bit flag which is set to ‘1’ when the speech rate information exists.
Prosody_Enable This is a one-bit flag which is set to ‘1’ when the prosody information exists.
Video_Enable This is a one-bit flag which is set to ‘1’ when the M-TTS decoder works with MP. In this case,
M-TTS should synchronize synthetic speech to MP and accommodate the functionality of ttsForward and
ttsBackward. When the Video_Enable flag is set, the M-TTS decoder uses the system clock to select the adequate
TTS_Sentence frame and fetches the Sentence_Duration, Position_in_Sentence, and Offset data. The TTS
synthesizer assigns an appropriate duration to each phoneme to meet Sentence_Duration. The starting point of
speech in a sentence is decided by Position_in_Sentence. If Position_in_Sentence equals 0 (the starting point is
the beginning of the sentence), TTS uses Offset as a delay time to synchronize synthetic speech to MP.
Lip_Shape_Enable This is a one-bit flag which is set to ‘1’ when the coded input bitstream has lip shape
information. With lip shape information, M-TTS requests the FA tool to change lip shape according to timing
information (Lip_Shape_in_Sentence) and the predefined lip shape pattern.
Trick_Mode_Enable This is a one-bit flag which is set to ‘1’ when the coded input bitstream permits trick mode
functions such as stop, play, forward, and backward.
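The Language_Code semantics above (16 bits of language code followed by two user-defined dialect bits) can be unpacked as follows. This is a minimal sketch: `decode_language_code` is a hypothetical name, and it assumes the 16 bits are two 8-bit character codes, as suggested by the binary pattern given for "00".

```python
def decode_language_code(code: int) -> tuple[str, int, bool]:
    """Split an 18-bit Language_Code into (letters, dialect, is_ipa).

    Assumption: the upper 16 bits are two 8-bit character codes and the
    lowest 2 bits are the user-defined dialect field.
    """
    if not 0 <= code < (1 << 18):
        raise ValueError("Language_Code must fit in 18 bits")
    dialect = code & 0b11
    letters = (code >> 2).to_bytes(2, "big").decode("ascii", errors="replace")
    # "00" (00110000 00110000) signals that IPA symbols are sent.
    return letters, dialect, letters == "00"


# Usage: the IPA marker, and a two-letter code with dialect bits 10.
ipa = decode_language_code(0b0011000000110000 << 2)
korean = decode_language_code((int.from_bytes(b"ko", "big") << 2) | 0b10)
print(ipa, korean)
```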
6.5.2 MPEG-4 audio text-to-speech payload
TTS_Sentence_ID This is a ten-bit ID to uniquely identify a sentence in the M-TTS text data sequence for
indexing purposes. The first five bits are equal to the TTS_Sequence_ID of the speaker defined in subclause 6.1,
and the remaining five bits are the sequential sentence number of each TTS object.
Silence This is a one-bit flag which is set to ‘1’ when the current position is silence.
Silence_Duration This defines the time duration of the current silence segment in milliseconds. It has a value
from 1 to 4095. The value ‘0’ is prohibited.
Gender This is a one-bit flag which is set to ‘1’ if the gender of the synthetic speech producer is male and to ‘0’ if female.
Age This represents the age of the speaker for synthetic speech. The meaning of age is defined in Table 6.5.1.
Table 6.5.1 – Age mapping table
Age   age of the speaker
000   below 6
001   6 - 12
010   13 - 18
011   19 - 25
100   26 - 34
101   35 - 45
110   45 - 60
111   over 60
Speech_Rate This defines the synthetic speech rate in 16 levels. Level 8 corresponds to the normal speed of
the speaker defined in the current speech synthesizer, level 0 corresponds to the slowest speed of the speech
synthesizer, and level 15 corresponds to the fastest speed of the speech synthesizer.
Length_
...
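Two of the payload fields described in 6.5.2 lend themselves to a short illustration: TTS_Sentence_ID, which combines a sequence ID with a sentence number, and the Gender and Age codes of Table 6.5.1. The sketch below uses hypothetical names (`AGE_RANGES`, `split_sentence_id`, `describe_speaker`) and assumes that "the first five bits" means the five most significant bits of the 10-bit value.

```python
# Age code to age range, copied from Table 6.5.1.
AGE_RANGES = {
    0b000: "below 6",
    0b001: "6 - 12",
    0b010: "13 - 18",
    0b011: "19 - 25",
    0b100: "26 - 34",
    0b101: "35 - 45",
    0b110: "45 - 60",
    0b111: "over 60",
}


def split_sentence_id(sentence_id: int) -> tuple[int, int]:
    """Return (TTS_Sequence_ID, sequential sentence number).

    Assumption: the sequence ID occupies the upper five bits.
    """
    if not 0 <= sentence_id < (1 << 10):
        raise ValueError("TTS_Sentence_ID must fit in 10 bits")
    return sentence_id >> 5, sentence_id & 0b11111


def describe_speaker(gender_bit: int, age_code: int) -> str:
    """Render the Gender flag and Age code as readable text."""
    gender = "male" if gender_bit == 1 else "female"
    return f"{gender}, age {AGE_RANGES[age_code]}"


# Usage: sentence 2 of the speaker with sequence ID 3.
sequence_id, sentence_number = split_sentence_id(0b0001100010)
speaker = describe_speaker(1, 0b011)
print(sequence_id, sentence_number, speaker)
```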


ISO/IEC 14496-3:1999(E)
Contents for Subpart 5
5.1 Scope
5.1.1 Overview of section
5.1.1.1 Purpose
5.1.1.2 Introduction to major elements
5.2 Normative references
5.3 Definitions
5.4 Symbols and abbreviations
5.4.1 Mathematical operations
5.4.2 Description methods
5.4.2.1 Bitstream syntax
5.4.2.2 SAOL syntax
5.4.2.3 SASL Syntax
5.5 Bitstream syntax and semantics
5.5.1 Introduction to bitstream syntax
5.5.2 Bitstream syntax
5.6 Object types
5.7 Decoding process
5.7.1 Introduction
5.7.2 Decoder configuration header
5.7.3 Bitstream data and sound creation
5.7.3.1 Relationship with systems layer
5.7.3.2 Bitstream data elements
5.7.3.3 Scheduler semantics
5.7.4 Conformance
5.8 SAOL syntax and semantics
5.8.1 Relationship with bitstream syntax
5.8.2 Lexical elements
5.8.2.1 Concepts
5.8.2.2 Identifiers
5.8.2.3 Numbers
5.8.2.4 String constants
5.8.2.5 Comments
5.8.2.6 Whitespace
5.8.3 Variables and values
5.8.4 Orchestra
5.8.5 Global block
5.8.5.1 Syntactic form
5.8.5.2 Global parameter
5.8.5.3 Global variable declaration
5.8.5.4 Route statement
5.8.5.5 Send statement
5.8.5.6 Sequence specification
5.8.6 Instrument definition
5.8.6.1 Syntactic form
5.8.6.2 Instrument name
5.8.6.3 Parameter fields
5.8.6.4 Preset tag
5.8.6.5 Instrument variable declarations
5.8.6.6 Block of code statements
5.8.6.7 Expressions
5.8.6.8 Standard names
5.8.7 Opcode definition
5.8.7.1 Syntactic Form
5.8.7.2 Rate tag
5.8.7.3 Opcode name
5.8.7.4 Formal parameter list
5.8.7.5 Opcode variable declarations
5.8.7.6 Opcode statement block
5.8.7.7 Opcode rate
5.8.8 Template declaration
5.8.8.1 Syntactic form
5.8.8.2 Semantics
5.8.8.3 Template instrument definitions
5.8.9 Reserved words
5.9 SAOL core opcode definitions and semantics
5.9.1 Introduction
5.9.2 Specialop type
5.9.3 List of core opcodes
5.9.4 Math functions
5.9.4.1 Introduction
5.9.4.2 int
5.9.4.3 frac
5.9.4.4 dbamp
5.9.4.5 ampdb
5.9.4.6 abs
5.9.4.7 sgn
5.9.4.8 exp
5.9.4.9 log
5.9.4.10 sqrt
5.9.4.11 sin
5.9.4.12 cos
5.9.4.13 atan
5.9.4.14 pow
5.9.4.15 log10
5.9.4.16 asin
5.9.4.17 acos
5.9.4.18 ceil
5.9.4.19 floor
5.9.4.20 min
5.9.4.21 max
5.9.5 Pitch converters
5.9.5.1 Introduction to pitch representations
5.9.5.2 gettune
5.9.5.3 settune
5.9.5.4 octpch
5.9.5.5 pchoct
5.9.5.6 cpspch
5.9.5.7 pchcps
5.9.5.8 cpsoct
5.9.5.9 octcps
5.9.5.10 midipch
5.9.5.11 pchmidi
5.9.5.12 midioct
5.9.5.13 octmidi
5.9.5.14 midicps
5.9.5.15 cpsmidi
5.9.6 Table operations
5.9.6.1 ftlen
5.9.6.2 ftloop
5.9.6.3 ftloopend
5.9.6.4 ftsr
5.9.6.5 ftbasecps
5.9.6.6 ftsetloop
5.9.6.7 ftsetend
5.9.6.8 ftsetbase
5.9.6.9 ftsetsr
5.9.6.10 tableread
5.9.6.11 tablewrite
5.9.6.12 oscil
5.9.6.13 loscil
5.9.6.14 doscil
5.9.6.15 koscil
5.9.7 Signal generators
5.9.7.1 kline
5.9.7.2 aline
5.9.7.3 kexpon
5.9.7.4 aexpon
5.9.7.5 kphasor
5.9.7.6 aphasor
5.9.7.7 pluck
5.9.7.8 buzz
5.9.7.9 grain
5.9.8 Noise generators
5.9.8.1 Note on noise generators and pseudo-random sequences
5.9.8.2 irand
5.9.8.3 krand
5.9.8.4 arand
5.9.8.5 ilinrand
5.9.8.6 klinrand
5.9.8.7 alinrand
5.9.8.8 iexprand
5.9.8.9 kexprand
5.9.8.10 aexprand
5.9.8.11 kpoissonrand
5.9.8.12 apoissonrand
5.9.8.13 igaussrand
5.9.8.14 kgaussrand
5.9.8.15 agaussrand
5.9.9 Filters
5.9.9.1 port
5.9.9.2 hipass
5.9.9.3 lopass
5.9.9.4 bandpass
5.9.9.5 bandstop
5.9.9.6 biquad
5.9.9.7 allpass
5.9.9.8 comb
5.9.9.9 fir
5.9.9.10 iir
5.9.9.11 firt
5.9.9.12 iirt
5.9.10 Spectral analysis
5.9.10.1 fft
5.9.10.2 ifft
5.9.11 Gain control
5.9.11.1 rms
5.9.11.2 gain
5.9.11.3 balance
5.9.11.4 compressor
5.9.12 Sample conversion
5.9.12.1 decimate
5.9.12.2 upsamp
5.9.12.3 downsamp
5.9.12.4 samphold
5.9.12.5 sblock
5.9.13 Delays
5.9.13.1 delay
5.9.13.2 delay1
5.9.13.3 fracdelay
5.9.14 Effects
5.9.14.1 reverb
5.9.14.2 chorus
5.9.14.3 flange
5.9.14.4 fx_speedc
5.9.14.5 speedt
5.9.15 Tempo functions
5.9.15.1 gettempo
5.9.15.2 settempo
5.10 SAOL core wavetable generators
5.10.1 Introduction
5.10.2 Sample
5.10.3 Data
5.10.4 Random
5.10.5 Step
5.10.6 Lineseg
5.10.7 Expseg
5.10.8 Cubicseg
5.10.9 Spline
5.10.10 Polynomial
5.10.11 Window
5.10.12 Harm
5.10.13 Harm_phase
5.10.14 Periodic
5.10.15 Buzz
5.10.16 Concat
5.10.17 Empty
5.11 SASL syntax and semantics
5.11.1 Introduction
5.11.2 Syntactic form
5.11.3 Instr line
5.11.4 Control line
5.11.5 Tempo line
5.11.6 Table line
5.11.7 End line
5.12 SAOL/SASL tokenisation
5.12.1 Introduction
5.12.2 SAOL tokenisation
5.12.3 SASL tokenisation
5.13 Sample Bank syntax and semantics
5.13.1 Introduction
5.13.2 Elements of bitstream
5.13.3 Decoding process
5.13.3.1 Object type 2
5.13.3.2 Object type 4
5.14 MIDI semantics
5.14.1 Introduction
5.14.2 Object type 1 decoding process
5.14.3 Mapping MIDI events into orchestra control
5.14.3.1 Introduction
5.14.3.2 MIDI events
5.14.3.3 Standard MIDI Files
5.14.3.4 Default controller values
5.15 Input sounds and relationship with AudioBIFS
5.15.1 Introduction
5.15.2 Input sources and phaseGroup
5.15.3 The AudioFX node
5.15.3.1 Introduction
5.15.3.2 AudioFX orchestra parameters
5.15.3.3 AudioFX orchestra instantiation
5.15.3.4 AudioFX orchestra execution
5.15.3.5 Speed change functionality in the AudioFX node
5.15.4 Interactive 3-D spatial audio scenes
Annex 5.A (normative) Coding tables
Annex 5.B (informative) Encoding
5.B.1. Introduction
5.B.2. Basic encoding
5.B.2.1. Introduction
5.B.2.2. Tokenisation of SAOL data
5.B.2.3. Tokenisation of SASL data
5.B.2.4. Disassembly of sound samples
5.B.2.5 Assembly of decoder configuration information
5.B.2.6 Assembly of streaming bitstream
Annex 5.C (informative) lex/yacc grammars for SAOL
5.C.1 Introduction
5.C.2 Lexical grammar for SAOL in lex
5.C.3 Syntactic grammar for SAOL in yacc
Annex 5.D (informative) PICOLA Speed change algorithm
5.D.1 Tool description
5.D.2 Speed control process
5.D.3 Time scale compression (High speed replay)
5.D.4 Time scale expansion (Low speed replay)
Annex 5.E (informative) Random access to Structured audio bitstreams
5.E.1 Introduction
5.E.2 Difficulties in general-purpose random access
5.E.3 Making Structured Audio bitstreams randomly-accessible
5.E.3.1 Introduction
5.E.3.2 Constructs to avoid
5.E.3.3 Altering bitstreams to make them randomly accessible
Annex 5.F (informative) Directly-connected MIDI and microphone control of the orchestra
5.F.1 Introduction
5.F.2 MIDI controller recommended practices
5.F.3 Live microphone recommended practices
Annex 5.G (informative) Bibliography
Alphabetical Index to Subpart 5 of ISO/IEC 14496-3
Figures
Figure 5.1 - Example of ordering instruments with ‘sequence’
Figure 5.2 - Example of ordering instruments with ‘sequence’
Figure 5.3 - Compressor characteristic function
Figure 5.4 - Block diagram for ‘fracdelay’ example
Figure 5.D.1 - Block Diagram of the Speed Controller
Figure 5.D.2 - Principle of Time Scale Compression
Figure 5.D.3 - Principle of Time Scale Expansion
...
