# Lecture IV - speech.ee.ntu.edu.tw

AI Alchemy: Encoder, Generator, and Putting Them Together
Hung-yi Lee

Machine Learning: Looking for a Function
- Binary classification: input → function f → Yes/No
- Multi-class classification: input → function f → Class 1, Class 2, …, Class N
- Structured input/output: the function f maps complex objects to complex objects

Examples of structured input/output:
- Speech recognition: f(audio) → text
- Image generation: f("girl with red hair and red eyes") → image
- Summarization: f(document) → (title, summary)

(image source: http://yumekui.pixnet.net/album/photo/13915541%E8%B6%85%E5%A4%A7%EF%BC%8E%E7%B7%B4%E6%88%90%E9%99%A3)

Outline
- Auto-encoder
- Deep Learning
- Deep Generative Model
- Conditional Generation

Deep Learning in One Slide
Many kinds of networks, all of them functions:
- Fully connected feedforward network (MLP): vector → vector
- Convolutional neural network (CNN): matrix → vector
- Recurrent neural network (RNN): sequence → sequence

How to find the function? Given example inputs/outputs as training data: {(x1, y1), (x2, y2), …, (x1000, y1000)}

Auto-encoder
A digit can be represented as a 28 × 28-dim vector, but most 28 × 28-dim vectors are not digits: images of the same digit (e.g. many different "3"s) occupy a small region of the space. Finding that structure is unsupervised learning.

An NN encoder compresses the 28 × 28 = 784-dim input into a low-dimensional code; the code is a compact representation of the input object. An NN decoder can reconstruct the original object from the code. Encoder and decoder are learned together.

Deep Auto-encoder (unsupervised learning)
NN encoder + NN decoder = one deep network, trained so that the output is as close as possible to the input. The layers narrow down to a bottleneck layer (the code) in the middle, then widen back out: input layer → encoder layers → bottleneck → decoder layers → output layer.
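The encoder-decoder idea can be sketched with a purely linear auto-encoder on synthetic data. Everything below (the 784-dim "images" living on a 2-dim subspace, both weight matrices, the learning rate) is made up for illustration; real systems use deep nonlinear networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": 100 points in 784-dim space that really lie on a
# 2-dim subspace (stand-in for "most 784-dim vectors are not digits").
latent = rng.normal(size=(100, 2))
basis = rng.normal(size=(2, 784)) / np.sqrt(784)
X = latent @ basis

# Linear encoder (784 -> 2) and decoder (2 -> 784), learned together.
W_enc = rng.normal(scale=0.1, size=(784, 2))
W_dec = rng.normal(scale=0.1, size=(2, 784))

initial_error = np.mean((X - (X @ W_enc) @ W_dec) ** 2)

lr = 0.05
for _ in range(2000):
    code = X @ W_enc                 # compact representation of the input
    X_hat = code @ W_dec             # reconstruction
    err = X_hat - X                  # "as close as possible" -> squared error
    W_dec -= lr * code.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

recon_error = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
```

Both weight matrices are updated from the same reconstruction error, so the encoder and decoder are learned together, as on the slide.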

Reference: Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

Deep Auto-encoder - Example
Compared with PCA, the codes produced by a deep NN encoder preserve the structure of the data better.

With a 32-dim code, a t-SNE plot of the codes separates the digit classes far more cleanly than a t-SNE plot of the raw pixels.

Word Embedding
The machine learns the meaning of words from reading a lot of documents, without supervision: words with related meanings (tree, flower; run, jump; dog, rabbit, cat) end up close together in the embedding space. To learn more: https://www.youtube.com/watch?v=X7PH3NuYW0Q

A word can be understood by its context: words that appear in similar contexts get similar vectors. "You shall know a word by the company it keeps."

Word Embedding Characteristics
Word vectors support solving analogies, e.g. Rome : Italy = Berlin : ?
Compute V(Berlin) - V(Rome) + V(Italy), then find the word w whose V(w) is closest to the result.

Word Embedding - Demo
The model used in the demo was provided for the course (part of the project was done by a TA); the training data is from PTT.
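The analogy recipe above can be sketched directly. The toy vectors below are invented for illustration (a real model would learn them from a large corpus); the search excludes the query words themselves, as is standard.

```python
import numpy as np

# Hypothetical toy embeddings; real ones come from training on text.
V = {
    "Rome":    np.array([1.0, 0.2, 0.0]),
    "Italy":   np.array([1.0, 0.2, 1.0]),
    "Berlin":  np.array([0.0, 0.9, 0.0]),
    "Germany": np.array([0.0, 0.9, 1.0]),
    "dog":     np.array([0.5, 0.5, 0.2]),
}

def solve_analogy(a, b, c):
    """a : b = c : ?  ->  the word w whose V(w) is closest (cosine) to V(b) - V(a) + V(c)."""
    target = V[b] - V[a] + V[c]
    best, best_sim = None, -np.inf
    for w, vec in V.items():
        if w in (a, b, c):
            continue                       # exclude the query words
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

answer = solve_analogy("Rome", "Italy", "Berlin")   # expect "Germany"
```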

Audio Word to Vector
The machine has no prior knowledge; like an infant, it simply listens to lots of audio books [Chung et al., Interspeech 16]. The task is dimension reduction for sequences of variable length: word-level audio segments (e.g. "dog", "dogs", "never", "ever") are each mapped to a fixed-length vector.

Sequence-to-sequence Auto-encoder
An RNN encoder reads the acoustic features x1 x2 x3 x4 of an audio segment; its final state is the vector we want, and it can represent the whole segment. How to train the RNN encoder? Pair it with an RNN decoder that outputs y1 y2 y3 y4 as close as possible to the input acoustic features x1 x2 x3 x4. The RNN encoder and decoder are jointly trained.

Visualizing the embedding vectors of words: fear / near and fame / name, or say / says, hand / hands, day / days, word / words - the vector differences between such pairs are consistent, so the space captures regular (phonetic) relations.

Audio Word to Vector - Application
Query-by-example spoken term detection: a user speaks a query (e.g. "US President"), and the system finds the query term in spoken content by computing similarity at the acoustic level, with no recognition step.

Off-line: divide the audio archive into variable-length audio segments and map each segment to a vector (audio segment to vector).
On-line: map the spoken query to a vector, run a similarity search among the segment vectors, and return the search results. This is query-by-example spoken term detection.

Experiment (MAP vs. training epochs for the sequence auto-encoder): SA (sequence auto-encoder) improves with training; DSA (de-noising sequence auto-encoder, input: clean speech + noise, output: clean speech) does better still.

Next Step
Can we include semantics? Today walk / walked and dog / dogs are grouped by sound rather than by meaning (cat, cats, run, flower, tree).

Creation
Can machines create - drawing? writing poems? (image source: http://www.rb139.com/index.php?s=/Lot/44547)

Outline
- Deep Generative Model: component-by-component, VAE, GAN

Component-by-component
Images are composed of pixels; to create an image, generate one pixel at a time. E.g. for a 3 × 3 image, an RNN predicts each next pixel from the pixels generated so far. This can be trained with just a large collection of images, without any annotation.

Component-by-component - Pokémon
Small images of 792 Pokémon: can the machine learn to create new Pokémon? Don't catch them - create them! (Source of images: http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stats_(Generation_VI))
Original images are 40 × 40, downscaled to 20 × 20; the model is a 1-layer RNN with 512 LSTM cells. Given real Pokémon never seen by the machine, with 50% or 75% of the image covered, the model completes the rest. It is difficult to evaluate generation.
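The pixel-at-a-time procedure can be sketched as follows. The conditional model below is a hand-made toy (it just copies the local statistics of the pixels so far); in the lecture the role of `pixel_prob` is played by a trained 1-layer LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

def pixel_prob(previous):
    """Toy stand-in for the RNN: P(next pixel = 1 | pixels so far).
    A real model would learn this distribution from a collection of images."""
    if len(previous) == 0:
        return 0.5
    return 0.2 + 0.6 * (sum(previous) / len(previous))

def generate(n=3):
    """Generate an n x n binary image one pixel at a time."""
    pixels = []
    for _ in range(n * n):
        p = pixel_prob(pixels)
        pixels.append(int(rng.random() < p))   # sample, don't take the argmax
    return np.array(pixels).reshape(n, n)

img = generate(3)
```

Sampling (rather than always taking the most probable pixel) is what lets the same model produce different images on different runs.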

Component-by-component - Drawing from scratch needs some randomness: sampling from the predicted distribution, rather than always taking the most likely value, keeps the outputs varied.
Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu. "WaveNet: A Generative Model for Raw Audio." arXiv preprint, 2016.
Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu. "Video Pixel Networks." arXiv preprint, 2016.

VAE (Variational Auto-Encoder)

Remember the auto-encoder? Encoder and decoder are trained so the output is as close as possible to the input. Now take only the trained NN decoder and feed it a randomly generated vector as the code - do we get an image? Usually not a good one: the decoder has only ever seen codes that came from real inputs.

With a 2-dimensional code, we can sweep each code dimension from -1.5 to 1.5 and decode every grid point, visualizing how the output image changes across the code space.

VAE
In a plain auto-encoder the NN encoder maps the input to a single code. In a VAE the NN encoder outputs two sets of numbers for each input: (m1, m2, m3) and (σ1, σ2, σ3). Sample noise (e1, e2, e3) from a normal distribution and form the code

  c_i = m_i + exp(σ_i) · e_i

The NN decoder reconstructs the input from c. Minimize the reconstruction error plus the regularizer

  Σ_{i=1..3} ( exp(σ_i) - (1 + σ_i) + (m_i)² )

(Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114)
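The reparameterized code and the regularizer can be written out directly. The encoder outputs below are invented numbers standing in for what a real NN encoder would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input; a real NN encoder produces
# both vectors (3-dim here, matching the slide's m1..m3 and sigma1..sigma3).
m = np.array([0.5, -0.3, 0.1])
sigma = np.array([-1.0, -0.5, -2.0])

e = rng.normal(size=3)            # noise drawn from a normal distribution
c = m + np.exp(sigma) * e         # the code that is fed to the NN decoder

# Regularizer from the slide; exp(s) - (1 + s) >= 0 for all s, so this
# pushes exp(sigma) toward 1 and m toward 0 instead of letting the noise vanish.
reg = np.sum(np.exp(sigma) - (1 + sigma) + m ** 2)

# Total VAE loss = reconstruction error + reg (reconstruction omitted here).
```

Without the regularizer, the encoder could set exp(σ) to 0 and recover a plain auto-encoder; the regularizer keeps the noise alive.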

Why VAE? Intuitive reason: the noise added to the code means each input must reconstruct well from a whole neighborhood of codes, so nearby codes decode to similar images and the code space becomes smooth enough to sample from.

Problems of VAE
It does not really try to simulate realistic images: the loss is pixel-wise reconstruction error, so two outputs that each differ from the target by one pixel get the same loss - even if one placement looks realistic and the other obviously fake.

GAN (Generative Adversarial Network)

Yann LeCun's comment: https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning

Evolution analogy: Kallima inachus, the dead-leaf butterfly (http://peellden.pixnet.net/blog/post/404068992013-%E7%AC%AC%E5%9B%9B%E5%AD%A3%EF%BC%8C%E5%86%AC%E8%9D%B6%E5%AF%82%E5%AF%A5). A predator notices "butterflies are not brown", so brown butterflies survive; then "butterflies do not have veins", so leaf-vein patterns evolve. The evolution of generation works the same way: NN generator v1, v2, v3 against discriminator v1, v2, v3, each forcing the other to improve.

GAN in practice (DCGAN: https://github.com/carpedm20/DCGAN-tensorflow; source of images: https://zhuanlan.zhihu.com/p/24767059): compared with real images, the generated samples after 100, 1,000, 2,000, 5,000, 10,000, 20,000, and 50,000 training rounds gradually improve.

Basic Idea of GAN
The data we want to generate has a distribution P_data(x): some regions of image space have high probability, most have low probability.

A generator G is a network, and the network defines a probability distribution P_G(x): sample z from a normal distribution and output x = G(z). We want P_G(x) to be as close as possible to P_data(x), but the divergence is difficult to compute because we do not know what either distribution looks like - we can only sample from them. (https://blog.openai.com/generative-models/)

Discriminator v1 is trained to output 1 for sampled real images and 0 for images from generator v1. It can be proved that the loss of the discriminator is related to the JS divergence between P_data and P_G.

Next step: update the parameters of the generator to minimize the JS divergence - that is, so that the generator's output gets classified as real (as close to 1 as possible). Generator + discriminator together form one network; use gradient descent to update the parameters of the generator while keeping the discriminator fixed. The alternation produces generator v2 against discriminator v1, and so on.

Original GAN is hard to train → W-GAN. (http://www.guokr.com/post/773890/)
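The alternating training can be sketched on 1-D "images", with a logistic discriminator and a linear generator and all gradients written out by hand. Every number here (real-data mean 3, learning rate, step counts) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy 1-D "images": real data ~ N(3, 1); generator G(z) = a*z + c, z ~ N(0, 1).
a, c = 1.0, 0.0                          # generator parameters
w, b = 0.1, 0.0                          # discriminator D(x) = sigmoid(w*x + b)
lr, n = 0.05, 64

for _ in range(300):
    z = rng.normal(size=n)
    x_real, x_fake = rng.normal(3.0, 1.0, size=n), a * z + c

    # 1) Update the discriminator: real -> 1, generated -> 0 (generator fixed).
    d_real, d_fake = sigmoid(w * x_real + b), sigmoid(w * x_fake + b)
    gt_real, gt_fake = -(1 - d_real), d_fake          # dLoss/d(logit)
    w -= lr * np.mean(gt_real * x_real + gt_fake * x_fake)
    b -= lr * np.mean(gt_real + gt_fake)

    # 2) Update the generator so its output is classified as real
    #    (as close to 1 as possible), with the discriminator fixed.
    d_fake = sigmoid(w * (a * z + c) + b)
    gt = -(1 - d_fake)                                # push D(G(z)) toward 1
    a -= lr * np.mean(gt * w * z)
    c -= lr * np.mean(gt * w)

# The generator's samples should have drifted from mean 0 toward the real data.
fake_mean = np.mean(a * rng.normal(size=1000) + c)
```

Step 2 is exactly "generator + discriminator = a network": the gradient flows through the fixed discriminator into the generator's parameters.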

Why is GAN hard to train?
P_data and P_G typically have almost no overlap: both concentrate on low-dimensional manifolds in image space. Whenever two distributions do not overlap, their JS divergence is log 2, no matter how far apart they are:

  JS(P_G0, P_data) = log 2
  JS(P_G50, P_data) = log 2   (closer, but the divergence says "not really better")
  JS(P_G100, P_data) = 0      (only once they finally overlap)

So JS divergence gives the generator no signal while the distributions are still disjoint. WGAN uses the Wasserstein (earth mover's) distance instead of JS divergence:

  W(P_G0, P_data) = d_0
  W(P_G50, P_data) = d_50 < d_0   (moving closer is rewarded)
  W(P_G100, P_data) = 0

WGAN
The same alternation as before: NN generator v1 vs. discriminator v1, then generator v2 vs. discriminator v2, then generator v3 vs. discriminator v3.

Poem generation with WGAN: real poems vs. randomly generated ones.
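A sketch of the WGAN critic on the same 1-D toy setting, with a fixed, untrained generator. The linear critic, the clipping bound, and all data are assumptions for illustration; real WGANs use deep critics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the real distribution and from a fixed, untrained generator.
x_real = rng.normal(3.0, 1.0, size=256)
x_fake = rng.normal(0.0, 1.0, size=256)

# Linear critic f(x) = w * x (bias omitted: it cancels in the difference).
# WGAN trains the critic to maximize  mean f(real) - mean f(generated),
# keeping it Lipschitz by clipping the weight after every update.
w, clip_c, lr = 0.0, 0.1, 0.01
for _ in range(100):
    grad = np.mean(x_real) - np.mean(x_fake)   # d/dw of the objective
    w = np.clip(w + lr * grad, -clip_c, clip_c)

# The resulting value estimates a (scaled) distance between the two
# distributions - it shrinks as the generator's samples approach the real ones.
w_distance = w * (np.mean(x_real) - np.mean(x_fake))
```

Unlike the JS-based discriminator, this objective keeps growing as the two sample sets move apart, so "a bit closer" always registers as better.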

So many GANs (just to name a few)
Modifying the optimization of GAN: fGAN, WGAN, Least Squares GAN, Loss-Sensitive GAN, Energy-Based GAN, Boundary-Seeking GAN, Unrolled GAN.
Different structure from the original GAN: Conditional GAN, Semi-supervised GAN, InfoGAN, BiGAN, CycleGAN, DiscoGAN, VAE-GAN.

Conditional Generation
We don't want to simply generate some random stuff; we want to generate based on conditions.
- Caption generation - given condition: an image → "A dog is running."
- Chat-bot - given condition: "Hello" → "Hello. Nice to see you."

E.g. from a sentence: "red hair" → NN encoder → code → NN generator → image? This needs some supervision: paired examples ("red hair" with a red-haired image, "green hair" with a green-haired one), so the generator learns what each condition means.
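The structure - condition vector concatenated with random noise, so one condition can yield many images - can be sketched as follows. The condition encoding, layer sizes, and weights are all made-up stand-ins; a real system trains the encoder and generator on paired data.

```python
import numpy as np

rng = np.random.default_rng(0)

COND_DIM, Z_DIM, IMG_DIM = 4, 8, 16    # made-up sizes for illustration

# Hypothetical encoding of the condition, e.g. the sentence "red hair";
# a real system would produce this with a trained NN encoder.
cond = np.array([1.0, 0.0, 0.0, 0.0])

# One-layer, untrained generator: it takes the condition concatenated with
# random noise z, so the same condition can yield different images.
W = rng.normal(scale=0.1, size=(COND_DIM + Z_DIM, IMG_DIM))

def generate(condition):
    z = rng.normal(size=Z_DIM)         # fresh randomness on every call
    return np.tanh(np.concatenate([condition, z]) @ W)

img1, img2 = generate(cond), generate(cond)   # same condition, different images
```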

Conditional generation examples: "Red hair, long hair", "Black hair, blue eyes", "Blue hair, green eyes".

Text to Text - Summarization
Abstractive summarization: the machine learns to do title generation from 2,000,000 training examples [Yu & Lee, SLT 16].

The model is a sequence-to-sequence network: the input document is a long word sequence x1 x2 x3 … xN, and the summary is a short word sequence y1 y2 y3 y4.

[Examples: documents with human-written titles and machine-generated titles (Chinese text; omitted).]

Video to Text
Can the machine describe what it sees in a video? Sequence-to-sequence learning: video → code → sentence generator, emitting one word at a time ("a", "girl", …) until a period. Demo (MTK) examples: "A girl is running." / "A group of people is knocked down by a tree." / "A group of people is walking in the …"

Image to Text (Image Caption Generation)
Can the machine describe what it sees in an image? Represent the input condition (the image) as a vector with a CNN, and use that vector as the input of the RNN generator, which emits "A", "woman", …, ". (period)" one token at a time. Demo: MTK.

(image source: http://news.ltn.com.tw/photo/politics/breakingnews/975542_1)

To Learn More
- Machine Learning - slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html ; video: https://www.youtube.com/watch?v=fegAeph9UaA&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
- Machine Learning and Having It Deep and Structured - slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS17.html ; video: https://www.youtube.com/watch?v=IzHoNwlCGnE&list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9

Thank you for your attention!
