
Image Captioning

1. Downloading the Data

!mkdir ./mscoco

(1) Train data

!wget http://images.cocodataset.org/zips/train2014.zip
!unzip -q "train2014.zip" -d ./mscoco/
!rm train2014.zip
--2023-05-04 13:48:49--  http://images.cocodataset.org/zips/train2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.217.139.209, 52.217.49.196, 3.5.29.150, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|52.217.139.209|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13510573713 (13G) [application/zip]
Saving to: ‘train2014.zip’

train2014.zip       100%[===================>]  12.58G  33.7MB/s    in 6m 37s  

2023-05-04 13:55:26 (32.5 MB/s) - ‘train2014.zip’ saved [13510573713/13510573713]

(2) Validation data

# !wget http://images.cocodataset.org/zips/val2014.zip
# !unzip -q "val2014.zip" -d ./mscoco/
# !rm val2014.zip

(3) Test data

!wget http://images.cocodataset.org/zips/test2014.zip
!unzip -q "test2014.zip" -d ./mscoco/
!rm test2014.zip
--2023-05-04 13:57:20--  http://images.cocodataset.org/zips/test2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 3.5.21.112, 52.217.97.60, 54.231.195.89, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|3.5.21.112|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6660437059 (6.2G) [application/zip]
Saving to: ‘test2014.zip’

test2014.zip        100%[===================>]   6.20G  34.2MB/s    in 3m 15s  

2023-05-04 14:00:35 (32.6 MB/s) - ‘test2014.zip’ saved [6660437059/6660437059]

Annotation data for train/valid

!wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
!unzip annotations_trainval2014.zip -d ./mscoco
--2023-05-04 14:01:30--  http://images.cocodataset.org/annotations/annotations_trainval2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.140.132, 52.217.136.121, 52.217.132.121, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.140.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252872794 (241M) [application/zip]
Saving to: ‘annotations_trainval2014.zip’

annotations_trainva 100%[===================>] 241.16M  33.8MB/s    in 7.7s    

2023-05-04 14:01:38 (31.2 MB/s) - ‘annotations_trainval2014.zip’ saved [252872794/252872794]

Archive:  annotations_trainval2014.zip
  inflating: ./mscoco/annotations/instances_train2014.json  
  inflating: ./mscoco/annotations/instances_val2014.json  
  inflating: ./mscoco/annotations/person_keypoints_train2014.json  
  inflating: ./mscoco/annotations/person_keypoints_val2014.json  
  inflating: ./mscoco/annotations/captions_train2014.json  
  inflating: ./mscoco/annotations/captions_val2014.json  

Image info data for test (no captions)

!wget http://images.cocodataset.org/annotations/image_info_test2014.zip
!unzip -q "image_info_test2014.zip" -d ./mscoco/
--2023-05-04 14:01:45--  http://images.cocodataset.org/annotations/image_info_test2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.217.171.121, 3.5.28.19, 54.231.163.73, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|52.217.171.121|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 763464 (746K) [application/zip]
Saving to: ‘image_info_test2014.zip’

image_info_test2014 100%[===================>] 745.57K  1.70MB/s    in 0.4s    

2023-05-04 14:01:46 (1.70 MB/s) - ‘image_info_test2014.zip’ saved [763464/763464]

2. Loading the Data

import numpy as np
import matplotlib.pyplot as plt
import random
import pickle
import os
import os.path
import time
import sys
from copy import deepcopy

!pip install nltk
import nltk
nltk.download('punkt')

#from tqdm import tqdm
from PIL import Image
from pycocotools.coco import COCO
import json
from collections import Counter

import torch
import torch.nn as nn
import torchvision.models as models
import torch.optim as optim
import torch.utils.data as data
from torchvision import transforms
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.3)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2022.10.31)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.65.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.2.0)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
# Fix random seeds for reproducibility
seed = 50
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)                # Python RNG
np.random.seed(seed)             # NumPy RNG
torch.manual_seed(seed)          # PyTorch RNG (CPU)
torch.cuda.manual_seed(seed)     # PyTorch RNG (single GPU)
torch.cuda.manual_seed_all(seed) # PyTorch RNG (multi-GPU)
torch.backends.cudnn.deterministic = True # use deterministic cuDNN operations
torch.backends.cudnn.benchmark = False    # disable cuDNN benchmarking
torch.backends.cudnn.enabled = False      # disable cuDNN entirely
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
device(type='cuda')
class Vocabulary(object):
    def __init__(self,
        vocab_threshold,
        vocab_file='./vocab.pkl',
        start_word="<start>",
        end_word="<end>",
        unk_word="<unk>",
        annotations_file='./mscoco/annotations/captions_train2014.json',
        vocab_from_file=False):

        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

        # Initialize the vocabulary.
        self.vocab_threshold = vocab_threshold
        self.vocab_file = vocab_file
        self.start_word = start_word
        self.end_word = end_word
        self.unk_word = unk_word
        self.annotations_file = annotations_file
        self.vocab_from_file = vocab_from_file
        self.get_vocab()

    def get_vocab(self):        
        if os.path.exists(self.vocab_file) and self.vocab_from_file:
            with open(self.vocab_file, 'rb') as f:
                vocab = pickle.load(f)
                self.word2idx = vocab['word2idx']
                self.idx2word = vocab['idx2word']
            print('Vocabulary successfully loaded from vocab.pkl file!')
        else:
            self.build_vocab()
            with open(self.vocab_file, 'wb') as f:
                vocab = {}
                vocab['word2idx'] = self.word2idx
                vocab['idx2word'] = self.idx2word
                pickle.dump(vocab, f)
        
    def build_vocab(self):        
        self.add_word(self.start_word)
        self.add_word(self.end_word)
        self.add_word(self.unk_word)
        self.add_captions()

    def add_word(self, word):        
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def add_captions(self):        
        coco = COCO(self.annotations_file)
        counter = Counter()
        ids = coco.anns.keys()
        for i, id in enumerate(ids):
            caption = str(coco.anns[id]['caption'])
            tokens = nltk.tokenize.word_tokenize(caption.lower())
            counter.update(tokens)

            if i % 100000 == 0:
                print("[%d/%d] Tokenizing captions..." % (i, len(ids)))

        words = [word for word, cnt in counter.items() if cnt >= self.vocab_threshold]

        for word in words:
            self.add_word(word)

    def __call__(self, word):
        return self.word2idx.get(word, self.word2idx[self.unk_word])

    def __len__(self):
        return len(self.word2idx)

class CoCoDataset(data.Dataset):
    
    def __init__(self, transform, mode, batch_size, vocab_threshold, vocab_file, start_word, 
        end_word, unk_word, annotations_file, vocab_from_file, img_folder):
        self.transform = transform
        self.mode = mode
        self.batch_size = batch_size
        self.vocab = Vocabulary(vocab_threshold, vocab_file, start_word,
            end_word, unk_word, annotations_file, vocab_from_file)
        self.img_folder = img_folder

        if self.mode == 'train':
            self.coco = COCO(annotations_file)
            self.ids = list(self.coco.anns.keys())
            # Precompute the length of every caption so batches of same-length captions can be drawn.
            all_tokens = [nltk.tokenize.word_tokenize(str(self.coco.anns[self.ids[index]]['caption']).lower()) for index in (np.arange(len(self.ids)))]
            self.caption_lengths = [len(token) for token in all_tokens]
        else:
            # image_info_test2014.json contains no caption annotations
            test_info = json.loads(open(annotations_file).read())
            self.paths = [item['file_name'] for item in test_info['images']]
        
    def __getitem__(self, index):
        # Training mode
        if self.mode == 'train':
            ann_id = self.ids[index]
            caption = self.coco.anns[ann_id]['caption']
            img_id = self.coco.anns[ann_id]['image_id']
            path = self.coco.loadImgs(img_id)[0]['file_name']

            # (1) Preprocess the PIL image through the transform pipeline (includes augmentation)
            image = Image.open(os.path.join(self.img_folder, path)).convert('RGB')
            image = self.transform(image)

            # (2) Vectorize the caption
            tokens = nltk.tokenize.word_tokenize(str(caption).lower())
            caption = []
            caption.append(self.vocab(self.vocab.start_word))
            caption.extend([self.vocab(token) for token in tokens])
            caption.append(self.vocab(self.vocab.end_word))
            caption = torch.Tensor(caption).long()

            return image, caption

        # Test mode
        else:
            path = self.paths[index]

            PIL_image = Image.open(os.path.join(self.img_folder, path)).convert('RGB')
            orig_image = np.array(PIL_image)
            image = self.transform(PIL_image)

            return orig_image, image

    def get_train_indices(self):
        sel_length = np.random.choice(self.caption_lengths)
        all_indices = np.where([self.caption_lengths[i] == sel_length for i in np.arange(len(self.caption_lengths))])[0]
        indices = list(np.random.choice(all_indices, size=self.batch_size))
        return indices

    def __len__(self):
        if self.mode == 'train':
            return len(self.ids)
        else:
            return len(self.paths)
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          
    transforms.RandomCrop(224),                      
    transforms.RandomHorizontalFlip(),               
    transforms.ToTensor(),                           
    transforms.Normalize((0.485, 0.456, 0.406),      
                         (0.229, 0.224, 0.225))])
vocab_threshold = 5
train_batch_size = 64
data_dir='./'
train_img_folder = data_dir + 'mscoco/train2014/'
train_annotations_file = data_dir + 'mscoco/annotations/captions_train2014.json'

train_dataset = CoCoDataset(transform=transform_train,
                      mode='train',
                      batch_size=train_batch_size,
                      vocab_threshold=vocab_threshold,
                      vocab_file='./vocab.pkl',
                      start_word="<start>",
                      end_word="<end>",
                      unk_word="<unk>",
                      annotations_file=train_annotations_file,
                      vocab_from_file=False,
                      img_folder=train_img_folder)
loading annotations into memory...
Done (t=1.59s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=1.15s)
creating index...
index created!
transform_test = transforms.Compose([ 
    transforms.Resize(256),                          
    transforms.RandomCrop(224),                      
    transforms.RandomHorizontalFlip(),               
    transforms.ToTensor(),                           
    transforms.Normalize((0.485, 0.456, 0.406),      
                         (0.229, 0.224, 0.225))])
test_batch_size = 1
test_img_folder = data_dir + 'mscoco/test2014/'
test_annotations_file = data_dir + 'mscoco/annotations/image_info_test2014.json'

test_dataset = CoCoDataset(transform=transform_test,
                      mode='test',
                      batch_size=test_batch_size,
                      vocab_threshold=vocab_threshold,
                      vocab_file='./vocab.pkl',
                      start_word="<start>",
                      end_word="<end>",
                      unk_word="<unk>",
                      annotations_file=test_annotations_file,
                      vocab_from_file=True,
                      img_folder=test_img_folder)
Vocabulary successfully loaded from vocab.pkl file!
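
As a quick sanity check (an illustrative sketch; the ids printed for ordinary words depend on the caption corpus), the vocabulary maps tokens to integer ids, with the special tokens registered first and unknown words falling back to <unk>:

vocab = train_dataset.vocab

# The special tokens were added first, so they get the lowest ids.
print(vocab('<start>'), vocab('<end>'), vocab('<unk>'))  # 0 1 2

# Tokenize a caption and convert it to an index sequence,
# exactly as CoCoDataset.__getitem__ does during training.
tokens = nltk.tokenize.word_tokenize('a dog plays with a ball'.lower())
print([vocab('<start>')] + [vocab(t) for t in tokens] + [vocab('<end>')])

# A word that never reached the frequency threshold maps to <unk> instead of raising a KeyError.
print(vocab('xylophonist'))  # 2 if the word is out of vocabulary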

3. Building the Data Loaders

indices = train_dataset.get_train_indices()
initial_sampler = data.sampler.SubsetRandomSampler(indices=indices)
batch_sampler = data.sampler.BatchSampler(sampler=initial_sampler, batch_size=train_dataset.batch_size, drop_last=False)

train_dataloader = data.DataLoader(dataset=train_dataset, batch_sampler=batch_sampler, num_workers=2)

test_dataloader = data.DataLoader(dataset=test_dataset, batch_size=test_dataset.batch_size,
                                  shuffle=True, num_workers=2)
batch = next(iter(train_dataloader))
batch[0].size(), batch[1].size() # image, caption
(torch.Size([64, 3, 224, 224]), torch.Size([64, 10]))
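
Note that the caption tensor comes out as a clean (64, 10) block with no padding: get_train_indices first samples one caption length and then draws a whole batch of captions with exactly that tokenized length, so every caption stacks to the same size (its token count plus the <start> and <end> tokens) and no custom collate_fn is needed. A minimal check (illustrative sketch):

# Every index drawn by get_train_indices points to a caption of the same tokenized length.
indices = train_dataset.get_train_indices()
print({train_dataset.caption_lengths[i] for i in indices})  # a single length, e.g. {8}
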
batch = next(iter(test_dataloader))
batch[0].size(), batch[1].size() # orig_image, image
(torch.Size([1, 427, 640, 3]), torch.Size([1, 3, 224, 224]))
# Using features from a pretrained ResNet
# https://stackoverflow.com/questions/52548174/how-to-remove-the-last-fc-layer-from-a-resnet-model-in-pytorch
# https://discuss.pytorch.org/t/how-can-l-use-the-pre-trained-resnet-to-extract-feautres-from-my-own-dataset/9008

4. Building the Model

[Figure: Image2Caption architecture]

(1) CNN Encoder

batch = next(iter(train_dataloader))
batch[0].size(), batch[1].size() # image, caption
(torch.Size([64, 3, 224, 224]), torch.Size([64, 10]))
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
modules = list(resnet.children())[:-1]
resnet = nn.Sequential(*modules)
features = resnet(batch[0])
features
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 332MB/s]





tensor([[[[0.3006]],

         [[1.3202]],

         [[0.5919]],

         ...,

         [[0.3854]],

         [[0.3244]],

         [[0.4089]]],


        [[[0.1240]],

         [[0.3033]],

         [[0.3624]],

         ...,

         [[0.7815]],

         [[0.7007]],

         [[0.1543]]],


        [[[0.1694]],

         [[2.3620]],

         [[1.7347]],

         ...,

         [[0.3217]],

         [[0.0813]],

         [[0.1497]]],


        ...,


        [[[0.5665]],

         [[1.1547]],

         [[1.0198]],

         ...,

         [[0.1258]],

         [[0.0584]],

         [[0.2763]]],


        [[[1.0296]],

         [[0.1340]],

         [[0.3449]],

         ...,

         [[0.0222]],

         [[0.1709]],

         [[0.2121]]],


        [[[0.7039]],

         [[0.1935]],

         [[0.1769]],

         ...,

         [[0.1285]],

         [[0.1630]],

         [[0.5854]]]], grad_fn=<MeanBackward1>)
features.shape
torch.Size([64, 2048, 1, 1])
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc.in_features
2048
class Encoder(nn.Module):
    def __init__(self, wordvec_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        for param in resnet.parameters():
            param.requires_grad_(False)
        
        modules = list(resnet.children())[:-1]
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, wordvec_size)

    def forward(self, images): # images shape : (batch_size=64, channels=3, h=224, w=224)
        features = self.resnet(images) # features shape : (batch_size=64, feature_maps= 2048, h=1, w=1)
        features = features.view(features.size(0), -1) # features shape : (batch_size=64, feature_size=2048)
        emb_features = self.embed(features) # emb_features shape : (batch_size=64, wordvec_size=256)
        return emb_features
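
A quick shape check (a minimal sketch; it reuses the training batch loaded above, and the ResNet-50 weights are already cached) confirms that the frozen backbone plus the linear projection turns a batch of images into word-vector-sized embeddings:

encoder = Encoder(wordvec_size=256)
with torch.no_grad():                 # the backbone is frozen; no gradients needed here
    emb_features = encoder(batch[0])  # batch[0] shape: (64, 3, 224, 224)
print(emb_features.shape)             # expected: torch.Size([64, 256])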

(2) RNN Decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, wordvec_size, hidden_size, num_layers=1):
      
        super().__init__()
        self.embed = nn.Embedding(vocab_size, wordvec_size)

        self.lstm = nn.LSTM( input_size = wordvec_size, 
                             hidden_size = hidden_size, 
                             num_layers = num_layers, 
                             batch_first=True
                           )
        
        self.linear_fc = nn.Linear(hidden_size, vocab_size)

    
    def forward(self, captions, emb_features): # captions shape : (batch_size=64, cap_length=n-1)
                                               # emb_features shape : (batch_size=64, wordvec_size=256)
        embed = self.embed(captions)           # embed shape : (batch_size=64, caption_length=n-1, wordvec_size=256)
        embed_features = emb_features.unsqueeze(1) # emb_features shape : (batch_size=64, 1, wordvec_size=256)
        decoder_in = torch.cat((embed_features, embed), dim=1) # decoder_in shape : (batch_size=64, caption_length=n, wordvec_size=256)
        outputs, _ = self.lstm(decoder_in) # outputs shape : (batch_size=64, caption_length=n, hidden_size=512 )
        outputs = self.linear_fc(outputs) # outputs shape : (batch_size=64, caption_length=n, vocab_size=8852)
        return outputs

    
    def generate(self, emb_input, states=None, sample_size=20): # emb_input : (batch_size=1, wordvec_size=256)
        outputs = []
        outputs_length = 0
        emb_input = emb_input.unsqueeze(1) # emb_input : (batch_size=1, 1, wordvec_size=256)
        while outputs_length != sample_size + 1:  # sample at most sample_size + 1 tokens (21 by default)
            output, states = self.lstm(emb_input, states) # states : (h, c)
                                                          # output shape : (batch_size=1, 1, hidden_size=512)
            output = output.squeeze(1)   # output shape : (batch_size=1, hidden_size=512)                                                       
            output = self.linear_fc(output) # output shape : (batch_size=1, vocab_size=8852)
            _, predicted_index = torch.max(output, 1)

            outputs.append(predicted_index.cpu().numpy()[0])

            if predicted_index == 1:  # index 1 is <end>; stop early once it is generated
                break

            emb_input = self.embed(predicted_index) # emb_input : (batch_size=1, wordvec_size=256)
            emb_input = emb_input.unsqueeze(1) # emb_input : (batch_size=1, 1, wordvec_size=256)

            outputs_length += 1
        return outputs
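
Before wiring the encoder and decoder together, a small shape walkthrough with dummy tensors (an illustrative sketch; only the vocabulary size 8852 comes from the dataset built above) shows how teacher forcing lines up: the image embedding is prepended as the first LSTM step, so a caption truncated to n-1 tokens produces n output steps, one logit vector per target token.

decoder = Decoder(vocab_size=8852, wordvec_size=256, hidden_size=512)
dummy_captions = torch.randint(0, 8852, (64, 9))  # targets[:, :-1], i.e. caption length n-1 = 9
dummy_features = torch.randn(64, 256)             # what the Encoder would produce
out = decoder(dummy_captions, dummy_features)
print(out.shape)  # torch.Size([64, 10, 8852]) -- one prediction per target token
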
class Image2Caption(nn.Module):
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        
        super().__init__()
        self.encoder = Encoder(wordvec_size)
        self.decoder = Decoder(vocab_size, wordvec_size, hidden_size)

    def forward(self, inputs, targets): # inputs shape : (batch_size=64, img_size=3x224x224)
                                        # targets shape : (batch_size=64, caption_length=n)
        decoder_in = targets[:, :-1]    # decoder_in shape : (batch_size=64, caption_length=n-1)      
        emb_features = self.encoder(inputs) # emb_features shape : (batch_size=64, wordvec_size=256)
        out = self.decoder(decoder_in, emb_features) # out shape : (batch_size=64, caption_length=n, vocab_size=8852)
        return out

    def generate(self, inputs, sample_size=20): # inputs shape : (batch_size=1, img_size=3x224x224)
        emb_features = self.encoder(inputs) # emb_features : (batch_size=1, wordvec_size=256)
        sampled = self.decoder.generate(emb_features, states=None, sample_size=sample_size)                
        return sampled

Hyperparameter settings

len(train_dataset.vocab)
8852
vocab_size = len(train_dataset.vocab)
wordvec_size = 256
hidden_size = 512
learning_rate = 0.001
num_epochs = 1  
print_every = 200   
model = Image2Caption(vocab_size=vocab_size,    
                wordvec_size = wordvec_size, 
                hidden_size = hidden_size)
model = model.to(device)
batch = next(iter(train_dataloader))
out = model(batch[0].to(device), batch[1].to(device))
out.shape
torch.Size([64, 10, 8852])

5. Model Setup

train_step = len(train_dataset) // train_batch_size
train_step
6470
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                           mode='min', factor=0.4,
                                           patience=3, verbose=True)
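
The training loop below flattens the (batch, length, vocab) logits and the (batch, length) targets before handing them to CrossEntropyLoss, and reports perplexity as the exponential of that mean cross-entropy. A minimal sketch of the reshaping with dummy tensors (illustrative sizes):

logits = torch.randn(64, 10, vocab_size)           # what the model returns
targets = torch.randint(0, vocab_size, (64, 10))   # ground-truth caption indices
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))  # mean over all 640 token positions
print(loss.item(), np.exp(loss.item()))            # cross-entropy and the corresponding perplexity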

6. Training the Model

def train_loop(model, loss_fn, epochs, optimizer):  

    for epoch in range(epochs):
        model.train()
        for step in range(1, train_step+1):        
            indices = train_dataset.get_train_indices()
            initial_sampler = data.sampler.SubsetRandomSampler(indices=indices)
            batch_sampler = data.sampler.BatchSampler(sampler=initial_sampler,
                                                      batch_size=train_batch_size,
                                                      drop_last=False)
            train_dataloader = data.DataLoader(dataset=train_dataset, num_workers=2,
                                               batch_sampler=batch_sampler)
            
            images, captions = next(iter(train_dataloader))

            images = images.to(device)
            captions = captions.to(device)
                        
            model.zero_grad()
            outputs = model(images, captions)                    
            loss = loss_fn(outputs.view(-1, vocab_size), captions.view(-1))            
            loss.backward()     
            optimizer.step()  
            if step % print_every == 0:               
                print("Epoch: {}/{}, Step: {}/{}, Train Loss : {:.4f}, Perplexity : {:.2f}".format(                    
                        epoch + 1, epochs,
                        step, train_step, 
                        loss.item(),
                        np.exp(loss.item())                        
                ))  
                
        # Training takes too long to run several epochs, so save the model after this single epoch.
        last_model_state = deepcopy(model.state_dict())          
        torch.save(last_model_state, 'last_checkpoint.pth')        
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
%time train_loop(model, loss_fn, num_epochs, optimizer)
Epoch: 1/1, Step: 200/6470, Train Loss : 3.7343, Perplexity : 41.86
Epoch: 1/1, Step: 400/6470, Train Loss : 2.8933, Perplexity : 18.05
Epoch: 1/1, Step: 600/6470, Train Loss : 3.4233, Perplexity : 30.67
Epoch: 1/1, Step: 800/6470, Train Loss : 3.0874, Perplexity : 21.92
Epoch: 1/1, Step: 1000/6470, Train Loss : 2.6325, Perplexity : 13.91
Epoch: 1/1, Step: 1200/6470, Train Loss : 2.6859, Perplexity : 14.67
Epoch: 1/1, Step: 1400/6470, Train Loss : 2.6274, Perplexity : 13.84
Epoch: 1/1, Step: 1600/6470, Train Loss : 2.6551, Perplexity : 14.23
Epoch: 1/1, Step: 1800/6470, Train Loss : 2.7334, Perplexity : 15.39
Epoch: 1/1, Step: 2000/6470, Train Loss : 2.5037, Perplexity : 12.23
Epoch: 1/1, Step: 2200/6470, Train Loss : 2.5708, Perplexity : 13.08
Epoch: 1/1, Step: 2400/6470, Train Loss : 2.4916, Perplexity : 12.08
Epoch: 1/1, Step: 2600/6470, Train Loss : 2.2083, Perplexity : 9.10
Epoch: 1/1, Step: 2800/6470, Train Loss : 2.4470, Perplexity : 11.55
Epoch: 1/1, Step: 3000/6470, Train Loss : 2.4337, Perplexity : 11.40
Epoch: 1/1, Step: 3200/6470, Train Loss : 2.3103, Perplexity : 10.08
Epoch: 1/1, Step: 3400/6470, Train Loss : 2.3206, Perplexity : 10.18
Epoch: 1/1, Step: 3600/6470, Train Loss : 2.2548, Perplexity : 9.53
Epoch: 1/1, Step: 3800/6470, Train Loss : 2.3238, Perplexity : 10.21
Epoch: 1/1, Step: 4000/6470, Train Loss : 2.1114, Perplexity : 8.26
Epoch: 1/1, Step: 4200/6470, Train Loss : 2.2432, Perplexity : 9.42
Epoch: 1/1, Step: 4400/6470, Train Loss : 2.6744, Perplexity : 14.50
Epoch: 1/1, Step: 4600/6470, Train Loss : 2.1835, Perplexity : 8.88
Epoch: 1/1, Step: 4800/6470, Train Loss : 2.0635, Perplexity : 7.87
Epoch: 1/1, Step: 5000/6470, Train Loss : 2.2538, Perplexity : 9.52
Epoch: 1/1, Step: 5200/6470, Train Loss : 2.1247, Perplexity : 8.37
Epoch: 1/1, Step: 5400/6470, Train Loss : 2.3608, Perplexity : 10.60
Epoch: 1/1, Step: 5600/6470, Train Loss : 2.3226, Perplexity : 10.20
Epoch: 1/1, Step: 5800/6470, Train Loss : 2.2668, Perplexity : 9.65
Epoch: 1/1, Step: 6000/6470, Train Loss : 2.0447, Perplexity : 7.73
Epoch: 1/1, Step: 6200/6470, Train Loss : 2.1824, Perplexity : 8.87
Epoch: 1/1, Step: 6400/6470, Train Loss : 2.1658, Perplexity : 8.72
CPU times: user 37min 13s, sys: 57min 43s, total: 1h 34min 56s
Wall time: 3h 41min 36s
# After training, copy the saved checkpoint to Google Drive
!mkdir /content/drive/MyDrive/img2cap_model
!cp last_checkpoint.pth /content/drive/MyDrive/img2cap_model
# state_dict = torch.load('last_checkpoint.pth')
# model.load_state_dict(state_dict)
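
In a fresh session the checkpoint copied to Drive can be restored the same way (a sketch; the path simply mirrors the copy destination above):

state_dict = torch.load('/content/drive/MyDrive/img2cap_model/last_checkpoint.pth', map_location=device)
model.load_state_dict(state_dict)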

7. Model Evaluation
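
The original notebook leaves this section empty. A common quantitative check for captioning is corpus BLEU against the human captions of the validation split. The sketch below is hypothetical: it assumes the val2014 images (whose download is commented out above) have been extracted to ./mscoco/val2014/ alongside the already-extracted captions_val2014.json, and it scores greedy captions from the trained model on a small subset using NLTK's BLEU implementation:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

val_coco = COCO(data_dir + 'mscoco/annotations/captions_val2014.json')
val_img_folder = data_dir + 'mscoco/val2014/'

model.eval()
references, hypotheses = [], []
with torch.no_grad():
    for img_id in list(val_coco.imgs.keys())[:100]:   # small subset to keep this quick
        # All human captions of the image serve as BLEU references.
        refs = [nltk.tokenize.word_tokenize(str(ann['caption']).lower())
                for ann in val_coco.imgToAnns[img_id]]

        # Generate a caption for the image with the trained model.
        file_name = val_coco.loadImgs(img_id)[0]['file_name']
        img = Image.open(os.path.join(val_img_folder, file_name)).convert('RGB')
        img = transform_test(img).unsqueeze(0).to(device)
        output = model.generate(img)
        hyp = [train_dataset.vocab.idx2word[idx] for idx in output
               if train_dataset.vocab.idx2word[idx] not in ('<start>', '<end>')]

        references.append(refs)
        hypotheses.append(hyp)

print('BLEU-4:', corpus_bleu(references, hypotheses,
                             smoothing_function=SmoothingFunction().method1))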

8. Model Prediction

orig_image, image = next(iter(test_dataloader))

# Visualize sample image, before pre-processing.
plt.imshow(np.squeeze(orig_image))
plt.title('example image')
plt.show()

png

image = image.to(device)
model.eval()
output=model.generate(image)
output
[0, 3, 169, 130, 170, 364, 161, 3, 1577, 18, 1]
def clean_sentence(output):

    # Map each predicted index back to its word.
    list_words = []
    for idx in output:
        list_words.append(test_dataloader.dataset.vocab.idx2word[idx])

    list_words = list_words[1:-1]   # drop the leading/trailing special tokens (<start>, <end>)
    sentence = ' '.join(list_words)
    sentence = sentence.capitalize()

    return sentence
sentence = clean_sentence(output)
print('example sentence:', sentence)
example sentence: A man is standing next to a elephant .
def get_prediction():
    orig_image, image = next(iter(test_dataloader))
    plt.imshow(np.squeeze(orig_image))
    plt.title('Sample Image')
    plt.show()
    image = image.to(device)
    model.eval()
    output=model.generate(image)
    sentence = clean_sentence(output)
    print(sentence)
get_prediction()

png

A man riding a motorcycle down a road .
get_prediction()

png

A bathroom with a sink and a toilet in it .
get_prediction()

png

A computer desk with a keyboard and a monitor .
get_prediction()

png

A young boy holding a baseball bat on a field .
get_prediction()

png

A horse standing in a field next to a fence .
get_prediction()

png

A baseball player swinging a bat on a field .
get_prediction()

png

A man on a snowboard in the snow .

Reference

  • Udacity - Computer Vision NanoDegree
