Oct 28, 2017 3 min read illustration2vec

为 Animeloop 生成标签

截止目前 animeloop.org 已经有 410 series, 4153 episodes, 174905 loops 这么大量的数据了。但是这些数据都还只是原始数据，没有任何的标签记录。

这次借助 illustration2vec 这个项目来给每个 loop 生成一些标签，让这些大量的数据直接相互联系、进行分类。

illustration2vec 目前由于很长一段时间没维护了，直接运行起来有点问题，有些许代码行段落需要 fix 一些，所以这里就直接 fork 了一份来修复暂时用着，项目代码仓库：https://github.com/moeoverflow/illustration2vec。

运行程序得到的原始数据（Raw Data）：

[{'character': [(u'hatsune miku', 0.9999994039535522)],
  'copyright': [(u'vocaloid', 0.9999998807907104)],
  'general': [(u'thighhighs', 0.9956372380256653),
   (u'1girl', 0.9873462319374084),
   (u'twintails', 0.9812833666801453),
   (u'solo', 0.9632901549339294),
   (u'aqua hair', 0.9167950749397278),
   (u'long hair', 0.8817108273506165),
   (u'very long hair', 0.8326570987701416),
   (u'detached sleeves', 0.7448858618736267),
   (u'skirt', 0.6780789494514465),
   (u'necktie', 0.5608364939689636),
   (u'aqua eyes', 0.5527772307395935)],
  'rating': [(u'safe', 0.9785731434822083),
   (u'questionable', 0.020535090938210487),
   (u'explicit', 0.0006299660308286548)]}]

可以看到大致上能识别出的信息有

character
copyright
general
rating -> safe(safe, question, explicit)

在存入数据库的时候可以通过多添加一个数据列来展平数据（Flap Map），数据库 Schema：

loopid	type	value	confidence	source	lang
ObjectId	String	String	String	String	String

{
    _id: ObjectId,
    loopid: ObjectId,
    type: String, // ['character', 'copyright', 'general', 'safe']
    value: String,
    confidence: Number, //Double
    source: String, // default: 'illustration2vec'
    lang: String, // default: 'en'
}

简单脚本 ^[1]：

import os
import sys
import logging
from tqdm import tqdm
from illustration2vec import i2v
from PIL import Image
from pymongo import MongoClient
from bson.objectid import ObjectId

from config import config

# Logger configure
logging.getLogger('illus2vec')
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger()

IMAGES_PATH = config['images_path']

# Database initial
logger.info('Connecting to database...')
client = MongoClient("localhost", 27017)
db = client.animeloop_tags

# chainer models initial
logger.info('Loading chainer models...')
illust2vec = i2v.make_i2v_with_chainer(
    config['caffemodel'], config['tag_list'])


# save tags estimated from image file into dababase
def to_tags(filename, loopid):
    image = Image.open(filename)
    result = illust2vec.estimate_plausible_tags([image], threshold=0.5)[0]

    tag_shcema = {
        'loopid': ObjectId(loopid),
        'source': 'illustration2vec',
        'lang': 'en'
    }

    # Extract tags from database to memory
    # for performance optimization
    saved_tags = list(db.tags.find({'loopid': ObjectId(loopid)}))

    def exist_in_tagslist(loopid, type, value):
        for t in saved_tags:
            if str(t['loopid']) == loopid and t['type'] == type and t['value'] == value:
                return True
        return False

    for key in result.keys():
        for item in result[key]:
            tag = tag_shcema.copy()
            if key is 'rating':
                tag['type'] = 'safe'
            else:
                tag['type'] = key
            tag['value'] = item[0]
            tag['confidence'] = item[1]

            # Avoid saving duplicate data
            if not exist_in_tagslist(loopid, tag['type'], tag['value']):
                db.tags.insert_one(tag)

    db.tagscheck.insert_one({'loopid': ObjectId(loopid)})


# performance optimization
saved_tagscheck = map(lambda tc: str(tc['loopid']), list(db.tagscheck.find({})))

logger.info('Loading files list...')
files = os.listdir(IMAGES_PATH)

logger.info('Estimating tags')
progress_bar = tqdm(files, ascii=True, dynamic_ncols=True, bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt}')
for file in progress_bar:
    if not os.path.isdir(file):
        filename = IMAGES_PATH + '/' + file
        loopid = os.path.splitext(file)[0]
        ext = os.path.splitext(file)[1]

        if not (ext == '.jpg' or ext == '.png'):
            continue

        if loopid not in saved_tagscheck:
            to_tags(filename, loopid)

~~考虑到数据的量特别的大，在写程序生成的时候，要使用到任务队列，以多个任务并发运行的方式提高综合速度。~~

一开始考虑到 File I/O 的问题，实际跑程序的过程发现，大量的时间是消耗在了 i2v 识别 tags 的时候，所以就没有再考虑加入多线程并行。

最后在 Animeloop 上展示出结果：

Animeloop illustration2vec tags

https://github.com/moeoverflow/animeloop-illus2vec/blob/master/main.py ↩︎

You might also like...

Animeloop: animation loop recognition

Animeloop 动漫循环识别方法

Animeloop 项目计划

Animeloop 网页服务器开发细节