Generating Tags for Animeloop

illustration2vec Oct 28, 2017

As of now, animeloop.org holds 410 series, 4153 episodes, and 174905 loops. That is a lot of data, but it is all still raw: none of it has any tags attached.

This time I am using the illustration2vec project to generate tags for each loop, so that this large body of data can be linked together and categorized.

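illustration2vec has not been maintained for quite a while, so it does not run out of the box and a few lines of code need fixing. I forked it and applied the fixes for use in the meantime. Repository: https://github.com/moeoverflow/illustration2vec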

Raw data produced by running the program:

[{'character': [(u'hatsune miku', 0.9999994039535522)],
  'copyright': [(u'vocaloid', 0.9999998807907104)],
  'general': [(u'thighhighs', 0.9956372380256653),
   (u'1girl', 0.9873462319374084),
   (u'twintails', 0.9812833666801453),
   (u'solo', 0.9632901549339294),
   (u'aqua hair', 0.9167950749397278),
   (u'long hair', 0.8817108273506165),
   (u'very long hair', 0.8326570987701416),
   (u'detached sleeves', 0.7448858618736267),
   (u'skirt', 0.6780789494514465),
   (u'necktie', 0.5608364939689636),
   (u'aqua eyes', 0.5527772307395935)],
  'rating': [(u'safe', 0.9785731434822083),
   (u'questionable', 0.020535090938210487),
   (u'explicit', 0.0006299660308286548)]}]

As you can see, the recognized information falls roughly into these categories:

  • character
  • copyright
  • general
  • rating -> safe (safe, questionable, explicit)
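
To make the shape of these categories concrete, here is a minimal sketch that reads a result dictionary like the one above. The helper names `most_likely_rating` and `confident_tags` are hypothetical and not part of illustration2vec. The sample data is a shortened copy of the raw output.

```python
def most_likely_rating(result):
    """Return the (label, confidence) pair with the highest confidence."""
    return max(result["rating"], key=lambda pair: pair[1])


def confident_tags(result, category, threshold):
    """Return the tag names in one category at or above a confidence threshold."""
    return [value for value, confidence in result[category] if confidence >= threshold]


# Shortened copy of the raw data shown above.
sample = {
    "character": [("hatsune miku", 0.9999994039535522)],
    "copyright": [("vocaloid", 0.9999998807907104)],
    "general": [
        ("thighhighs", 0.9956372380256653),
        ("twintails", 0.9812833666801453),
        ("skirt", 0.6780789494514465),
        ("necktie", 0.5608364939689636),
    ],
    "rating": [
        ("safe", 0.9785731434822083),
        ("questionable", 0.020535090938210487),
        ("explicit", 0.0006299660308286548),
    ],
}

print(most_likely_rating(sample)[0])
print(confident_tags(sample, "general", 0.9))
```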

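When saving to the database, the nested result can be flattened (as in a flatMap) by adding one extra column, so that each (type, value) pair becomes its own row. The database schema: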

loopid    type    value   confidence  source  lang
ObjectId  String  String  Number      String  String
{
    _id: ObjectId,
    loopid: ObjectId,
    type: String, // ['character', 'copyright', 'general', 'safe']
    value: String,
    confidence: Number, //Double
    source: String, // default: 'illustration2vec'
    lang: String, // default: 'en'
}
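
The flattening step can be sketched in isolation, without a database. This is an illustration under assumptions: `flatten_tags` is a hypothetical helper name, and the loop ID is passed as a plain string here, whereas the schema above stores an ObjectId.

```python
def flatten_tags(result, loopid, source="illustration2vec", lang="en"):
    """Turn one nested i2v result into a flat list of tag documents."""
    docs = []
    for category, pairs in result.items():
        # The schema stores the 'rating' category under the type 'safe'.
        tag_type = "safe" if category == "rating" else category
        for value, confidence in pairs:
            docs.append({
                "loopid": loopid,
                "type": tag_type,
                "value": value,
                "confidence": confidence,
                "source": source,
                "lang": lang,
            })
    return docs


# Shortened copy of the raw data shown earlier.
sample = {
    "character": [("hatsune miku", 0.9999994039535522)],
    "copyright": [("vocaloid", 0.9999998807907104)],
    "general": [("thighhighs", 0.9956372380256653), ("1girl", 0.9873462319374084)],
    "rating": [("safe", 0.9785731434822083)],
}

for doc in flatten_tags(sample, loopid="example-loop-id"):
    print(doc["type"], doc["value"], doc["confidence"])
```

Each document has the same set of fields, so tags of every category can live in one collection and be queried by `type`.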

A simple script [1]:

import os
import sys
import logging
from tqdm import tqdm
from illustration2vec import i2v
from PIL import Image
from pymongo import MongoClient
from bson.objectid import ObjectId

from config import config

# Configure logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger('illus2vec')

IMAGES_PATH = config['images_path']

# Initialize the database connection
logger.info('Connecting to database...')
client = MongoClient("localhost", 27017)
db = client.animeloop_tags

# Load the chainer model
logger.info('Loading chainer models...')
illust2vec = i2v.make_i2v_with_chainer(
    config['caffemodel'], config['tag_list'])


# Save the tags estimated from an image file into the database
def to_tags(filename, loopid):
    image = Image.open(filename)
    result = illust2vec.estimate_plausible_tags([image], threshold=0.5)[0]

    tag_schema = {
        'loopid': ObjectId(loopid),
        'source': 'illustration2vec',
        'lang': 'en'
    }

    # Load this loop's existing tags into memory once, so the duplicate
    # check below does not query the database for every tag
    saved_tags = list(db.tags.find({'loopid': ObjectId(loopid)}))

    def exist_in_tagslist(loopid, tag_type, value):
        for t in saved_tags:
            if str(t['loopid']) == loopid and t['type'] == tag_type and t['value'] == value:
                return True
        return False

    for key in result.keys():
        for item in result[key]:
            tag = tag_schema.copy()
            # The 'rating' category is stored under the type 'safe'.
            # Compare strings with ==; `is` tests identity, not equality
            if key == 'rating':
                tag['type'] = 'safe'
            else:
                tag['type'] = key
            tag['value'] = item[0]
            tag['confidence'] = item[1]

            # Avoid saving duplicate data
            if not exist_in_tagslist(loopid, tag['type'], tag['value']):
                db.tags.insert_one(tag)

    # Record that this loop has been processed
    db.tagscheck.insert_one({'loopid': ObjectId(loopid)})


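# Load the IDs of already-processed loops into a set for fast membership
# tests (on Python 3, a bare map() would be exhausted after the first check)
saved_tagscheck = set(str(tc['loopid']) for tc in db.tagscheck.find({}))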

logger.info('Loading files list...')
files = os.listdir(IMAGES_PATH)

logger.info('Estimating tags')
progress_bar = tqdm(files, ascii=True, dynamic_ncols=True, bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt}')
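for file in progress_bar:
    # os.listdir returns bare names, so join with IMAGES_PATH before testing
    filename = os.path.join(IMAGES_PATH, file)
    if os.path.isdir(filename):
        continue

    # The loop ID is the file name without its extension
    loopid, ext = os.path.splitext(file)
    if ext not in ('.jpg', '.png'):
        continue

    # Skip loops that were already processed in an earlier run
    if loopid not in saved_tagscheck:
        to_tags(filename, loopid)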

Given how large the dataset is, the generation program should use a task queue and run multiple tasks concurrently to improve overall throughput.
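
A task queue with a fixed pool of workers might look roughly like the sketch below, built on the standard library's `queue` and `threading` modules. This is not the code used for Animeloop. `estimate_tags` is a hypothetical stand-in for the real i2v call, and `run_task_queue` and `worker_count` are names made up for the illustration.

```python
import queue
import threading


def estimate_tags(filename):
    """Hypothetical stand-in for the real i2v call; returns a fake result."""
    return {"file": filename, "general": [("placeholder", 1.0)]}


def run_task_queue(filenames, worker_count=4):
    """Process filenames with a fixed pool of worker threads."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            filename = tasks.get()
            if filename is None:  # sentinel: no more work for this worker
                tasks.task_done()
                break
            try:
                tags = estimate_tags(filename)
                with lock:  # protect the shared results dict
                    results[filename] = tags
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for filename in filenames:
        tasks.put(filename)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results


results = run_task_queue(["a.jpg", "b.png", "c.jpg"], worker_count=2)
print(sorted(results))
```

Whether threads actually speed things up depends on where the time goes: they help most when workers spend their time waiting on I/O.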

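Initially I was concerned about file I/O, but when actually running the program it turned out that most of the time is spent in i2v's tag estimation, so I did not pursue adding multithreading.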
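
One simple way to check where the time goes is to accumulate wall-clock time per stage. The sketch below assumes two stages, loading and estimating. `StageTimer`, `fake_load`, and `fake_estimate` are hypothetical names, and the sleep only simulates a slow estimation step. It is not a measurement of the real program.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulate wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    def measure(self, stage, func, *args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.totals[stage] += time.perf_counter() - start

    def slowest(self):
        return max(self.totals, key=self.totals.get)


def fake_load(filename):
    """Hypothetical stand-in for opening an image file."""
    return len(filename)


def fake_estimate(image):
    """Hypothetical stand-in for tag estimation; sleeps to simulate heavy work."""
    time.sleep(0.01)
    return {"general": []}


timer = StageTimer()
for name in ["a.jpg", "b.jpg", "c.jpg"]:
    image = timer.measure("load", fake_load, name)
    timer.measure("estimate", fake_estimate, image)

for stage, seconds in timer.totals.items():
    print(stage, round(seconds, 4))
print("slowest stage:", timer.slowest())
```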

Finally, the results as displayed on Animeloop:

Animeloop illustration2vec tags


  1. https://github.com/moeoverflow/animeloop-illus2vec/blob/master/main.py
