为 Animeloop 生成标签

截止目前 animeloop.org 已经有 410 series, 4153 episodes, 174905 loops 这么大量的数据了。但是这些数据都还只是原始数据,没有任何的标签记录。

这次借助 illustration2vec 这个项目来给每个 loop 生成一些标签,让这些大量的数据直接相互联系、进行分类。

illustration2vec 目前由于很长一段时间没维护了,直接运行起来有点问题,有些许代码行段落需要 fix 一些,所以这里就直接 fork 了一份来修复暂时用着,项目代码仓库:https://github.com/moeoverflow/illustration2vec

运行程序得到的原始数据(Raw Data):

[{'character': [(u'hatsune miku', 0.9999994039535522)],
  'copyright': [(u'vocaloid', 0.9999998807907104)],
  'general': [(u'thighhighs', 0.9956372380256653),
   (u'1girl', 0.9873462319374084),
   (u'twintails', 0.9812833666801453),
   (u'solo', 0.9632901549339294),
   (u'aqua hair', 0.9167950749397278),
   (u'long hair', 0.8817108273506165),
   (u'very long hair', 0.8326570987701416),
   (u'detached sleeves', 0.7448858618736267),
   (u'skirt', 0.6780789494514465),
   (u'necktie', 0.5608364939689636),
   (u'aqua eyes', 0.5527772307395935)],
  'rating': [(u'safe', 0.9785731434822083),
   (u'questionable', 0.020535090938210487),
   (u'explicit', 0.0006299660308286548)]}]

可以看到大致上能识别出的信息有

  • character
  • copyright
  • general
  • rating -> safe(safe, question, explicit)

在存入数据库的时候可以通过多添加一个数据列来展平数据(Flap Map),数据库 Schema:

loopid type value confidence source lang
ObjectId String String String String String
{
    _id: ObjectId,
    loopid: ObjectId,
    type: String, // ['character', 'copyright', 'general', 'safe']
    value: String,
    confidence: Number, //Double
    source: String, // default: 'illustration2vec'
    lang: String, // default: 'en'
}

简单脚本 [1]

import os
import sys
import logging
from tqdm import tqdm
from illustration2vec import i2v
from PIL import Image
from pymongo import MongoClient
from bson.objectid import ObjectId

from config import config

# Logger configure
logging.getLogger('illus2vec')
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger()

IMAGES_PATH = config['images_path']

# Database initial
logger.info('Connecting to database...')
client = MongoClient("localhost", 27017)
db = client.animeloop_tags

# chainer models initial
logger.info('Loading chainer models...')
illust2vec = i2v.make_i2v_with_chainer(
    config['caffemodel'], config['tag_list'])


# save tags estimated from image file into dababase
def to_tags(filename, loopid):
    image = Image.open(filename)
    result = illust2vec.estimate_plausible_tags([image], threshold=0.5)[0]

    tag_shcema = {
        'loopid': ObjectId(loopid),
        'source': 'illustration2vec',
        'lang': 'en'
    }

    # Extract tags from database to memory
    # for performance optimization
    saved_tags = list(db.tags.find({'loopid': ObjectId(loopid)}))

    def exist_in_tagslist(loopid, type, value):
        for t in saved_tags:
            if str(t['loopid']) == loopid and t['type'] == type and t['value'] == value:
                return True
        return False

    for key in result.keys():
        for item in result[key]:
            tag = tag_shcema.copy()
            if key is 'rating':
                tag['type'] = 'safe'
            else:
                tag['type'] = key
            tag['value'] = item[0]
            tag['confidence'] = item[1]

            # Avoid saving duplicate data
            if not exist_in_tagslist(loopid, tag['type'], tag['value']):
                db.tags.insert_one(tag)

    db.tagscheck.insert_one({'loopid': ObjectId(loopid)})


# performance optimization
saved_tagscheck = map(lambda tc: str(tc['loopid']), list(db.tagscheck.find({})))

logger.info('Loading files list...')
files = os.listdir(IMAGES_PATH)

logger.info('Estimating tags')
progress_bar = tqdm(files, ascii=True, dynamic_ncols=True, bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt}')
for file in progress_bar:
    if not os.path.isdir(file):
        filename = IMAGES_PATH + '/' + file
        loopid = os.path.splitext(file)[0]
        ext = os.path.splitext(file)[1]

        if not (ext == '.jpg' or ext == '.png'):
            continue

        if loopid not in saved_tagscheck:
            to_tags(filename, loopid)

考虑到数据的量特别的大,在写程序生成的时候,要使用到任务队列,以多个任务并发运行的方式提高综合速度。

一开始考虑到 File I/O 的问题,实际跑程序的过程发现,大量的时间是消耗在了 i2v 识别 tags 的时候,所以就没有再考虑加入多线程并行。

最后在 Animeloop 上展示出结果:

Animeloop illustration2vec tags


  1. https://github.com/moeoverflow/animeloop-illus2vec/blob/master/main.py ↩︎