K1: Searching TV Programs Using Teletext Subtitles

1. Introduction: Subtitles Search System



A video document retrieval system automatically builds an index from subtitles and then retrieves relevant clips from a large collection of video recordings in response to a user query.


[This project was carried out several years ago by Minhua Eunice Ma. With her permission, I am publishing part of its contents. Her 99-page dissertation cannot fit in one Knol, so I may publish it in parts across several Knols, and may also provide a PDF for download.]
The subject of this Knol is the analysis, design and implementation of a video search engine prototype, the Subtitles Search System. The video documents are TV programs with subtitles.
Searching a database of TV programs for specific content by analysing the images and sounds is not yet practical. It would rely on speech recognition, image recognition and computer vision. In theory this sounds great: video would become as easy to search as Web pages through a search engine. Unfortunately these technologies are still in the experimental stage.
However, many programmes have accompanying TeleText subtitles, and this offers an opportunity to find footage by searching for keywords in the subtitles.
The overall aim of this project is to exploit that opportunity and allow full content search and retrieval of video. The system performs a number of functions. It records TV programs and their TeleText subtitles, using a server with a TV card to capture both. A web-based search engine allows users to type in keywords and view the associated clips.
The whole system is a Java-based hybrid MVC model combining standalone Java programs, servlets, JSP pages and beans with a web interface.
It has a pre-processing section for index construction, which includes functions such as stop-word removal and a lexical stemmer. The search engine ranks results with a statistical model (frequency-weight-adjacency) to increase the accuracy of search. The project shows that statistical methods developed for text retrieval are also effective for retrieving and browsing multimedia documents. Playback is provided through the Microsoft Windows Media Player API.
Keywords: Video document retrieval, information retrieval, TV programmes, subtitles, search engine, multimedia, content-based indexing.

1. Introduction

The Subtitles Search System deals with the problem of finding all the relevant documents in a video collection of TV programs for a given user’s query.

1.1 Motivation

The purpose of a video document retrieval system is to automatically index and then retrieve relevant items from a large collection of video recordings in response to a user query. The video documents will be TV programmes with accompanying subtitles.

Potential users of the system are likely to watch video on their computer instead of on their TV, and to examine individual clips within the program store rather than entire programs. This project is a step towards Video On Demand (VOD) and interactive TV (iTV), where customers can watch whatever video they want, whenever they want. However, VOD and iTV still have a long way to go.

Though devices like the TiVo can record TV for time-shifted viewing, they can only hold 30 hours. The web, especially with a broadband connection, can hold unlimited amounts of video.

To retrieve desired information from such huge video repositories, a search engine is needed. Some current video search web sites are based on descriptions of the videos. Either a trailer of a relevant film or the complete film is played back as a single result. Users cannot skip around in the movie to find the scene they are interested in.

Many television programmes (more than 70%), movies on DVD and pre-recorded video tapes carry specially coded closed-caption subtitles, a feature intended for deaf and hard of hearing users. This makes it feasible to search by keywords in the subtitles without any need for voice or image recognition.

For television programs, subtitles are provided on teletext page 888. They are available on ITV, Channel 4, Channel 5, BBC1 and BBC2. Both programs and their accompanying subtitles can be captured by a TV card installed on a workstation. It is therefore possible to use currently available devices to obtain both the video source and the subtitles, and to develop a web-based search engine for video documents that lets users access specific program clips in response to the keywords they input.

1.2 Objectives

The Subtitles Search System will be expected to meet the following objectives.

Index construction

A main element of this project is the index construction task. It involves a series of pre-processing steps on the subtitles before keywords are written into the database. Keyword matching should be insensitive to case and inflection: changes in a word’s form (mostly its suffix) that mark number, tense, person or mood must not affect searching.

In addition to recording the root of every ‘sensible’ (meaningful) word in the database, the index should be kept as small as possible, to speed up searching and save disk space without impairing search precision.
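
The pre-processing described above can be illustrated with a minimal Java sketch. It is not the dissertation's implementation: the class name SubtitleNormalizer, the tiny stop-word list and the crude suffix-stripping rules are all assumptions made for illustration, and a real system would more likely use an established stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SubtitleNormalizer {
    // Hypothetical, deliberately tiny stop-word list (an assumption, not from the source).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "is", "are", "was", "of", "to", "in", "on"));

    // Crude suffix stripping standing in for a real lexical stemmer.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lower-cases a subtitle line, drops stop words and returns the stemmed terms in order.
    static List<String> indexTerms(String subtitleLine) {
        List<String> terms = new ArrayList<>();
        for (String token : subtitleLine.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }
}
```

With these rules, "Walked", "walking" and "WALKS" all reduce to the same index term, so a query in any of those forms matches the same entries. Dropping stop words before writing to the index is one way to keep it small.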

Block segmentation

To play back the video clip that corresponds to a user’s query, a video file needs to be segmented. The conventional technique for segmenting written text is “chapterization”, which cannot be applied to subtitles. Instead, non-lexical information extracted from the start and end times of dialogues gives an indication of the natural segmentation into blocks.
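
One plausible way to use subtitle timing for segmentation is to start a new block whenever the silence between two consecutive subtitles exceeds a threshold. The sketch below assumes that rule; the source does not state the exact criterion, and the names BlockSegmenter, Cue and maxGapMs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BlockSegmenter {
    // One subtitle entry with its display interval in milliseconds (hypothetical type).
    static final class Cue {
        final long startMs;
        final long endMs;
        final String text;

        Cue(long startMs, long endMs, String text) {
            this.startMs = startMs;
            this.endMs = endMs;
            this.text = text;
        }
    }

    // Groups time-ordered cues into blocks. A new block starts when the gap between
    // the previous cue's end and the next cue's start is larger than maxGapMs.
    static List<List<Cue>> segment(List<Cue> cues, long maxGapMs) {
        List<List<Cue>> blocks = new ArrayList<>();
        List<Cue> current = new ArrayList<>();
        for (Cue cue : cues) {
            if (!current.isEmpty()) {
                Cue previous = current.get(current.size() - 1);
                if (cue.startMs - previous.endMs > maxGapMs) {
                    blocks.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(cue);
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }
}
```

The start time of the first cue in each block then serves as the point from which a matching clip would be played.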

Search for relevant blocks and ranking results

The system must locate information in the video in response to the user’s query and rank the matching results.
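
As a rough illustration of ranking, the Java sketch below scores each block by how often the query terms occur in it and adds a bonus when consecutive query terms appear next to each other. This is a simplified stand-in, not the frequency-weight-adjacency formula of the dissertation, which is not given here. The class BlockRanker and the constant ADJACENCY_BONUS are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BlockRanker {
    // Hypothetical weight given to each pair of query terms found side by side.
    static final double ADJACENCY_BONUS = 2.0;

    // Score = number of query-term occurrences + bonus for each adjacent query pair.
    static double score(List<String> blockTerms, List<String> queryTerms) {
        int frequency = 0;
        for (String term : blockTerms) {
            if (queryTerms.contains(term)) {
                frequency++;
            }
        }
        int adjacentPairs = 0;
        for (int i = 0; i + 1 < blockTerms.size(); i++) {
            for (int q = 0; q + 1 < queryTerms.size(); q++) {
                if (blockTerms.get(i).equals(queryTerms.get(q))
                        && blockTerms.get(i + 1).equals(queryTerms.get(q + 1))) {
                    adjacentPairs++;
                }
            }
        }
        return frequency + ADJACENCY_BONUS * adjacentPairs;
    }

    // Returns a copy of the blocks ordered from highest to lowest score.
    static List<List<String>> rank(List<List<String>> blocks, List<String> queryTerms) {
        List<List<String>> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingDouble(
                (List<String> block) -> score(block, queryTerms)).reversed());
        return sorted;
    }
}
```

Both the block terms and the query terms are assumed to have gone through the same pre-processing (lower-casing, stop-word removal, stemming) before scoring, so that the comparison is case and inflection insensitive.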

Web user interface and playback of related clips only

To make the Subtitles Search System accessible to as many users as possible, it will be accessed through the Internet using a web browser. In addition to good usability of the web interface, playback of selected programme clips must be as fast as possible. Streaming media technology will therefore be used: when the user selects a link on the result page, playback starts from the start time of the block in which the search keyword(s) appear.

To facilitate management of TV programs, a web interface for the system administrator is also provided. The administrator can add programs via the web page.