K2: Searching TV Programs Using Teletext Subtitles

1.3 Introduction to Subtitles Search System



An introduction to the Subtitles Search System, which involves three main tasks: indexing, searching and retrieval.

1.3.1 Project overview

Because of hardware restrictions, the test data for the system is limited: the current setup allows practical recording, storage and playback of only hundreds of megabytes of multimedia data. The system can, however, be scaled up given sufficient hardware support.

This work is a step towards practical retrieval of multimedia documents, where the information content is obtained from speech recognition performed on the audio soundtrack, together with some image analysis.

Producing such a system involves solving three main tasks: indexing, searching and retrieval.


Indexing

Most indexing systems include routine pre-processing, for example removing punctuation and standardising the treatment of abbreviations and letter case, and we have done the same.

A large-scale lexical database is constructed for the search engine. Although our retrieval test collections are all small, the index table is large: even for a 20-minute programme, about a thousand entries are inserted into the index after the source subtitles have been processed, and that processing already removes redundant words such as stop words.

The indexing task includes formatting the subtitles file (filtering out information that is useless for searching and retrieval), case folding, removing punctuation, stopping, and stemming. In stopping, function words and other “useless” words such as articles are eliminated, so that operational indexing and searching are confined to content words. We have combined some standard stop lists and modified them slightly to suit our needs (see Appendix B). Stemming normalises term forms, and we have rewritten the Porter stemmer [port] as a Java bean. The indexing vocabulary that remains after pre-processing, stopping and stemming is then inserted into the index table, which holds the roots of all words in the video collection.
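The pre-processing chain described above can be illustrated with a minimal Java sketch. It is not the system's actual code: the class name SubtitlePreprocessor, the method indexTerms and the tiny stop list are hypothetical, and the stem method is only an identity placeholder standing in for the Porter stemmer [port].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SubtitlePreprocessor {
    // Tiny illustrative stop list (an assumption); the real list is in Appendix B.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "is", "it"));

    // Identity placeholder for the stemming step (hypothetical hook).
    // The real system would call its Porter stemmer bean here.
    static String stem(String term) {
        return term;
    }

    // Turns one line of subtitle text into a list of index terms.
    static List<String> indexTerms(String subtitleLine) {
        // Case folding.
        String folded = subtitleLine.toLowerCase(Locale.ROOT);
        // Punctuation removal: anything that is not an ASCII letter, digit
        // or whitespace becomes a space (so "don't" splits into "don", "t").
        String cleaned = folded.replaceAll("[^a-z0-9\\s]", " ");
        List<String> terms = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            // Stopping: drop empty tokens and stop words.
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            // Stemming (placeholder).
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [prime, minister, spoke, bbc, london]
        System.out.println(indexTerms("The Prime Minister spoke to the BBC, in London."));
    }
}
```

Keeping each step as a separate statement makes it straightforward to swap in the real stop list or stemmer without touching the rest of the chain.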

Searching and ranking

The search engine of the Subtitles Search System mimics web search engines. Most functions are implemented as SQL operations on a Microsoft Access database accessed through the JDBC-ODBC bridge. Relevance assessments are carried out only on simple search output, in which query terms are combined with a default Boolean “OR”.
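As an illustration of a default Boolean “OR” search issued through JDBC, the following sketch builds a parameterised query over the index table. The table and column names (word_index, term, video_id) are assumptions, not the schema defined later in the report, and ordering videos by the number of matching index entries is only a stand-in for the relevance strategies the report discusses.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class OrSearch {
    // Builds "term = ? OR term = ? ..." over a hypothetical table
    // word_index(term, video_id). One placeholder per query term.
    static String buildOrQuery(int termCount) {
        if (termCount < 1) {
            throw new IllegalArgumentException("at least one query term is required");
        }
        String condition = String.join(" OR ", Collections.nCopies(termCount, "term = ?"));
        return "SELECT video_id, COUNT(*) AS hits FROM word_index WHERE "
                + condition + " GROUP BY video_id ORDER BY COUNT(*) DESC";
    }

    // Runs the query on an open connection and returns the matching video
    // ids, videos with the most matching index entries first (an assumption).
    static List<String> search(Connection conn, List<String> stemmedTerms) throws SQLException {
        List<String> videoIds = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(buildOrQuery(stemmedTerms.size()))) {
            for (int i = 0; i < stemmedTerms.size(); i++) {
                ps.setString(i + 1, stemmedTerms.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    videoIds.add(rs.getString("video_id"));
                }
            }
        }
        return videoIds;
    }

    public static void main(String[] args) {
        // Prints the generated SQL for a two-term query; no database is needed.
        System.out.println(buildOrQuery(2));
    }
}
```

Binding the terms as parameters rather than concatenating them into the SQL string keeps user-supplied query words from altering the statement.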

Video document retrieval

On the document retrieval side, the work studied the methods used in the mg system [witt] and developed them to search the video data files using information gathered from a parallel text corpus. A weighting function is used to adjust the retriever.

Our studies fall into two major blocks: the first covers basic indexing, and the second covers web multimedia technologies.

1.3.2 Project management

The exploratory programming model was adopted to develop this system, since the work was done by a single person who is herself able to develop software.

The idea of this model is to develop a working system (a non-throwaway prototype) as quickly as possible and then modify it until it does what it is supposed to do. The prototype is built once and then gradually improved and refined, so the process is increment driven, unlike the Waterfall model, which is document driven.

Exploratory programming is best suited to systems for which it is difficult to establish detailed specifications. There is no formal validation; instead, the programs that are created are checked for adequacy. This model has been little used in software development, because existing management techniques are not adequate to manage it and the resulting programs tend not to be well structured. It is nevertheless suitable for a small-scale project like the Subtitles Search System, and it is more natural for project management because it imitates human behaviour.

In addition, some risk analysis was carried out in the design and planning phase to identify and resolve probable risks.

1.4 Report contents

The background (chapter 2) discusses past related work on web search engines, video document retrieval systems, the individual components of the system and web multimedia techniques. This is followed by an overview of the Subtitles Search System that was produced (chapter 3).

Next, the implementation of the individual parts of the system is described (chapter 4). The first part of the system involves pre-processing and indexing subtitles, as detailed in section 4.1. The index must then be inserted into an existing database, as must the video file information, using SQL operations. The data structure of the database is defined in section 4.2. The kernel of the system, the search engine, is described under searching (section 4.5). The output of searching is then ranked by relevance; several relevance assessment strategies are discussed in section 4.6. The playback task (section 4.7) uses the Windows Media Player API, which was chosen after comparison with the Java Media Framework (JMF). Other tasks, such as query term processing, are covered in the respective sections of chapter 4.

Evaluation of the system (section 5.1) is carried out in terms of the recall and precision of the retrieval. Finally, conclusions about the techniques employed and suggestions for further work and improvements to the system are given (section 5.2).
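For reference, precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. The sketch below computes both for sets of video identifiers; the class name RetrievalMetrics and the sample identifiers are hypothetical and are not taken from the report's evaluation data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RetrievalMetrics {
    // Number of items that are both retrieved and relevant.
    static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Precision = |retrieved AND relevant| / |retrieved| (0 if nothing was retrieved).
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // Recall = |retrieved AND relevant| / |relevant| (0 if nothing is relevant).
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up example: 4 videos retrieved, 3 relevant, 2 in common.
        Set<String> retrieved = new HashSet<>(Arrays.asList("v1", "v2", "v3", "v4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("v2", "v4", "v5"));
        System.out.println("precision = " + precision(retrieved, relevant)); // 2/4
        System.out.println("recall    = " + recall(retrieved, relevant));    // 2/3
    }
}
```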