K4: Searching TV Programs Using Teletext Subtitles

Subtitles Searching System Architecture



3. Subtitles Searching System Architecture

3.1 System architecture

Figure 5. System architecture

TV programs and their accompanying subtitles are captured by a Hauppauge TV card. The programs are saved as video files (.mpg format) while the content of teletext channel 888 is written to subtitles files (plain .txt with the start time and end time of each dialogue). In the Subtitles Search System, these two steps must be done manually; automatic recording is left as future work.

The subtitles file is then pre-processed and indexed by an index constructor: every “sensible” word in each sentence is inserted into an Access database (the index table).

On the User Interface (UI) end, a user submits a query from the web interface. This executes a JSP, which calls a JavaBean to process the query terms into the same format used for subtitles indexing. The processed query is then passed to a search engine (a servlet), which runs SQL operations on the aforementioned index table.
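As a rough illustration of the kind of SQL the search servlet might issue, the sketch below builds a parameterised OR lookup with one placeholder per processed query term. The class and method names are hypothetical; the table name (index) and column names are taken from the index table listing in Figure 14, and the square brackets are an assumption made in case the table name clashes with a reserved word in Access SQL. The actual statements used by the system are not shown in this chapter.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not the system's own class. */
final class IndexQueryBuilder {

    /**
     * Builds a SELECT with one "keyword = ?" clause per term, joined by OR.
     * The values would be bound later through a JDBC PreparedStatement.
     */
    static String buildOrQuery(List<String> terms) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("at least one search term is required");
        }
        StringBuilder sql = new StringBuilder(
                "SELECT keyword, progId, startTime, blockNo, freq FROM [index] WHERE ");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                sql.append(" OR ");
            }
            sql.append("keyword = ?");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Two processed terms produce two OR-ed placeholders.
        System.out.println(buildOrQuery(Arrays.asList("good", "question")));
    }
}
```

Using placeholders rather than concatenating the terms into the statement keeps user input out of the SQL text.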

The search engine then ranks the results by relevance (involving frequency, word weight and adjacency). The ranked results are looked up in the program table to find the name of the video file, and the start position for playback is then set in that video file.

Finally, the relevant program clips are streamed and played on the web interface using Windows Media Player.

Figure 6 shows the level 1 data flow diagram of the overall system.

Figure 6. Level 1 data flow diagram of Subtitles Search System

3.2 Unit design

A top-down behavioural model is applied in the design phase of the Subtitles Search System. A data flow diagram (DFD) is a graphical technique that describes information flow and the transformations applied as data moves from input to output between sections of the system. It is used here for both system design and unit design, focusing on the functional aspect of each section, because it helps answer two questions: “What will the functions of the system accomplish?” and “What are the interactions between those sections?”.

The following level 2 DFDs of the Subtitles Search System show the unit design of the two kernel parts of the system: the index constructor and the search engine.

 

Figure 7. Level 2 DFD – index constructor

 

Figure 8. Level 2 DFD – search engine

3.3 Testing strategies

    There are two main tasks in our testing: unit testing and integration testing.

3.3.1 Unit Testing

1. Unit testing was carried out on the individual objects in our program to verify that each had been implemented correctly.

2. Static testing (White Box testing, structural) was used. White Box testing focuses on the program logic without regard to the software specification and is done during coding.

The programmers used the following techniques:

•       Basic path testing, where each path was exercised at least once. For instance, in the Stemmer class, each suffix rule was tested at least once.

•       Random testing, where a number of permutations of operations were tested. The test data was selected randomly to cover as many different permutations as possible.

For example, when testing the “rank results” transformer, we tried terms with different frequencies (in a block) in one program, terms with the same frequency (in a block) but different weights in one program, terms appearing in several programs, terms with the same weight in different programs, and so on.

3.3.2 Integration Testing

1.    Black Box testing (functional testing) was mostly used. This type of testing focuses on program input and output. For instance, it was used when integrating all the functions for preprocessing and indexing subtitles. Each function passed unit testing during coding and was then black-box tested once the functions were integrated as a caption parser.

2.    Its purpose is to find:

•       Incorrect or missing functions

•       Errors not caught in unit testing

•       Errors in data structures

•       Performance problems (esp. in the recall and precision evaluation of the search engine)

3.3.3 System Testing

Because the program is written in Java, which is portable across operating systems, it can run on any PC with a JVM installed, on operating systems such as Windows NT, Windows 2000, Solaris Unix and Linux.

However, the web functions are supported by Jakarta Tomcat 3.2.3, which I installed on Windows 2000. Therefore only the administrator’s part, i.e. adding a TV program (including its accompanying subtitles and program information) to the system, is portable. The other parts work only on Windows 2000 stations with the Tomcat server installed.

3.4 User manual

The Subtitles Search System is a TV program search site that locates specific TV program clips in response to users’ queries.

This manual has been designed to teach the inexperienced user how to use the features offered by the system. Let’s begin our tour of the Subtitles Search System with the administration part.

3.4.1 Administration part – adding a program

The administration part of this system has a command line interface. The administrator adds a TV program using a command offered by the system, run from the command line of either a Unix terminal or the Windows DOS command prompt.

Given a .mpg format video file and its accompanying subtitles transcript .txt file (or an .ssa format subtitles file like the one we used in the implementation), the administrator supplies the file names (with their paths if appropriate) as parameters of the command.

•       Set the environment for JDK 1.3.0_02 and Jakarta Tomcat 3.2.3, and start up Tomcat

Run a batch file called sss.bat at the DOS command prompt, or a shell script called sss.sh in a Unix/Linux environment. The content of the batch file is listed below. (The batch file is taken as the example because the Windows 2000 platform is recommended.)

set PATH=%PATH%;C:\Program Files\jdk1.3.0_02\bin

set TOMCAT_HOME=H:\jakarta-tomcat-3.2.3

set JAVA_HOME=C:\program files\jdk1.3.0_02

set CLASSPATH=H:\jakarta-tomcat-3.2.3\lib\servlet.jar;.;H:\jakarta-tomcat-3.2.3\webapps\examples\WEB-INF\classes\eunice

rem ----start the batch file offered by tomcat 3.2.3----

cd jakarta-tomcat-3.2.3\bin

startup.bat

When sss.bat finishes, a DOS prompt window titled “Tomcat 3.2” appears. When the messages “Starting HttpConnectionHandler on 8080” and “Starting Ajp12ConnectionHandler on 8007” are displayed in the window, the Tomcat server is ready and you can view web pages via http://localhost:8080/.

Figure 9. Start up Tomcat 3.2.3

When Tomcat is started, the current directory of your DOS prompt is changed to [tomcat installation directory]\bin. If you installed Tomcat at H:\jakarta-tomcat-3.2.3, your current working directory will be H:\jakarta-tomcat-3.2.3\bin after starting up Tomcat.

Go to the main page of the Subtitles Search System by entering http://localhost:8080/examples/jsp/eunice/search2.html in a web browser.

Figure 10 Search page of the Subtitles Search System

Click “Administration page” on it and go to the administration page where a new program can be added to the system.

Figure 11 Administration page of Subtitles Search System

Fill in the title of the television program, browse to the mpg file and its subtitles file, then click “Add the program”.

If the subtitles file matches the ssa format (see 4.1.1) and the program is not already in the collection, a success message (Figure 12) is returned once the subtitles have been indexed, the video file has been copied to the system directory, and the program table has been updated.

This page triggers a Java servlet named Adequacy.class, which does three things:

1.      Call IndexConstructor.class to index the subtitles file and add the words’ entries to the index table.

2.      Execute an operating system command to copy the given video file to the system movie directory.

An upload function is not used in this part because normal users are not allowed to add programs to the system; only the administrator has permission to add programs. Moreover, the original video file is expected to be on the same machine as the server (the station on which Tomcat is installed), since capturing TV programs requires specific hardware (a Hauppauge TV card for this system).

The video files of TV programs should be saved in MPEG format. All MPEG files are copied to

C:\Documents and Settings\n0700958\mov

3.      Insert the program information (title, ID, and video file name) into the system program table.
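Steps 2 and 3 above can be sketched as follows. This is only an illustration written against the modern java.nio.file API, which did not exist in the JDK 1.3 used by the system, and it copies the file directly rather than running an operating system command as the servlet does. The class and method names are hypothetical. The program ID rule (the first four letters of the video file name) is the one stated in the description of the result page in 3.4.2.2.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of the copy and program-table steps. */
final class AddProgramSketch {

    /** Derives the program ID from the video file name: class.mpg gives clas. */
    static String programId(String videoFile) {
        // Drop any directory part, whichever separator is used.
        int slash = Math.max(videoFile.lastIndexOf('/'), videoFile.lastIndexOf('\\'));
        String name = videoFile.substring(slash + 1);
        // Drop the extension, then keep at most four characters.
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            name = name.substring(0, dot);
        }
        return name.length() > 4 ? name.substring(0, 4) : name;
    }

    /** Step 2: copies the video into the movie directory and returns the new path. */
    static Path copyVideo(Path video, Path movieDir) {
        try {
            Files.createDirectories(movieDir);
            return Files.copy(video, movieDir.resolve(video.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 3: the values to insert into the program table (title, ID, file name). */
    static String[] programRow(String title, String videoFile) {
        int slash = Math.max(videoFile.lastIndexOf('/'), videoFile.lastIndexOf('\\'));
        return new String[] { title, programId(videoFile), videoFile.substring(slash + 1) };
    }

    public static void main(String[] args) {
        System.out.println(programId("class.mpg"));
    }
}
```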

Figure 12 A program is added to the system successfully

For instance, suppose you want to add a program “Test video: on class”, whose subtitles file classc.txt is at F:\dissertation\subtitles and whose mpg file class.mpg is at F:\ (see Figure 11).

As you can see in Figure 12, classc.txt has been indexed and inserted into the system index table, class.mpg has been copied to C:\Documents and Settings\n0700958\mov, and a new entry for this program has been inserted into the program table successfully. If your subtitles file is large (for a long television program), the response time may be longer because it takes time to index a big file and write a large number of keywords to the database.

If you would like to check whether the subtitles file has been inserted into the system index table, you can look at the database configured in Data Source (ODBC). Figures 13 and 14 show the original subtitles of a short clip and its index in the database.


 

Dialogue: Marked=0,0:00:01.00,0:00:03.20,*Default,,0000,0000,0000, So. hmm, are there any other questions

Dialogue: Marked=0,0:00:03.20,0:00:05.20,*Default,,0000,0000,0000, about the Aboriginal flag

Dialogue: Marked=0,0:00:05.80,0:00:07.00,*Default,,0000,0000,0000, yeah, why does it look like a fried egg

Dialogue: Marked=0,0:00:07.00,0:00:08.30,*Default,,0000,0000,0000, What I meant to ask was

Dialogue: Marked=0,0:00:08.50,0:00:11.20,*Default,,0000,0000,0000, What the different part of the flat was present

Dialogue: Marked=0,0:00:12.10,0:00:13.00,*Default,,0000,0000,0000, This is a good question. I can show

Figure 13. Original SSA format subtitles file class.txt
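To make the layout of these lines concrete, here is a minimal sketch of how one Dialogue line could be split into its start time, end time and spoken text. It assumes the nine comma-separated fields shown in Figure 13; files with a different number of fields before the text would need a different split limit. The class and field names are hypothetical and this is not the system's own parser.

```java
/** Hypothetical parser for the Dialogue lines shown in Figure 13. */
final class SsaDialogueParser {

    /** One subtitle line: start time, end time and spoken text. */
    static final class Dialogue {
        final String start;
        final String end;
        final String text;

        Dialogue(String start, String end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    private static final String PREFIX = "Dialogue:";

    /** Parses one line; returns null if it is not a well-formed Dialogue line. */
    static Dialogue parse(String line) {
        if (!line.startsWith(PREFIX)) {
            return null;
        }
        // A limit of 9 leaves commas inside the spoken text untouched.
        String[] fields = line.substring(PREFIX.length()).split(",", 9);
        if (fields.length < 9) {
            return null;
        }
        // Field 1 is the start time, field 2 the end time, field 8 the text.
        return new Dialogue(fields[1].trim(), fields[2].trim(), fields[8].trim());
    }

    public static void main(String[] args) {
        Dialogue d = parse("Dialogue: Marked=0,0:00:03.20,0:00:05.20,"
                + "*Default,,0000,0000,0000, about the Aboriginal flag");
        System.out.println(d.start + " -> " + d.end + ": " + d.text);
    }
}
```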

 

Table: index

keyword    progId   startTime   blockNo   freq
aborigin   clas     0:00:03.2   0         1
ask        clas     0:00:07.0   0         1
egg        clas     0:00:05.8   0         1
flag       clas     0:00:03.2   0         1
flat       clas     0:00:08.5   0         1
fried      clas     0:00:05.8   0         1
good       clas     0:00:12.1   0         1
hmm        clas     0:00:01.0   0         1
look       clas     0:00:05.8   0         1
meant      clas     0:00:07.0   0         1
part       clas     0:00:08.5   0         1
question   clas     0:00:01.0   0         1
question   clas     0:00:12.1   0         1
show       clas     0:00:12.1   0         1

Figure 14. Entries of class.txt in index table

Figure 15 is the program table of the system. It shows that the program “Test video: On class” has been added to it. Figure 16 shows that the video file class.mpg has been copied to the system video collection directory.

Figure 15 Program table of Subtitles Search System

Figure 16 Video collection of Subtitles Search System

3.4.2 User interface – searching keywords through programs

3.4.2.1 Search page

Figure 17. Search page of Subtitles Search System

Figure 17 shows the search page of SSS. The user can type any words to search for in the transcripts into the text field.

The features of the SSS search engine are summarised in the following table, in the same form as in 2.1.

Search Engines   Boolean   Default   Proximity         Truncation   Case   Fields         Stop   Sorting
SSS              OR        OR        No exact phrase   No           No     TV subtitles   Yes    Relevance

Table 2. Features of search engine in SSS

Boolean operators – Only Boolean OR is supported, and the user need not put any Boolean operators between words. The system automatically executes an OR search if there is more than one keyword. Boolean operators, including AND, NOT, OR, parentheses, + and -, are stopped (removed) when the query term is processed.

Exact phrase searching – Because exact phrase searching is not supported, there is no need to enclose a phrase in quotation marks, e.g. “thousand years”; the quotation marks will be removed as punctuation when the query term is processed. In fact, thousand OR year will be searched as keywords, so the results might not contain the exact words in sequence.

Usually this does not affect the precision of searching, since the results are ranked by relevance. In subtitles searching, the sentences containing the exact phrase searched for will in most cases be listed near the top of the result page.

Truncation – No wild cards are allowed in the Subtitles Searching System. If you enter colo* or colo? in the text field, the engine will search for colo, since the wild card * or ? is stopped as punctuation. If you place a wild card in the middle of a word, e.g. thous*nd (for thousand), the engine will search for thousnd and you won’t get any matching results.

Case sensitive – the search engine of Subtitles Searching System is case insensitive.

Stop – stop list of this system is listed at Appendix B.

Sorting – ranked by relevance. For relevance measurement please see 4.5.
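The query handling described in the list above can be approximated by the sketch below: case folding, deletion of punctuation (including wild cards and quotation marks), removal of stop words, and removal of a plural -s. It is a simplified stand-in, not the system's JavaBean: the stop list here is a tiny illustrative sample (the real one is in Appendix B), the plural rule stands in for the full Stemmer class, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, simplified query processor. */
final class QueryNormalizer {

    // Illustrative sample only; assumes Boolean operator words are on the stop list.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "not", "of", "or", "the"));

    /** Turns raw user input into the list of keywords to look up. */
    static List<String> normalize(String query) {
        // Fold case, then delete everything except letters, digits and whitespace,
        // so thous*nd becomes thousnd and quotation marks disappear.
        String cleaned = query.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", "");
        List<String> terms = new ArrayList<>();
        for (String word : cleaned.trim().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            terms.add(stripPlural(word));
        }
        return terms;
    }

    /** Removes a trailing plural s, leaving short words and words ending in ss alone. */
    static String stripPlural(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"thousand years\""));
        System.out.println(normalize("colo*"));
        System.out.println(normalize("thous*nd"));
    }
}
```

Because the remaining keywords are always combined with OR, the order of the returned list does not matter for the lookup itself.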

3.4.2.2 Result page

When the user clicks the “Search programs” button, the system processes the query term and then searches the index table. In Figure 17, the user searches for Good Questions, which is processed by folding upper case to lower case and removing the plural suffix –s. After processing, it becomes good question; see Figure 18.

Figure 18. Result page of Subtitles Search System

Ranked results are listed under the search keywords. Each result consists of two parts: the program ID and the start position of the block in which all or part of the keywords are said.

In the example, the result with the highest relevance is clas 0:00:01.0. clas is the program ID of the program that we just added in 3.4.1; the first four letters of the program file name are used as the program ID. So the first result indicates that the program clip is the block starting from the 1st second of the program whose ID is clas.
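The way several index entries collapse into one result per block can be illustrated with the sketch below. It is deliberately simplified: it scores a block by summing the freq values of the matching entries only, whereas the system's relevance measure (see 4.5) also involves word weight and adjacency. Taking the earliest start time among the matching entries as the block's start position is likewise an assumption, as are all class and method names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: groups index hits by block and ranks by summed frequency. */
final class BlockRanker {

    /** One index-table row that matched a query keyword. */
    static final class Hit {
        final String progId;
        final String startTime;
        final int blockNo;
        final int freq;

        Hit(String progId, String startTime, int blockNo, int freq) {
            this.progId = progId;
            this.startTime = startTime;
            this.blockNo = blockNo;
            this.freq = freq;
        }
    }

    /** One ranked result: a block of a program with its start position and score. */
    static final class Result {
        final String progId;
        final int blockNo;
        String startTime;
        int score;

        Result(String progId, int blockNo, String startTime) {
            this.progId = progId;
            this.blockNo = blockNo;
            this.startTime = startTime;
        }
    }

    /** Converts an H:MM:SS.f timestamp such as 0:00:12.1 to seconds. */
    static double toSeconds(String time) {
        String[] parts = time.split(":");
        return Integer.parseInt(parts[0]) * 3600
                + Integer.parseInt(parts[1]) * 60
                + Double.parseDouble(parts[2]);
    }

    /** Groups hits by (progId, blockNo) and sorts blocks by descending score. */
    static List<Result> rank(List<Hit> hits) {
        Map<String, Result> blocks = new LinkedHashMap<>();
        for (Hit hit : hits) {
            String key = hit.progId + "#" + hit.blockNo;
            Result block = blocks.get(key);
            if (block == null) {
                block = new Result(hit.progId, hit.blockNo, hit.startTime);
                blocks.put(key, block);
            } else if (toSeconds(hit.startTime) < toSeconds(block.startTime)) {
                block.startTime = hit.startTime; // keep the earliest start
            }
            block.score += hit.freq;
        }
        List<Result> ranked = new ArrayList<>(blocks.values());
        ranked.sort((a, b) -> Integer.compare(b.score, a.score));
        return ranked;
    }

    public static void main(String[] args) {
        // The three Figure 14 entries matching "good question".
        List<Hit> hits = Arrays.asList(
                new Hit("clas", "0:00:12.1", 0, 1),  // good
                new Hit("clas", "0:00:01.0", 0, 1),  // question
                new Hit("clas", "0:00:12.1", 0, 1)); // question
        for (Result r : rank(hits)) {
            System.out.println(r.progId + " " + r.startTime + " score=" + r.score);
        }
    }
}
```

With the three matching entries from Figure 14, the sketch yields a single result for program clas, block 0, starting at 0:00:01.0.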

All the results are hyperlinks. Clicking a link plays back the corresponding program clip. If we click clas 0:00:01.0, the playback page is displayed as in Figure 19.

3.4.2.3 Playback page

Using Windows Media Player, the system plays back the selected clip on the playback page. The start position is set to the time displayed on the link.

In our example, the video is played starting from the 1st second of class.mpg. If you wear headphones, you can hear good once and question twice in this clip. (Since this is a test video lasting 14 seconds, the program has only one clip.)

Figure 19. Playback page of Subtitles Search System

You can also control playback using the control panel under the video area. You can pause, stop, or replay the video by clicking the relevant buttons, which are similar to those on a cassette player. This control component is provided by Windows Media Player itself.

The status bar under the control panel shows the status (“playing” or “stopped”) and time information of the video file. If you pay attention to the time information at the beginning of playback, you may note that the starting position is exactly the time listed on the hyperlink of result page.

3.4.2.4 Browse programs

If you click the “Browse programs” link on the search page (Figure 17), you go to the browse page (Figure 20), which lists all available TV programs in the collection. You can also view a video by clicking the link on its file name.

This page simply prints out the content of the program table and adds a playback hyperlink to each video file name so that users can view the whole program.

Figure 20 Browse page of Subtitles Search System