K5: Searching TV Programs Using Teletext Subtitles

4. Indexing, Searching and Playback on Web Interface – Implementation

4.1 Index constructor

4.1.1 Format of subtitles

The test subtitle files used in implementation and testing are in Sub Station Alpha (SSA) format, downloaded from http://gsoft.smu.edu/ScriptCrypt/crypt.html

Sub Station Alpha is a freeware video titling and subtitling program which runs under Windows on a PC. The .ssa file is the standard format accepted as input by the program.

The format of Sub Station Alpha (.ssa)

Dialogue: Marked=0,0:08:35.0,0:08:35.7,*Default,,0000,0000,0000,As you wish.

Dialogue: Marked=0,0:08:37.0,0:08:38.8,*Default,,0000,0000,0000,This is Cefiro.

Dialogue: Marked=0,0:08:38.8,0:08:43.2,*Default,,0000,0000,0000,You have been summoned by\NPrincess Emeraude to become\NMagic Knights and save Cefiro.

Dialogue: Marked=0,0:08:44.1,0:08:47.7,*Default,,0000,0000,0000,Why are you talking like a snob?\NAren't you younger than us?

Dialogue: Marked=0,0:08:47.7,0:08:49.3,*Default,,0000,0000,0000,Let go!  Stop joking around!

All information other than the start time, end time and dialogue is purged in pre-processing.

In addition to the spoken content of television programs, subtitles contain data regarding changes of speaker (indicated by different colors) and background noises.

Because the subtitle files are captured by a program from the VT2000, their format could be tailored to requirements to some degree.

A tagged format containing color information of subtitles might be useful for segmentation and identifying commercials.

0:08:38.8,0:08:43.2, <yellow>Man, what you supposed to be doing here? </yellow> <green>Forging a new home for mankind in the depths of space.</green>

However, this is not available for this project because of technical limitations, so the .ssa format is used in the implementation and testing of the system.

4.1.2 Preprocess subtitles file – remove redundant information

4.1.2.1 Case folding

Case folding changes all letters to lower case (or upper case) to avoid case mismatching.

Case folding inevitably causes a problem: missing acronyms. For example, POP (Post Office Protocol) is the same as pop after case folding, which causes many irrelevant search results. Moreover, if the lower case of an acronym happens to be the same as a stop word, the acronym becomes unsearchable: IE (Microsoft Internet Explorer), for example, folds to ie (Latin, that is), which is removed from the subtitles file as a stop word.

Most web search engines that support case-insensitive search encounter this problem too. There seems to be no way to have it both ways short of checking an acronym list while folding case. However, doing so is hardly worthwhile, because acronyms produce few irrelevant results.
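If acronym checking were nevertheless desired, the folding step could consult a small list before lowercasing each token. A minimal sketch follows; the acronym list here is hypothetical, and a real system would load it from a file alongside the stop list.

```java
import java.util.Arrays;
import java.util.List;

public class CaseFolder
{
    // Hypothetical acronym list; a real system would load this from a file
    private static final List<String> ACRONYMS = Arrays.asList("POP", "IE", "TV");

    /* Fold a token to lower case unless it is a known acronym. */
    public static String fold(String token)
    {
        if (ACRONYMS.contains(token))
            return token;           // preserve acronyms verbatim
        return token.toLowerCase();
    }
}
```

Note that only exact upper-case matches are preserved, so ordinary words such as "pop" still fold normally.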

The implementation of case folding is straightforward, since it is a method of java.lang.String.

      s3=s3.toLowerCase();          // case folding

(see Appendix A: IndexConstructor.java)

4.1.2.2 Stopping words

A very small fraction of the total vocabulary accounts for about 30 percent of the corpus. Normally, the more frequently a word appears, the less information it carries in searching. This is the theoretical basis of stopping.

A stop list was generated for the Subtitles Search System, partially based on some existing lists from the Internet [dtic00]. The stop list should be adjusted to the content of the corpus, because words that are "meaningful" for one corpus might be "senseless" for another. For instance, travel may be meaningless in a travel agency collection, but significant for the Subtitles Search System. This system has no specific words that need to be stopped other than the most frequently used words of general English.

The index constructor assumes that its users will not be interested in these terms. Two problems may therefore occur.

A famous example is searching for "To be or not to be" in a Shakespeare concordance. All the words making up this query are stopped, so the phrase cannot be found in the index; the only way to locate it would be to scan the entire text.

The other problem is homonyms. Some of the most common words, such as can, may and will, are verbs that are homonyms of rarer nouns. If the verb will is stopped, the noun meaning "a declaration of a person's wishes for the disposal of his property after death" is stopped along with it.

There are two ways to solve these problems. The first is to include stop words in the index using a representation that makes them efficient to locate while enabling exact phrase searching. The second is the semantic database referred to in 2.5.1.
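The first approach can be illustrated with a toy positional index that keeps stop words and records the position of every word, so that an exact phrase such as "to be or not to be" remains findable. This is a sketch for illustration only; the actual system does not implement it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class PositionalIndex
{
    // term -> list of word positions; stop words are kept so that
    // exact phrases such as "to be or not to be" stay searchable
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    public void add(String text)
    {
        StringTokenizer st = new StringTokenizer(text, " ");
        int pos = 0;
        while (st.hasMoreTokens())
        {
            String token = st.nextToken().toLowerCase();
            List<Integer> list = postings.get(token);
            if (list == null)
            {
                list = new ArrayList<Integer>();
                postings.put(token, list);
            }
            list.add(pos++);
        }
    }

    /* True if the words of the phrase occur at consecutive positions. */
    public boolean containsPhrase(String phrase)
    {
        String[] words = phrase.toLowerCase().split(" ");
        List<Integer> starts = postings.get(words[0]);
        if (starts == null) return false;
        for (int start : starts)
        {
            boolean ok = true;
            for (int i = 1; i < words.length && ok; i++)
            {
                List<Integer> p = postings.get(words[i]);
                ok = (p != null) && p.contains(Integer.valueOf(start + i));
            }
            if (ok) return true;
        }
        return false;
    }
}
```

The cost of this representation is a larger index, which is exactly the trade-off stopping is meant to avoid.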

The implementation of stopping involves two methods: readStopList() reads stop words from a file into an array of String, and removeWords() removes stop words from a string.

 

private String removeWords(String s, String[] words)
{
    StringTokenizer st = new StringTokenizer(s," ");
    String word, line = " ";
    while (st.hasMoreTokens())
    {
      word = st.nextToken();
      boolean found = false;
      for (int i=0; i<words.length && !found; i++)
      {
        if (word.equals(words[i]))
          found = true;
      }
      if (!found)
        line = line+word+" ";
    }
    return line;
}

 

private void readStopList(String lst, String[] words)
{
    try
    {
      fStopLst = new BufferedReader(new FileReader(lst));
      String word;
      int i = 0;
      while (( word=fStopLst.readLine() ) != null)
      {
        words[i++] = word;
      }
      fStopLst.close();
    }
    catch (IOException ex)
    {
      System.err.println("File error: "+ex.toString());
      System.exit(1);
    }
}

 

/* in calling method */

readStopList(STOPLIST_FILE,stopWords);  //read stop list to array stopWords

s3 = removeWords(s3,stopWords);         // stop words

4.1.2.3 Removing punctuation

Punctuation is removed twice during processing: once before stopping and once after.

The before-stopping pass removes all punctuation except the apostrophe, since contracted words like you'd, it's, don't and where's should be eliminated by stopping.

After stopping, only apostrophe-s endings remain; these are then cleared away.

 

s3 = removePunc(s3);            // remove punctuation except apostrophe
s3 = removeWords(s3,stopWords); // stop words
s3 = removeApos(s3);            // remove 's

 

private String removePunc(String s)
{
  String noPuncString = "";
  for (int i=0; i<s.length(); i++)
  {
    char c = s.charAt(i);
    if ((c<'!' || c>'/' || c=='\'' || c=='-') && c!='?' && c!=':' && c!=';')
        noPuncString = noPuncString+c;
    else
        noPuncString = noPuncString+" ";
  }
  return noPuncString;
}

 

private String removeApos(String s)
{
    int loc = s.indexOf("'s");   // first occurrence only
    String noApos = s;
    if (loc>0)
      noApos = s.substring(0,loc)+s.substring(loc+2);
    return noApos;
}

Having executed the three statements removePunc(), removeWords() and removeApos(), some dialogues become empty because all their words have been eliminated, such as the well-known "To be or not to be."

Hence the next step is removing empty lines caused by stopping.

      if (!isEmptyLine(s3))         // remove empty lines
      {
        s3=parse(s3);
        insertDB(s3,s1,block,insertWordStmt,updateFreqStmt);
      }

Further indexing, which inserts keyword entries into the inverted index table, executes only if the line is not empty; otherwise nothing is done and the enclosing loop continues with the next line.

 

private boolean isEmptyLine(String s)
{
    return s.length()==1;
}

The test is s.length()==1 rather than s.length()==0, because removeWords() always begins the line with a space as a word-boundary delimiter; a line from which every word has been removed therefore consists of a single space character.

4.1.2.4 Stemming

Stemming is related to truncation. It usually refers to the ability of a search engine to find word variants such as plural and singular forms, past and present tense, etc. Simple stemming covers only plural and singular forms; sophisticated stemming can remove all suffixes and other inflections, such as substantivizing suffixes.

Actual implementation of a stemmer requires a detailed knowledge of English. The suffixes -s, -ed, -ing, -ly, etc. are easy. But exceptions to these simple rules are soon found, and further rules are added to prevent them from being stripped; then exceptions to the exceptions are identified, and so on. The stemmer of the Subtitles Search System is based on the Porter stemmer [port], which has about 80 rules and exceptions.

It might be more suitable to implement this process in a logic programming language (e.g. CLIPS), because stemming uses a rule-firing model. Here Java is used to keep the whole system consistent, and the stemmer is implemented as a JavaBean to facilitate reuse.

 

public class Stemmer
{
  private String originalWord = null;
  private String root         = null;
  private int ruleId          = 0;
  private final static String LAMBDA = "";

  // static RuleList here

  public Stemmer() { }  // Constructor, do nothing

  /* bean setter */
  public void setOriginalWord(String w)
  {
    originalWord = w;
  }

  /* bean getter */
  public String getRoot()
  {
    if ( originalWord != null )   // guard against unset property
    {
      root = "";
      StringTokenizer st = new StringTokenizer(originalWord," ");
      while (st.hasMoreTokens())
      {
        String anOriginalWord = st.nextToken();
        root = root + stem(anOriginalWord) + " ";
      }
    }
    return root;
  }

  public String getOriginalWord()
  {
    return originalWord;
  }

  /* stem() */
  public String stem(String s)
  {
      // stem the input String
  }
}

The bean setter and getter provide access to the private properties of the Stemmer class, so both JSPs and other Java programs can use it. In this system the stemmer is used in stem.jsp to process query terms and in IndexConstructor.java to index subtitles.

stem.jsp reuses the bean by setting the property originalWord and getting the property root after the words have been stemmed.

The rules for removing suffixes are listed as the following:

 

  static RuleList[] step0_rules={
        new RuleList(218,  "iveness", "ive",   6,  2,  1),
        new RuleList(219,  "fulnes",  "ful",   5,  2,  1),
        new RuleList(220,  "ousness", "ous",   6,  2,  1),
        new RuleList(309,  "ness",    LAMBDA,  3, -1,  1),
      };

  static RuleList[] step0a_rules={
        new RuleList(419,  "ous",  LAMBDA,  2, -1,  1)};

  static RuleList[] step1a_rules={
        new RuleList(101,  "sses", "ss",   3,  1, 0),
        new RuleList(102,  "ies",  "y",    2,  0, 1),
        new RuleList(103,  "ss",   "ss",   1,  1, 0),
        new RuleList(104,  "s",    LAMBDA, 0, -1, 0),
      };

  static RuleList[] step1b_rules={
        new RuleList(105,  "eed", "ee",    2,  1, 1),
        new RuleList(106,  "ed",  LAMBDA,  1, -1, 1),
        new RuleList(107,  "ing", LAMBDA,  2, -1, 1),
      };

  static RuleList[] step1b1_rules={
        new RuleList(108,  "at", "ate",   1,  2, 0),
        new RuleList(109,  "bl", "ble",   1,  2, 0),
        new RuleList(110,  "iz", "ize",   1,  2, 0),
        new RuleList(111,  "bb", "b",     1,  0, 0),
        new RuleList(112,  "dd", "d",     1,  0, 0),
        new RuleList(113,  "ff", "f",     1,  0, 0),
        new RuleList(114,  "gg", "g",     1,  0, 0),
        new RuleList(115,  "mm", "m",     1,  0, 0),
        new RuleList(116,  "nn", "n",     1,  0, 0),
        new RuleList(117,  "pp", "p",     1,  0, 0),
        new RuleList(118,  "rr", "r",     1,  0, 0),
        new RuleList(119,  "tt", "t",     1,  0, 0),
        new RuleList(120,  "ww", "w",     1,  0, 0),
        new RuleList(121,  "xx", "x",     1,  0, 0),
      };

  static RuleList[] step2_rules={
        new RuleList(203,  "ational", "ate",   6,  2,  1),
        new RuleList(204,  "tional",  "tion",  5,  3,  1),
        new RuleList(205,  "ency",    "ence",  3,  3,  1),
        new RuleList(206,  "ancy",    "ance",  3,  3,  1),
        new RuleList(207,  "izer",    "ize",   3,  2,  1),
        new RuleList(208,  "ably",    "able",  3,  3,  1),
        new RuleList(209,  "ally",    "al",    3,  1,  1),
        new RuleList(210,  "ently",   "ent",   4,  2,  1),
        new RuleList(211,  "ely",     "e",     2,  0,  1),
        new RuleList(225,  "ily",     "y",     2,  0,  1),
        new RuleList(213,  "ously",   "ous",   4,  2,  1),
        new RuleList(224,  "ly",      LAMBDA,  1, -1,  2),
        new RuleList(214,  "ization", "ize",   6,  2,  1),
        new RuleList(215,  "ation",   "ate",   4,  2,  1),
        new RuleList(216,  "ator",    "ate",   3,  2,  1),
        new RuleList(217,  "alism",   "al",    4,  1,  1),
        new RuleList(221,  "ality",   "al",    4,  1,  1),
        new RuleList(222,  "ivity",   "ive",   4,  2,  1),
        new RuleList(223,  "bility",  "ble",   5,  2,  1),
      };

  static RuleList[] step3_rules={
        new RuleList(301,  "icate", "ic",    4,  1,  1),
        new RuleList(302,  "ative", LAMBDA,  4, -1,  1),
        new RuleList(303,  "alize", "al",    4,  1,  1),
        new RuleList(304,  "icity", "ic",    4,  1,  1),
        new RuleList(305,  "ical",  "ic",    3,  1,  1),
        new RuleList(308,  "ful",   LAMBDA,  2, -1,  1),
      };

  static RuleList[] step4_rules={
        new RuleList(401,  "al",   LAMBDA,  1, -1,  2),
        new RuleList(402,  "ance", LAMBDA,  3, -1,  2),
        new RuleList(403,  "ence", LAMBDA,  3, -1,  2),
        new RuleList(425,  "ier",  "y",     2,  0,  1),
        new RuleList(405,  "er",   LAMBDA,  1, -1,  2),
        new RuleList(406,  "ic",   LAMBDA,  1, -1,  2),
        new RuleList(407,  "able", LAMBDA,  3, -1,  2),
        new RuleList(408,  "ible", LAMBDA,  3, -1,  2),
        new RuleList(409,  "ant",  LAMBDA,  2, -1,  2),
        new RuleList(410,  "ement",LAMBDA,  4, -1,  2),
        new RuleList(411,  "ment", LAMBDA,  3, -1,  2),
        new RuleList(412,  "ent",  LAMBDA,  2, -1,  2),
        new RuleList(423,  "sion", "s",     3,  0,  2),
        new RuleList(424,  "tion", "t",     3,  0,  2),
        new RuleList(415,  "ou",   LAMBDA,  1, -1,  2),
        new RuleList(416,  "ism",  LAMBDA,  2, -1,  2),
        new RuleList(417,  "ate",  LAMBDA,  2, -1,  2),
        new RuleList(418,  "iti",  LAMBDA,  2, -1,  2),
        new RuleList(419,  "ous",  LAMBDA,  2, -1,  2),
        new RuleList(420,  "ive",  LAMBDA,  2, -1,  2),
        new RuleList(421,  "ize",  LAMBDA,  2, -1,  2),
      };

  static RuleList[] stepAddE_rules={
        new RuleList(122,  LAMBDA, "e",  -1,  0, 0)};

  static RuleList[] step5a_rules={
        new RuleList(501,  "e",  LAMBDA,  0, -1, 3)};

  static RuleList[] step5c_rules={
        new RuleList(502,  "e",  LAMBDA,  0, -1, 0)};

  static RuleList[] step5b_rules={
        new RuleList(503,  "ll", "l",  1,  0, 2),
        new RuleList(504,  "mm", "m",  1,  0, 1),
      };

Every RuleList array contains a group of rules, and within each group only one rule can fire at a time. Rules listed earlier in a group therefore have higher priority than later ones in the same group. That is the reason rule 103 exists.

 

  static RuleList[] step1a_rules={
        new RuleList(101,  "sses", "ss",   3,  1, 0),
        new RuleList(102,  "ies",  "y",    2,  0, 1),
        new RuleList(103,  "ss",   "ss",   1,  1, 0),
        new RuleList(104,  "s",    LAMBDA, 0, -1, 0),
      };

By itself rule 103 does nothing, since it replaces "ss" with itself. But if the rule were left out, words like "guess" would be stemmed to "gues" by triggering rule 104.

For the same reason, long suffixes usually have higher priority than short ones with the same ending. For example,

        new RuleList(425,  "ier",  "y",     2,  0,  1),
        new RuleList(405,  "er",   LAMBDA,  1, -1,  2),

Rule 425 appears before rule 405. Otherwise, words like "easier" and "tinier" would be stemmed to "easi" and "tini" instead of their correct roots "easy" and "tiny".
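This ordering effect can be demonstrated with a stripped-down rule applier that, like replaceEnd(), fires only the first matching rule in a group. The class and rule tables here are a simplified sketch that ignores the min_root_size condition.

```java
public class RuleOrderDemo
{
    /* Apply the first rule whose suffix matches, as replaceEnd() does.
       Each rule is a {oldSuffix, newSuffix} pair. */
    public static String applyFirst(String word, String[][] rules)
    {
        for (String[] rule : rules)
            if (word.endsWith(rule[0]))
                return word.substring(0, word.length() - rule[0].length()) + rule[1];
        return word;   // no rule fired
    }
}
```

With the long suffix first, "easier" becomes "easy"; with the order reversed, the shorter "er" rule fires first and leaves the wrong root "easi".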

Syllable counts play an important role in the stemmer. The last member variable of RuleList is the minimal syllable count (min_root_size), which decides whether a rule can fire when a word ending matches the rule's old suffix.

 

class RuleList
{
  public int    id;           // returned if rule fired
  public String old_end;      // suffix replaced
  public String new_end;      // suffix replacement
  public int    old_offset;   // from end of word to start of suffix,
                              // i.e. (length of old_end) - 1
  public int    new_offset;   // from beginning to end of new suffix,
                              // i.e. (length of new_end) - 1
  public int    min_root_size;  // min root wordSize for replacement

  public RuleList(int ruleId, String old, String newOs, int oldL,
                  int newL, int minRoot)
  {
    id           = ruleId;
    old_end      = old;
    new_end      = newOs;
    old_offset   = oldL;
    new_offset   = newL;
    min_root_size= minRoot;
  }
}

To protect letters that are part of a root from being removed as a suffix, min_root_size is set according to lexical rules. Take rule 405 as an example again (RuleList(405, "er", LAMBDA, 1, -1, 2)): setting min_root_size to 2 protects "anger", "peer" and "tier" from being stemmed to "ang", "pe" and "ti", since their syllable counts are not greater than min_root_size. One condition for firing a rule is that the syllable count of the word must be greater than the min_root_size of the rule.
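The syllable-count test relies on a wordSize() routine whose actual code is in Appendix A. A minimal sketch of such a measure, following Porter's definition of the measure m as the number of vowel-to-consonant transitions, might look like the following; the Appendix A version may differ in detail.

```java
public class WordSize
{
    private static boolean isVowel(String s, int i)
    {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        // 'y' counts as a vowel when it follows a consonant
        return c == 'y' && i > 0 && !isVowel(s, i - 1);
    }

    /* Porter's measure m: number of vowel-to-consonant transitions. */
    public static int wordSize(String s)
    {
        int m = 0;
        for (int i = 1; i < s.length(); i++)
            if (isVowel(s, i - 1) && !isVowel(s, i))
                m++;
        return m;
    }
}
```

Under this definition "anger" has measure 2, so the test min_root_size < wordSize fails for rule 405 (2 < 2 is false) and "anger" is left intact, as intended.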

However, there is a trade-off between precision and coverage here. Some two-syllable words (e.g. "writer", "fighter") cannot be stemmed by this rule when min_root_size equals 2; they are not covered by the stemmer and are left unchanged. Covering such cases would cost precision in cases like "anger".

Higher precision is preferred in this system, since the lower coverage of the stemmer does not harm search results; it merely makes the index occupy more disk space for index terms and hence somewhat less efficient.

The cost of saving memory and speeding up search is the loss of exact matching. Stemming allows a little leeway when matching query terms to index terms, so that, for example, "emergency", "emerging" and "emerged" are all accepted as equivalent to "emerg".

The core method of Stemmer is stem(), shown in the following program.

 

/********* stem() *********/
  public String stem(String s)
  {
    s = processB4Stem(s);
    for (int i=0; i<s.length(); i++)
      if ( !Character.isLetter(s.charAt(i)) )
          return s;

    /* Run through the Porter algorithm */
    s = replaceEnd( s, step0_rules );
    s = replaceEnd( s, step0a_rules );
    if ( ruleId==419 && addAnE(s))
      s = replaceEnd( s, stepAddE_rules );

    s = replaceEnd( s, step1a_rules );
    s = replaceEnd( s, step1b_rules );
    if ( 106 == ruleId || 107 == ruleId )
      s = replaceEnd( s, step1b1_rules );
    if ( (106 == ruleId||107 == ruleId) && addAnE(s))
      s = replaceEnd( s, stepAddE_rules );

    s = replaceEnd( s, step2_rules );
    s = replaceEnd( s, step3_rules );
    s = replaceEnd( s, step4_rules );
    if ( (ruleId==405||ruleId==407) && addAnE(s))
      s = replaceEnd( s, stepAddE_rules );

    s = replaceEnd( s, step5a_rules );
    s = replaceEnd( s, step5b_rules );
    return s;
  }

 

/********* ********* ********* ********* ********* *********
*                        replaceEnd()
* Returns: String -- the word, with its suffix replaced if a
*          rule fired; the id of the fired rule is stored in
*          ruleId (0 if none fired)
*
* Purpose: Apply a set of rules to replace the suffix of a word
*
* Plan:    Loop through the rule set until a match meeting all
*          conditions is found. Conditions on the length of the
*          root are checked as part of this function's processing
*          because this check is so often made.
*
* Notes:   This is the main routine driving the stemmer. It goes
*          through a set of suffix replacement rules looking for a
*          match on the current suffix. When it finds one, if the
*          root of the word is long enough, and it meets whatever
*          other conditions are required, then the suffix is
*          replaced, and the function returns.
********** ********* ********* ********* ********* *********/

  private String replaceEnd(String s, RuleList[] r)
  {
    int ending;   // set to start of possible stemmed suffix
    boolean changed = false;

    for (int i=0; i<r.length && !changed; i++)
    {
      ending = s.length()-1 - r[i].old_offset;
      if ( ending >= 0 )
        if ( 0 == s.substring(ending).compareTo(r[i].old_end))
          if ( r[i].min_root_size < wordSize(s) )
          {
            s = s.substring(0,ending)+r[i].new_end;
            changed = true;
            ruleId  = r[i].id;
          }
    }
    return s;
  }

stem() first checks whether the word contains a non-alphabetic character; if so, the word is not stemmed and is returned in its original form. This usually filters out compound words joined with a hyphen.

It then simply runs through the RuleLists. replaceEnd() is the main routine of stem(); it judges whether a rule can fire and, if so, strips the suffix by replacing the rule's old ending with its new one.

This seems cumbersome compared with the original Porter stemmer [port], which used pointers to functions to implement rule firing; the reason is that Java does not support function pointers.

stem() also involves some small routines such as endsWithCVC() and endsWith_Ce(), which judge the ending of a word; such endings are often conditions for firing a rule. The code of these methods can be found in Appendix A.

The calling program of Stemmer is IndexConstructor. In its main(),

 

if (!isEmptyLine(s3))   // skip empty lines
{
  s3=parse(s3);         // stem every word in the line
  insertDB(s3,s1,block,insertWordStmt,updateFreqStmt);
      // insert terms appearing in the line into the index table
}

 

private String parse(String s)
{
    StringTokenizer st=new StringTokenizer(s," ");
    String word, line = " ";
    Stemmer suffixRemover = new Stemmer();   // one instance suffices

    while (st.hasMoreTokens())
    {
      word = st.nextToken();
      word = suffixRemover.stem(word);
      line = line + word + " " ;
    }
    return line;
}

4.1.3 Indexing – making concordance

4.1.3.1 Inverted index file

An index is a mechanism for locating a given term in a text. There are many ways to achieve this. In applications involving text, the single most suitable structure is an inverted file, used as a concordance.

An inverted file contains, for each term in the lexicon, an inverted list that stores pointers to all occurrences of that term in the main text, where each pointer is the number of the document in which the term appears. This is perhaps the most natural indexing method, corresponding closely to the index of a book and to the traditional use of concordances as an adjunct to the study of classical writings such as Shakespeare's works.

An inverted file index also requires a vocabulary: a list of all terms that appear in the database. It supports a mapping from terms to their corresponding inverted lists, and in its simplest form is a list of strings and disk addresses.

As an example of an inverted file index, consider Table 3, which has the structure given in 4.2.

keyword    progId     blockNo    freq

look        baku        16          1

look        baku        18          1

look        clas        0           1

look        mkr0        3           2

lost        baku        8           1

lot         baku        4           1

love        baku        8           1

lucky       baku        14          1

mad         baku        9           1

mad         baku        12          1

magic       baku        5           1

magic       baku        19          1

magic       mkr0        4           3

Table 3 Part of the index table of Subtitles Search System

The inverted file generated for the text of this part is shown in Table 4, where the third attribute of each tuple contains triples of program, block number and frequency in that block.

 

Number  keyword  (program, block, freq)
1       look     {(baku,16,1),(baku,18,1),(clas,0,1),(mkr0,3,2)}
2       lost     {(baku,8,1)}
3       lot      {(baku,4,1)}
4       love     {(baku,8,1)}
5       lucky    {(baku,14,1)}
6       mad      {(baku,9,1),(baku,12,1)}
7       magic    {(baku,5,1),(baku,19,1),(mkr0,4,3)}

Table 4 Inverted index file of Table 3

A query involving a single term is answered by scanning its inverted list and retrieving every block that it cites. For conjunctive Boolean queries, i.e. concatenated with AND, the intersection of the terms’ inverted lists is formed. For disjunction, where the operator is OR, the union is taken; and for negation using NOT, the complement is taken.
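These three Boolean operations can be sketched as simple set operations on in-memory posting lists. This is a toy illustration only, not the system's actual SQL-based implementation; the block identifiers are hypothetical strings of the form prog:block.

```java
import java.util.Set;
import java.util.TreeSet;

public class BooleanOps
{
    /* AND: blocks containing both terms (intersection). */
    public static Set<String> and(Set<String> a, Set<String> b)
    {
        Set<String> r = new TreeSet<String>(a);
        r.retainAll(b);
        return r;
    }

    /* OR: blocks containing either term (union). */
    public static Set<String> or(Set<String> a, Set<String> b)
    {
        Set<String> r = new TreeSet<String>(a);
        r.addAll(b);
        return r;
    }

    /* NOT: blocks in the collection that do not contain the term (complement). */
    public static Set<String> not(Set<String> all, Set<String> a)
    {
        Set<String> r = new TreeSet<String>(all);
        r.removeAll(a);
        return r;
    }
}
```

Using the data of Table 4, the query lost AND love yields the single block (baku, 8), since both terms occur there.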

Since a database rather than a plain text file is used to store the index, note that the inverted file illustrated in Table 4 is not in 1NF (First Normal Form): the third attribute of the relation is not atomic, which would complicate SQL operations. Therefore an index table with the data structure listed in 4.2 is built instead.

4.1.3.2 Signature index file

A signature file is a probabilistic method for indexing text. Each document has an associated signature, a string of bits that captures in some sense the content of the document. To create the signature for a document, each term in it is used to generate several hash values, and the bits of the document signature corresponding to those values are set to one.

"OR"ing the hash strings of the terms in a document yields the signature of that document. An example is given in Tables 5-7.

 

document  text
1         Look, Simba. Everything the
2         light touches is our kingdom.
3         Sarabi and I didn't see you
4         at the presentation of Simba.

Table 5 Documents example

 

and               1000 0000 0010 0100

at                0000 0101 0000 0001

didn't            0010 0100 0000 1000

everything        0000 1010 0000 0000

i                 0000 1001 0010 0000

is                0100 0000 1000 1000

kingdom           0001 0010 0000 0001

light             0010 1000 0000 0100

look              1000 1000 0100 0000

of                0100 0100 0000 0010

our               0000 0010 0110 0000

presentation      0100 0000 0100 0010

sarabi            1010 1000 0000 0000

see               0011 0000 0100 0000

simba             0001 0101 0000 0000

the               1000 0000 0000 1001

touches           0000 0010 0110 0000

you               0100 0000 1010 0000

Table 6 Hash values of terms

 

document  text                           signature
1         Look, Simba. Everything the    1001 1111 0100 1001
2         light touches is our kingdom.  0111 1010 1110 1101
3         Sarabi and I didn't see you    1111 1101 1110 1100
4         at the presentation of Simba.  1101 0101 0100 1011

Table 7 Example signatures

The signature file facilitates Boolean operations on query terms. For instance, if a user searches for "didn't see you" ("didn't" AND "see" AND "you"), we look the three terms up in Table 6:

didn't            0010 0100 0000 1000

see               0011 0000 0100 0000

you               0100 0000 1010 0000

 

OR them          0111 0100 1110 1000

Comparing this string with the document signatures, only document 3 has a signature in which all the corresponding bits are 1. Hence document 3 contains all three search terms.

For single-term queries (or queries with few terms), however, the signature file suffers from false matches. If we look only for "see", both documents 2 and 3 match, but document 2 is actually a false result.

The signature file method becomes more effective as queries become more specific because queries involving the conjunction of several terms can check more bits in the signature file.
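Signature generation and conjunctive matching can be sketched as follows. The 16-bit width follows Table 6, but the number of hash positions per term and the hashing scheme here are assumptions for illustration; real signature files choose these parameters to control the false-match rate.

```java
import java.util.Random;

public class SignatureFile
{
    private static final int BITS = 16;   // signature width, as in Table 6
    private static final int HASHES = 6;  // bit positions set per term (assumed)

    /* Derive a term signature by setting several hashed bit positions. */
    public static int termSignature(String term)
    {
        int sig = 0;
        Random rnd = new Random(term.hashCode());  // deterministic per term
        for (int i = 0; i < HASHES; i++)
            sig |= 1 << rnd.nextInt(BITS);
        return sig;
    }

    /* A document signature is the OR of its term signatures. */
    public static int docSignature(String text)
    {
        int sig = 0;
        for (String term : text.toLowerCase().split(" "))
            sig |= termSignature(term);
        return sig;
    }

    /* A document MAY match if every query bit is set in its signature;
       a true match must still be confirmed against the text. */
    public static boolean mayMatch(int docSig, int querySig)
    {
        return (docSig & querySig) == querySig;
    }
}
```

By construction every document containing all the query terms passes the bit test; the converse does not hold, which is exactly the false-match problem described above.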

Though single-term queries may seem inadequate for a huge collection such as the Internet, they are nevertheless very widely used: average queries to web search engines contain only two or three terms. Since TV subtitle search tends to use even fewer terms to locate a program clip, this method is not adopted in this project, though it may be a strong way to index under some circumstances.

Moreover, signature files have a serious disadvantage in handling queries that include negated terms, namely false-match checking over a large undecided set; but no account need be taken of this, since subtitle search does not support the negation operator.

Another downside of signature files is that they cannot be used to support ranked queries.

4.1.3.3 Building an index table

The emphasis in this section is on building the index table in Microsoft Access. It is what we actually do for the Subtitles Search System.

 

/* create connection and prepare statements */
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:Concordance", "", "");
insertWordStmt = con.prepareStatement("insert into index (keyword, progId, startTime, blockNo, freq) " + "values (?, ?, ?, ?, 1);");
insertWordStmt.setString(2, prog);

updateFreqStmt = con.createStatement();

insertDB(s3,s1,block,insertWordStmt,updateFreqStmt);
      // insert terms appearing in the line into the index table

The input parameters of insertDB() are: a processed dialogue in which each term has been case folded, stripped of punctuation and stemmed; the start time of the dialogue; the block number; and the two statements.

 

private void insertDB(String s, String tim, int block,
                      PreparedStatement stmt, Statement updateStmt)
{
    StringTokenizer st = new StringTokenizer(s," ");
    String word = null, head = " ";

    while (st.hasMoreTokens())
    {
      word = st.nextToken();
      try
      {
        if (head.indexOf(" "+word+" ") >= 0)
        {   // the term already appeared earlier in the dialogue
          updateStmt.executeUpdate("update index set freq=freq+1 "+
                                    "where keyword = '"+word+"' "+
                                    "and startTime = '"+tim+"' ;");
        }
        else  // insert a new entry
        {
          stmt.setString(1,word);
          stmt.setString(3,tim);
          stmt.setInt(4,block);
          stmt.executeUpdate();
        }
        head = head+word+" ";
      }
      catch (SQLException ex)
      {
        System.err.println(ex.getMessage());
      }
    } // end while
}

When a word appears more than once in a dialogue, its freq is incremented rather than inserting a duplicate row, which would violate the primary key (keyword, progId, startTime) of the index table.

4.1.3.4 Segmentation – break down a program into blocks

When dealing with large amounts of video, it is desirable to enable access to smaller segments. The problem with segmenting video lies in determining what the segments should represent and where the divisions between segments should occur. This project develops a method whereby a digitized video sequence can be processed and sizable video sequences segmented into smaller meaningful units (called "blocks" in this project) that can be located and examined individually.

In order to perform this segmentation, we can exploit regularities that we know to be present within the data. These regularities can be simple things, such as a return to the shot of the news presenter before the introduction of the next story, interpolated commercials, or a scene shift in a movie. Often these regularities are cues that can be used to indicate the division points between segments.

A method is constructed to automatically segment large amounts of data by finding these cues and define the points at which the data streams can be broken into smaller blocks.

The distinctiveness of segmentation cues is tremendously influenced by the type of programme. Programmes such as television news and question-and-answer shows like "Who Wants to Be a Millionaire" have very distinct cues and hence are simpler to segment.

However, the Subtitles Search System is expected to provide a general-purpose segmentation method applicable to all kinds of TV programmes, so the commonest feature of division points, the duration of silence, is used in this method.

This class imports data from the pre-processed txt file into an index table in Microsoft Access. The subtitles are divided into several blocks for later ranking of search results. A six-second duration is set as the standard gap between two blocks, i.e. if the difference between the end time of one dialogue and the start time of the next (the silence between two subtitles) is greater than 6 seconds, a new block starts there.

 

private final float TIME_DIFF = 6.0f;  // minimum silence between two blocks

String line, s1, s2, s3;
int block = 0;
float startF = 0f, endF = 0f, oldEndF = 0f;

while ((line = fin.readLine()) != null)
{
    s1 = line.substring(TIME1_START, TIME1_END);  // start time
    s2 = line.substring(TIME2_START, TIME2_END);  // end time
    startF = parseTime(s1);
    endF   = parseTime(s2);
    if (startF - oldEndF > TIME_DIFF) block++;
    oldEndF = endF;
    // read dialogue to s3 and process s3

    insertDB(s3, s1, block, insertWordStmt, updateFreqStmt);
}

 

/* convert a time String to a float value of total seconds */
private float parseTime(String timeString)
{
    StringTokenizer st = new StringTokenizer(timeString, ":");
    String h = null, m = null, se = null;
    float hf = 0f, mf = 0f, sef = 0f;

    h  = st.nextToken();
    m  = st.nextToken();
    se = st.nextToken();
    hf  = Float.valueOf(h).floatValue();
    mf  = Float.valueOf(m).floatValue();
    sef = Float.valueOf(se).floatValue();
    return hf * 3600 + mf * 60 + sef;
}

The average time for one block is 44 seconds. This value could be set as the duration of a program clip when a search result is played.

The choice of the "silence" duration between two blocks influences the accuracy of the division. A suitable "silence" value can make the blocks approach a natural segmentation. Note that because stop words, and the resulting "empty" lines, are removed in pre-processing, a "silence" gap may not correspond to real silence. For example:

0:14:34.6* 0:14:36.0* Hurry, get on it!

0:14:38.2* 0:14:39.8* What about you?

0:14:39.8* 0:14:41.6* I’ll distract the pursuer.

After pre-processing (including stop words removal), it changes to:

0:14:34.6* 0:14:36.0* hurry

0:14:39.8* 0:14:41.6* distract pursue

Every word in the second line is a stop word. When the segmentation step runs, it treats the gap between 0:14:36.0 and 0:14:39.8 as "silence", which causes a block boundary to be inserted there whenever the "silence" threshold is set to a value less than 3.8 seconds (0:14:39.8 − 0:14:36.0).

The value of 6 seconds was selected after extensive testing to ensure the maximum similarity to natural segmentation.
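The silence-gap test on the example above can be checked with a small standalone sketch; parseTime is reimplemented here for illustration, so that the snippet is self-contained.

```java
// Standalone illustration of the silence-gap test from the example above.
public class GapDemo {
    // same conversion as parseTime() in IndexConstructor
    static float parseTime(String t) {
        String[] p = t.split(":");
        return Float.parseFloat(p[0]) * 3600
             + Float.parseFloat(p[1]) * 60
             + Float.parseFloat(p[2]);
    }

    public static void main(String[] args) {
        float gap = parseTime("0:14:39.8") - parseTime("0:14:36.0");
        // the gap is 3.8 s: smaller than the 6 s threshold, so no block break
        // here, but any threshold below 3.8 s would wrongly split this scene
        System.out.println(Math.round(gap * 10) / 10.0);
        System.out.println(gap > 6.0f ? "new block" : "same block");
    }
}
```

Running it confirms that the six-second threshold keeps the two subtitles in the same block despite the stop-word removal.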

4.2 Data structure

Relational data model is used in the Subtitles Search System.

4.2.1 Index table

 

keyword* | progID* | blockNo | time* | freq

Table 8 Index table design

There are various data structures for storing an index. In order to handle queries with SQL, an index table as illustrated in Table 8 is built in this project instead of an inverted file index.
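For contrast, the classic alternative, an inverted file mapping each keyword to a posting list, can be sketched in a few lines of Java. The names and sample postings below are purely illustrative and are not part of the system.

```java
import java.util.*;

// Minimal in-memory inverted index: keyword -> list of (progId, startTime) postings.
public class InvertedIndexDemo {
    static Map<String, List<String>> index = new HashMap<>();

    static void add(String keyword, String progId, String startTime) {
        index.computeIfAbsent(keyword, k -> new ArrayList<>())
             .add(progId + "@" + startTime);
    }

    public static void main(String[] args) {
        add("protect", "baku", "0:08:38.8");
        add("protect", "mkr0", "0:12:01.0");
        add("feed", "baku", "0:03:10.2");
        // lookup is a single hash probe instead of a SQL query
        System.out.println(index.get("protect"));
    }
}
```

The index table chosen for this project trades this direct lookup for the convenience of expressing grouping and ranking as SQL queries.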

The keyword column stores stemmed, non-stopword, lower-case words from the subtitles (Figure 7). The progID column indicates the TV programme in which the word appears; the programme ID is usually taken from the first four characters of the subtitles file name.

 

public IndexConstructor(String inputFile)
{
  int loc = inputFile.lastIndexOf('\\');
  prog    = inputFile.substring(loc + 1, loc + 5);  // program Id

  insertWordStmt = con.prepareStatement(
      "insert into index (keyword, progId, startTime, blockNo, freq) " +
      "values (?, ?, ?, ?, 1);");
  insertWordStmt.setString(2, prog);
  // ......
}

The time column is the start time of the dialogue in which the keyword appears. The freq column counts how many times the keyword appears in that dialogue; this is useful in ranking.

The primary key of the index table is (keyword, progID, time): sorting by keyword speeds up the search for a specific term, and only the triple (keyword, progID, time) uniquely identifies an entry.

Besides the inverted file index and the index table of this system, conventional text retrieval systems also save memory by storing the lexicon as string pointers, by front coding of shared common prefixes, or by minimal perfect hashing.

4.2.2 Granularity of index table

The granularity of an index is the accuracy to which it identifies the location of a term. A coarse-grained index is used in the Subtitles Search System: since the start time of a programme clip for playback should be a little earlier than the time found in the index table, the exact word position is of little use for playback, though it could be used in ranking measurement.

A fine-grained index could look like this:

 

keyword* | progID* | time* | freq | nth_word

Table 9 Fine-grained index table

The attribute nth_word indicates the exact location of the keyword in the dialogue, which can be used to calculate data relevancy involving adjacency in ranked query.

The granularity of the index in SSS is block-level considering the efficiency of processing and ranking.

4.2.3 Program table

Another important table for video clip retrieval in the Subtitles Search System is the program table. It contains information about every TV programme in the system's collection and is used to find the corresponding video file name and programme title.

The data structure of this table is shown in Table 10.

 

ProgID* | title | duration | number_of_blocks | fileName

Table 10 Program table data structure

4.3 Implementation of administration part

Adequacy.java is activated by the form on the administration page. It executes a system copy command to copy the mpg file to the video collection directory, updates the program table by inserting the new programme's information, and indexes the subtitles file using IndexConstructor.class.

 

/* part of Adequacy.java */
String subtFile  = request.getParameter("subtitlesFile");
String videoFile = request.getParameter("videoFile");
String title     = request.getParameter("title");
Connection con   = null;
PreparedStatement insertTitle = null;

String[] cmd = new String[6];
cmd[0] = "cmd";
cmd[1] = "/c";
cmd[2] = "copy";
cmd[3] = "/B";
cmd[4] = videoFile;
int loc = videoFile.lastIndexOf('\\');
String videoFileName = videoFile.substring(loc + 1);
cmd[5] = "C:\\Documents and Settings\\n0700958\\mov\\" + videoFileName;

Runtime runtime = Runtime.getRuntime();
Process process = runtime.exec(cmd);  // system copy command

IndexConstructor index = new IndexConstructor(subtFile);  // index subtitles

try
{
  Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
  con = DriverManager.getConnection("jdbc:odbc:Concordance", "", "");
  insertTitle = con.prepareStatement(
      "insert into program (title,progId,fileName) values (?,?,?);");  // update program table
  insertTitle.setString(1, title);
  insertTitle.setString(3, videoFileName);

  loc = subtFile.lastIndexOf('\\');
  String prog = subtFile.substring(loc + 1, loc + 5);
  insertTitle.setString(2, prog);

  insertTitle.executeUpdate();
}
catch (Exception ex)
{
  System.err.println(ex.getMessage());
}

4.4 Process query terms

Queries to the Subtitles Search System are similar to web search requests, consisting of just a few terms (3.0 terms per query on average); exact phrase search is not supported.

Since the default boolean operation of the Subtitles Search System is "OR" and no other boolean operation is supported, a phrase query such as "tea time" is treated as "tea" OR "time".

The user's query is processed before being sent to the search engine. Query processing involves case folding and stemming, as in subtitles pre-processing.

No stop words are removed here, since they have already been removed from the index table: a stop word in a query simply finds no match in the index, so stopping it again is unnecessary. Stopping in either subtitles pre-processing or query processing is enough, and doing it once during subtitles processing is clearly more efficient.

In this sense, the boolean operators AND, OR and NOT are all stopped. Conventional boolean symbols such as +, ( and ) are also cleared out as punctuation during query processing. However, "-" is treated as a hyphen and is not removed when there is no space before or after it: if "tea-time" is a query term, the exact word "tea-time" is searched for; if "tea - time" is the query, "tea" OR "time" is searched and the "-" is ignored.
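A regex-based sketch of this punctuation and hyphen rule might look as follows. This is one possible implementation, not the system's actual code, and it only considers letters, digits and hyphens.

```java
public class QueryNormalizer {
    // Lower-case the query, drop boolean/punctuation symbols, and keep a hyphen
    // only when it joins two letters ("tea-time" survives, "tea - time" does not).
    static String normalize(String q) {
        q = q.toLowerCase();
        q = q.replaceAll("(?<![a-z])-|-(?![a-z])", " ");  // lone hyphens
        q = q.replaceAll("[^a-z0-9\\- ]", " ");           // other punctuation, incl. + ( )
        return q.trim().replaceAll("\\s+", " ");          // collapse whitespace
    }

    public static void main(String[] args) {
        System.out.println(normalize("tea-time"));     // kept as one term
        System.out.println(normalize("tea - time"));   // split into two terms
        System.out.println(normalize("tea + (time)")); // boolean symbols removed
    }
}
```

Stemming and the subsequent OR search would then operate on the whitespace-separated terms of the normalized string.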

 

<form action='stem.jsp' method='get'>
    <TABLE cellSpacing=0 cellPadding=5 border=0>
      <TBODY><TR>
        <TD id=t3><B><I>Search&nbsp;</I></B></TD>
        <TD><INPUT id=input maxLength=300 size=30 name="originalWord"></TD>
        <TD><INPUT type='submit' height=19 value='Search programs' width=81 border=0>
        </TD>
      </TR></TBODY>
    </TABLE>
  </form>

 

/********* stem.jsp *********/
<html>
<body bgcolor="white">
<%! String root; %>
<jsp:useBean id="s" scope="page" class="eunice.Stemmer" />
<jsp:setProperty name="s" property="originalWord" />

the original word is
<% out.println(s.getOriginalWord()); %>
<br>

The stemmed root (using bean) is
<%
  root = s.getRoot();
  out.println(root);
  response.sendRedirect("http://localhost:8080/examples/SearchEngineServ?mt=" + root);
%>
</body>
</html>

In the search HTML page, when the user enters a query and clicks the search button, stem.jsp is triggered; it passes the query terms to eunice.Stemmer and gets back the processed roots. eunice.Stemmer contains a method that processes the query (case folding, punctuation removal and stemming). The JSP then activates the search engine servlet and passes the roots to it.

4.5 Searching (search engine servlet)

Two working tables play important roles in searching in the Subtitles Search System: the foundWords table and the rankingInfo table.

The foundWords table stores each keyword found in the index table, together with its programme Id, blockNo and frequency in the dialogue in which it appears. For instance, if the user queries "feed it and protect it", only "feed" OR "protect" is searched, since the other terms are regarded as stop words and do not exist in the index. Table 11 lists the results for those two terms.

 

foundWords
keyword | progId | blockNo | freq
feed    | baku   | 5       | 1
feed    | baku   | 11      | 1
feed    | baku   | 16      | 1
protect | mkr0   | 17      | 1
protect | baku   | 11      | 1
protect | baku   | 24      | 1

Table 11 An example of foundWords table

 

PreparedStatement seleStmt = con.prepareStatement(
      "select keyword,progId,blockNo,freq from index where keyword=?;");
PreparedStatement inseStmt = con.prepareStatement(
      "insert into foundWords values (?,?,?,?);");

StringTokenizer st = new StringTokenizer(keywords, " ");
while (st.hasMoreTokens())
{
      word = st.nextToken();
      seleStmt.setString(1, word);
      ResultSet rs = seleStmt.executeQuery();
      while (rs.next())
      {
          String prog   = rs.getString("progId");
          int block     = rs.getInt("blockNo");
          int frequency = rs.getInt("freq");
          inseStmt.setString(1, word);
          inseStmt.setString(2, prog);
          inseStmt.setInt   (3, block);
          inseStmt.setInt   (4, frequency);
          inseStmt.executeUpdate();
      }
      if (rs != null) rs.close();
}

This piece of code searches for the keywords and writes the results into the foundWords table.

The rankingInfo table groups the results in foundWords by keyword, progId and blockNo, summing the frequency and computing a block weight for the keyword. Because no term appears more than once in the same block of the same programme in foundWords, rankingInfo has the same number of entries as foundWords.

 

rankingInfo
keyword | progId | blockNo | sumOfFreq | weightInBlock
feed    | baku   | 5       | 1         | 2.40
feed    | baku   | 11      | 1         | 2.40
feed    | baku   | 16      | 1         | 2.40
protect | baku   | 11      | 1         | 2.40
protect | baku   | 24      | 1         | 2.40
protect | mkr0   | 17      | 1         | 2.40

Table 12 An example rankingInfo table

 

/*** input table: foundWords, output table: rankingInfo ***/
private void groupBlock(Connection con)
{
    String sql = "SELECT DISTINCTROW keyword, progId, blockNo, Sum(freq) AS sumOfFreq " +
                 "FROM foundWords GROUP BY keyword, progId, blockNo;";
    Statement stmt = null;
    PreparedStatement inseStmt = null;
    ResultSet rs = null;

    try
    {
      stmt = con.createStatement();
      inseStmt = con.prepareStatement(
          "insert into rankingInfo values (?,?,?,?,?);");
      rs = stmt.executeQuery(sql);
      String word = null;
      int sumOfFreq = 0;
      while (rs.next())
      {
        word = rs.getString("keyword");
        sumOfFreq = rs.getInt("sumOfFreq");

        inseStmt.setString(1, word);
        inseStmt.setString(2, rs.getString("progId"));
        inseStmt.setInt(3, rs.getInt("blockNo"));
        inseStmt.setInt(4, sumOfFreq);
        inseStmt.setDouble(5, calcWordWeight(con, word) * sumOfFreq);

        inseStmt.executeUpdate();
      }
    }
    //......
}

Note that block 11 of programme "baku" contains both "feed" and "protect" (the corresponding rows in Table 12). Table 13 shows the result of grouping by programme ID and blockNo and ordering by descending relevance.

rankBlocks is not a real table but a ResultSet object. The SQL generating rankBlocks is as follows:

 

SELECT progId, blockNo, Sum(weightInBlock) AS wordsWeight

FROM rankingInfo

GROUP BY progId, blockNo

ORDER BY Sum(weightInBlock) DESC;

 

rankBlocks
progId | blockNo | wordsWeight
baku   | 11      | 4.79579054559674
baku   | 24      | 2.39789527279837
mkr0   | 17      | 2.39789527279837
baku   | 16      | 2.39789527279837
baku   | 5       | 2.39789527279837

Table 13 An example of ResultSet rankBlocks

 

/********* input table: rankingInfo *********/
private String rankBlocks(Connection con)
{
    Statement stmt       = null;
    ResultSet rs         = null;
    ResultSet progTimeRs = null;

    try
    {
      String sql = "SELECT progId, blockNo, Sum(weightInBlock) AS wordsWeight " +
                   "FROM rankingInfo GROUP BY blockNo, progId " +
                   "ORDER BY Sum(weightInBlock) DESC;";
      stmt = con.createStatement();

      rs = stmt.executeQuery(sql);
      //......
    }
    //......
}

Table 13 lists the ranked results of the query "feed it and protect it". The search engine can then look up the start time of each rankBlocks entry in the index table, and the corresponding video file name and programme title in the program table. These steps are discussed in later sections.
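The final lookup step can be sketched as follows. Here a HashMap stands in for the program table, and the titles and file names are made up purely for illustration; the real system fetches them, plus the block's start time, via SQL.

```java
import java.util.*;

// Hypothetical sketch: join ranked blocks with programme information.
public class ResultPageDemo {
    public static void main(String[] args) {
        // stand-in for the program table: progId -> "title|fileName" (made-up values)
        Map<String, String> programTable = new HashMap<>();
        programTable.put("baku", "programme baku|baku.mpg");
        programTable.put("mkr0", "programme mkr0|mkr0.mpg");

        // rankBlocks rows in descending weight order: {progId, blockNo}
        String[][] rankBlocks = { {"baku", "11"}, {"baku", "24"}, {"mkr0", "17"} };

        for (String[] row : rankBlocks) {
            String[] info = programTable.get(row[0]).split("\\|");
            // the real servlet also looks up the block's start time in the index table
            System.out.println(info[0] + " (block " + row[1] + ", file " + info[1] + ")");
        }
    }
}
```

Each printed line corresponds to one hyperlink on the result page, in descending order of relevance.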

4.6 Ranking

We show that statistical methods developed for text retrieval are also effective for retrieving multimedia documents.

A ranked query is applied in the Subtitles Search System instead of a Boolean query. In a ranked query, a list of keywords is provided and the system returns a set of blocks in order of decreasing relevance, measuring the similarity of each result to the query.

4.6.1 Ranking strategies

4.6.1.1 Coordinate matching

Coordinate matching counts the number of query terms that appear in the document (i.e. block in SSS). There are two drawbacks of this method.

First, long documents are favored over short ones, since by virtue of length alone they are more likely to contain a wide selection of the query terms. This problem exists in the TV Subtitles Search System too, because block lengths vary: some blocks are very long because they contain no obvious "silent" gap, and others very short because they contain too many.

Second, common terms (not stop words) appearing in the query tend to discriminate unfairly against documents that do not happen to contain them. This is solved in the Subtitles Search System by word weighting.

4.6.1.2 Statistical method

This method overcomes the drawbacks of coordinate matching by taking document length and term weight into account. The cosine measure is a typical statistical method, but it is not suitable here given the data structure of the Subtitles Search System.

To measure the similarity of a result to a query in this system, three factors need to be considered: term frequency, weight and adjacency.

The weight should reflect term scarcity. Normally, the less frequently a word appears in a document, the greater its weight. A document with enough occurrences of a common term (not a stop word) would always be ranked first if the query contains that term, irrespective of the other words; emphasis should therefore be given to scarce terms. This can be achieved by weighting terms according to their frequency, i.e. the weight of a term can be calculated as

Weight = 1/freq

This formula gives a term with freq=1 twice the weight of a term with freq=2. To soften this effect, we revise the formula as

Weight = ln(1+N/freq)

Equation 1

where N is the number of blocks in the collection, i.e. the number of blocks of a TV programme. For example, if a programme has 30 blocks, the weight of a term with freq=1 is 3.434, while the weight of a term with freq=2 is 2.773.
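The worked example can be reproduced directly from Equation 1:

```java
public class WeightDemo {
    // Equation 1: weight = ln(1 + N / freq), N = number of blocks in the programme
    static double weight(int numberOfBlocks, int freq) {
        return Math.log(1 + (double) numberOfBlocks / freq);
    }

    public static void main(String[] args) {
        // the worked example from the text: a programme with 30 blocks
        System.out.println(Math.round(weight(30, 1) * 1000) / 1000.0);  // 3.434
        System.out.println(Math.round(weight(30, 2) * 1000) / 1000.0);  // 2.773
    }
}
```

Note how the revised formula compresses the ratio between the two weights: 3.434/2.773 ≈ 1.24 rather than the factor of 2 produced by Weight = 1/freq.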

An alternative is to use word weights based on a complete corpus of the English language. This would save weight-calculation time, since the weight of a specific word could be looked up in an existing word-weight database. However, calculating weights on the text corpus that parallels the videos indicates a word's weight in its context more precisely.

The ranked query in the Subtitles Search System is actually a Boolean "OR" operation with the statistical ranking strategy applied to the results.

4.6.2 Implementation of statistical ranking strategy

The following method calcWordWeight() implements the statistical ranking method in Equation 1.

 

private double calcWordWeight(Connection con, String word)
{
    int numberOfBlocks = 0;  // number of blocks in a program
    int wordTotFreq = 0;     // total frequency of a word in the program
    double weight = 0d;      // weight of the parameter word

    Statement stmt = null;
    PreparedStatement findTotFreqStmt = null;
    ResultSet rs = null;
    try
    {
      String sql = "select blockNo from index order by blockNo DESC;";
      stmt = con.createStatement();
      stmt.setFetchSize(1);
      rs = stmt.executeQuery(sql);
      if (rs.next())
        numberOfBlocks = rs.getInt("blockNo") + 1;  // blockNo starts from 0

      sql = "SELECT DISTINCTROW keyword, Sum(freq) AS sumOfFreq " +
            "FROM foundWords WHERE keyword=? GROUP BY keyword;";
      findTotFreqStmt = con.prepareStatement(sql);
      findTotFreqStmt.setString(1, word);
      rs = findTotFreqStmt.executeQuery();
      if (rs.next())
      {
        wordTotFreq = rs.getInt("sumOfFreq");
      }
      else
        System.err.println("No index entries found");
    }
    // catch()

    weight = Math.log(1 + ((double) numberOfBlocks) / wordTotFreq);
    return weight;
}  // end calcWordWeight()

numberOfBlocks is set to the maximal blockNo + 1 because blockNo in the index table starts from 0.

 


In method groupBlock() (shown in Section 4.5), calcWordWeight() is called to calculate a block weight for a specific keyword. This value is used later for ordering.

4.7 Playback

In this section, two media APIs are discussed. Both were tried during the development phase: the Java Media Framework was used at first, and the Windows Media API was finally adopted for its simple usage.

Though it is not used in the final system, we discuss JMF because we used it to implement some working control parts, and because it offers more control and depth than Windows Media, much like the long-running contention between the Windows and Unix operating systems.

4.7.1 JMF (Java Media Framework) 2.1.1

JMF API supports capturing, playback, streaming and transcoding of audio, video and other time-based media. It also provides a plug-in architecture enabling developers and technology providers to customise and extend JMF functionality.

4.7.1.1 Data sources

JMF data sources can be categorized according to how data transfer is initiated: Pull Data-Source and Push Data-Source.

In Pull data-source, the client initiates the data transfer and controls the flow of data from pull data-sources. Established protocols for this type of data include HTTP and FILE.

In Push data-source, the server initiates the data transfer and controls the flow of data from a push data-source. Push data-sources include broadcast media, multicast media, and video-on-demand. For broadcast data like TV programmes, one protocol is the Realtime Transport Protocol (RTP).

In this project, a pull data-source is used in the programme clip playback part rather than a push data-source, because TV signals are captured and saved to files using a Hauppauge TV card, and processing and searching are all based on these files, not on the TV broadcast itself.

The reason to distinguish these two data sources is that the user's level of control depends on the type of data source being presented. For example, an mpg file can be repositioned, fast-forwarded or replayed, while broadcast media is under server control and cannot be repositioned (except under some Video-On-Demand protocols).

Since the parts of processing and searching are based on media FILEs and their corresponding subtitles FILEs instead of broadcast media, users have full control on the media playback.

 

<HTML>
  <HEAD>
    <TITLE>Video Applet</TITLE>
  </HEAD>
  <BODY>
    <APPLET CODE=TVPlayer WIDTH=240 HEIGHT=180>
        <PARAM NAME=FILE VALUE="mov/CINDY.MOV">
        <param name=time value="4.0d">
    </APPLET>
  </BODY>
</HTML>

 

/********* TVPlayer.java *********/
String mediaFile = null;  // URL for media file
URL url = null;           // URL for doc containing applet
URL codeBase = getDocumentBase();

/* Get the media filename info. The applet tag should contain the path to the
   source media file, relative to the html page. */

if ((mediaFile = getParameter("FILE")) == null)
      Fatal("Invalid media file parameter");
try
{
      // Create an url from the file name and the url to the
      // document containing this applet.
      if ((url = new URL(codeBase, mediaFile)) == null)
        Fatal("Can't build URL for " + mediaFile);

      // Create an instance of a player for this media
      if ((player = Manager.createPlayer(url)) == null)
        Fatal("Could not create player for " + url);
}

TVPlayer.java gets the file name (and path) from the applet tag and creates a player on that data source.

4.7.1.2 Frame Positioning and clip locating

JMF supports frame positioning through the FramePositioningControl. The user can seek to a particular frame of a video, which makes it easy to set the start position to the beginning of a particular frame without having to specify the exact media time that corresponds to that position.

To set the frame position, the FramePositioningControl seek method is called. When you seek to a frame, the Player object’s media time is set to the value that corresponds to the beginning of that frame and a MediaTimeSetEvent is posted.

Some Players can convert between media times and frame positions. You can use the FramePositioningControl mapFrameToTime and mapTimeToFrame methods to access this information. There is not a one-to-one correspondence between media times and frames — a frame has a duration, so several different media times might map to the same frame.

This feature is useful for locating clip in the Subtitles Search System.

The locating function is implemented by the method setMediaTime() in the system.

 

if ((timeStr = getParameter("time")) == null)
    Fatal("Invalid starting time parameter");
else
{
    double startTime = (new Double(timeStr)).doubleValue();
    t = new Time(startTime);
}

 

public synchronized void controllerUpdate(ControllerEvent event)
{
    if (player == null) return;
    if (event instanceof RealizeCompleteEvent)  // Prefetching
    {
      if ((visualComponent = player.getVisualComponent()) != null)
        add("Center", visualComponent);
      if ((controlComponent = player.getControlPanelComponent()) != null)
        add("South", controlComponent);
      validate();
    }
    else if (event instanceof PrefetchCompleteEvent)  // Prefetched
    {
      player.setMediaTime(t);  // first playback
      player.start();
    }
    else if (event instanceof StartEvent)
    {
      player.setMediaTime(t);  // when "play" button clicked
    }
    else if (event instanceof EndOfMediaEvent)
    {
      player.stop();
      player.deallocate();
    }
}  // end controllerUpdate()

TVPlayer gets the time parameter from the HTML, converts it to a Time object, and sets the media time when a PrefetchCompleteEvent or StartEvent occurs. The former occurs when the player plays the video for the first time; the latter occurs when the user clicks the play button on the control panel.

This ensures that whether the player plays a video for the first time or replays it, playback always starts from the located position where the query terms appear. In the final version implemented with Windows Media Player, however, only the first playback starts from the located position; if the user replays the video, it starts from the beginning of the programme.

4.7.1.3 Compositing

Compositing time-based media is the process of combining multiple tracks of data onto a single presentation medium. For example, overlaying text on a video presentation is one common form of compositing. A device that performs compositing can be abstracted as a renderer that can receive multiple tracks of input data.

Compositing could be used in this system to overlay subtitles on video clips (not achieved in this version; left to follow-on work).

4.7.2 Windows Media Player

The most attractive advantage of Windows Media Player is its ease of use. HTML like the following can easily play back a video file on a web page; it is generated by the Playback class according to the user's selection on the result page.

 

<html>
<head>
  <title>search TV programs</title>
</head>
<body bgcolor="white">
<OBJECT ID=WinMedia width=240 height=250
  classid=CLSID:22D6F312-B0F6-11D0-94AB-0080C74C7E95
  codebase=http://activex.microsoft.com/activex/controls/mplayer/en/nsmp2inf.cab#Version=5,1,52,701
  standby="Loading Microsoft® Windows® Media Player components..."
  type=application/x-oleobject>
  <PARAM NAME=FileName VALUE=mov/class.mpg>
  <PARAM NAME=AutoStart Value=True>
  <PARAM NAME=ShowStatusBar VALUE=True>
  <param name=currentPosition value=58.6>
</OBJECT>
</body>
</html>

 

/***** part of Playback.java *****/
public void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException
{
    String startTime = request.getParameter("startTime");
    String videoFile = request.getParameter("file");

    response.setContentType("text/html");
    PrintWriter out = response.getWriter();
    out.println("<html>");
    out.println("<head>");
    out.println("<title>search TV programs</title>");
    out.println("</head>");

    out.println("<body bgcolor=\"white\">");
    out.println("<OBJECT ID=WinMedia width=240 height=250 classid=CLSID:22D6F312-B0F6-11D0-94AB-0080C74C7E95 " +
        "codebase=http://activex.microsoft.com/activex/controls/mplayer/en/nsmp2inf.cab#Version=5,1,52,701 " +
        "standby=\"Loading Microsoft Windows Media Player components...\" " +
        "type=application/x-oleobject>\n" +
        "<PARAM NAME=FileName VALUE=mov/" + videoFile + ">\n" +
        "<PARAM NAME=AutoStart Value=True>\n" +
        "<PARAM NAME=ShowStatusBar VALUE=True>\n" +
        "<param name=currentPosition value=" + startTime + ">\n" +
        "</OBJECT>");

    out.println("</body>");
    out.println("</html>");
}

The part of SearchEngineServ that activates Playback is the following code. It generates the result page, in which every result is a hyperlink to the Playback class with its video file name and start time as parameters.

 

/***** part of SearchEngineServ.java *****/
while (rs.next())
{
      findTimeStmt.setString(1, rs.getString("progId"));
      findTimeStmt.setInt(2, rs.getInt("blockNo"));

      progTimeRs = findTimeStmt.executeQuery();
      if (progTimeRs.next())
      {
        String prog      = progTimeRs.getString("progId");
        String searchRst = progTimeRs.getString("startTime");
        String tim       = searchRst.substring(searchRst.lastIndexOf(':') + 1);

        progAndTime = progAndTime + "<br>\n" +
            "<a href=http://localhost:8080/examples/Playback?startTime=" + tim +
            "&file=class.mpg>" + prog + " " + searchRst + "</a>";
      }
      else
          progAndTime = "No index entries found";
}

4.8 Code reusability

Many common information retrieval tasks have been solved to an acceptable degree by previous work and should be reused. Instead of writing a new stemmer, search engine or player from scratch, we should have a store of reusable tools that can be plugged into new systems with minimal effort. Such reuse is much less common than it should be, often because of installation and integration problems that have to be solved afresh in each case.

In the development of the Subtitles Search System, we implemented the stemmer based on the C++ code of the Porter stemmer [port]. This saved a lot of effort on linguistic rules.

Code reusability was considered in this project. Since the system is implemented as a Java-based hybrid (Java application, servlets, JavaBeans and JSP), most of its modules can be reused interoperably in other software.

Specifically, the JavaBeans component module, Stemmer, has a powerful mechanism (setter and getter methods to access properties) for reuse.

It is generally acknowledged that Java is a more complete environment for developing and reusing code than C++ and other traditional development environments. This is one of the reasons why it was chosen as the programming language for this project.