visual distances between key frames are calculated as the weighted sum of predefined dissimilarity functions, one for each descriptor. Similarities between shot key frames are determined by hierarchical clustering methods. To this end, the video sequences are segmented into shots, and for each shot a key frame is generated to which the visual descriptors are applied. In addition, motion features are extracted from the video sequence. To compare sequences of different lengths, the motion activity is transformed into the frequency domain. The similarity of the motion activity is determined by the normalized cross-correlation function Rxy(t) at t=0. The motion direction characteristics of each video are captured by histograms of the quantized angles of its motion vectors. The dissimilarity of two video sequences is measured by the weighted sum of the L1-norms of the differences between their direction histograms. Textual similarities of video sequences are measured by applying probabilistic latent semantic analysis (pLSA) to the associated metadata, namely keywords, descriptions, and comments. From these metadata, document-term matrices are extracted after the elimination of stop words. The pLSA reduces the document-term matrix of a video sequence to a few concepts. Textual similarities between metadata are then measured by the cosine
similarity of these concept vectors. Textual and visual similarities are fused by a weighted sum of their distance matrices. The resulting distances can be visualized by applying the FastMap algorithm three times to generate three coordinates per data point.
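
The motion comparison described above can be illustrated with a minimal sketch. The text does not say how the frequency-domain representation is made length-independent, so the sketch assumes the spectrum is sampled at a fixed number of normalized frequencies. The function names, the bin count, and the sample values are hypothetical, and the weighting over several direction histograms is omitted because the text does not specify it.

```python
import cmath
import math


def magnitude_spectrum(activity, bins=8):
    """Magnitude of the discrete-time Fourier transform of a motion-activity
    sequence, sampled at `bins` fixed normalized frequencies in [0, 0.5).
    Sequences of any length map to vectors of equal length (an assumption;
    the source does not describe this step)."""
    n = len(activity)
    spectrum = []
    for j in range(bins):
        freq = 0.5 * j / bins  # cycles per frame
        coeff = sum(a * cmath.exp(-2j * math.pi * freq * t)
                    for t, a in enumerate(activity))
        spectrum.append(abs(coeff) / n)
    return spectrum


def ncc_at_zero(x, y):
    """Normalized cross-correlation of two equal-length vectors at lag 0."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator if denominator else 0.0


def l1_histogram_distance(h1, h2):
    """L1-norm of the difference between two direction histograms,
    each normalized to unit sum first."""
    s1 = sum(h1) or 1.0
    s2 = sum(h2) or 1.0
    return sum(abs(a / s1 - b / s2) for a, b in zip(h1, h2))


# Hypothetical per-frame motion activity of two clips of different length.
short_clip = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7]
long_clip = [0.2, 0.7, 0.4, 0.9, 0.1, 0.8, 0.3, 0.6, 0.2, 0.9]

similarity = ncc_at_zero(magnitude_spectrum(short_clip),
                         magnitude_spectrum(long_clip))
print(similarity)
print(l1_histogram_distance([4, 1, 0, 3], [2, 2, 2, 2]))
```

Because magnitude spectra are non-negative, the zero-lag correlation in this sketch lies between 0 and 1, with 1 indicating identical spectral shape.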

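The textual comparison and the fusion step can be sketched as follows. The concept vectors are assumed to be already available from pLSA. Turning cosine similarity into a distance as 1 minus the similarity is an assumption of this sketch, as are the equal fusion weights and all sample values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def textual_distance_matrix(concept_vectors):
    """Pairwise textual distances, taken here as 1 - cosine similarity
    (an assumption; the source only names the cosine similarity)."""
    return [[1.0 - cosine_similarity(u, v) for v in concept_vectors]
            for u in concept_vectors]


def fuse(matrices, weights):
    """Weighted sum of equally sized distance matrices."""
    size = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(size)]
            for i in range(size)]


# Hypothetical concept vectors for three videos (one row per video).
concepts = [[0.9, 0.1, 0.0],
            [0.8, 0.2, 0.0],
            [0.0, 0.1, 0.9]]

# Hypothetical visual distance matrix for the same three videos.
visual = [[0.0, 0.2, 0.7],
          [0.2, 0.0, 0.6],
          [0.7, 0.6, 0.0]]

textual = textual_distance_matrix(concepts)
fused = fuse([visual, textual], [0.5, 0.5])  # equal weights, chosen arbitrarily
for row in fused:
    print([round(value, 3) for value in row])
```

In this example the first two videos share almost the same concept mixture, so their fused distance stays small, while the third video is far from both.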

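The visualization step can be illustrated with a minimal FastMap sketch that derives one coordinate per pass from a distance matrix, so that three passes give three coordinates per data point. The pivot heuristic shown here (start from object 0, take the farthest object, then the object farthest from that one) is one common simplification and is an assumption, as are the function name and the sample points.

```python
import math


def fastmap(dist, dims=3):
    """Embed n objects into `dims` coordinates from an n x n distance matrix.
    Each pass projects all objects onto the line through two distant pivots
    and then removes that component from the remaining distances."""
    n = len(dist)
    # Squared distances; these are reduced after every pass.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Pivot heuristic: farthest object from object 0, then farthest from it.
        b = max(range(n), key=lambda i: sq[0][i])
        a = max(range(n), key=lambda i: sq[b][i])
        d_ab_sq = sq[a][b]
        if d_ab_sq <= 1e-12:
            break  # no residual distance left to explain
        d_ab = math.sqrt(d_ab_sq)
        # Projection onto the pivot line via the law of cosines.
        x = [(sq[a][i] + d_ab_sq - sq[b][i]) / (2.0 * d_ab) for i in range(n)]
        for i in range(n):
            coords[i][k] = x[i]
        # Residual squared distances in the space orthogonal to the pivot line.
        sq = [[max(sq[i][j] - (x[i] - x[j]) ** 2, 0.0) for j in range(n)]
              for i in range(n)]
    return coords


# Hypothetical data: distances between five points in 3-D space.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3), (1, 1, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]

coords = fastmap(dist, dims=3)
for row in coords:
    print([round(value, 3) for value in row])
```

For distances that are truly Euclidean in three dimensions, as in this example, the three coordinates reproduce the input distances. For fused distances that are not Euclidean, the embedding is only an approximation.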
