Implementation notes: Semantic MediaWiki

From CODECS Dev
Semantic MediaWiki (SMW) offers four different search engines or search engine configurations:
  1. SQL with the standard setup of SQLStore
  2. SQL with Full-Text Search (FTS) enabled for the SQLStore, which is still considered experimental
  3. Elasticsearch with ElasticStore
  4. SPARQL with SPARQLStore (ignored below)

While effort has been made to make the query syntax as system-agnostic as possible, there are still many differences in behavioural outcome and in the range of features available. For this extension, it means that its implementation of matching strategies (filters) must take these differences into account. A particular challenge with Full-Text Search is that it works conditionally, in that it may fall back to regular SQL (LIKE) behaviour.

SQL (standard)

SQL with Full-Text Search

SQL's Full-Text Search (FTS) is not an ideal approach for powering autocompletion, but it offers a couple of advantages over SMW's standard search behaviour, notably case and accent folding. Some effort has been made to harness its strengths in a way that is hopefully as user-friendly as possible.

When FTS is enabled for SQL (https://www.semantic-mediawiki.org/wiki/Help:Full-text_search), SMW supports two modes of behaviour for string searching, each represented by a different operator prefix placed after the double colon (::):

  1. standard behaviour, now represented by like: rather than the standard tilde (~). E.g. [[Has title::like:Moun*]]
  2. Full-Text Search, now represented by the tilde (~). E.g. [[Has title::~Moun*]]. Additional special syntax includes +/- (IN BOOLEAN MODE) and double quotes for exact phrase matching.
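The choice between the two operators can be sketched as follows. This is a minimal illustration in Python, not code from the extension: the function name build_condition and its parameters are hypothetical, and it assumes FTS is enabled, so that the tilde triggers Full-Text Search rather than the standard behaviour.

```python
def build_condition(prop: str, value: str, use_fts: bool) -> str:
    """Return an SMW query condition for a string search on `prop`.

    Assumes FTS is enabled for the SQLStore: '~' then means Full-Text
    Search and 'like:' means the standard (SQL LIKE) behaviour.
    """
    operator = "~" if use_fts else "like:"
    return f"[[{prop}::{operator}{value}]]"


# Reproduces the two examples given in the list above.
print(build_condition("Has title", "Moun*", use_fts=False))
print(build_condition("Has title", "Moun*", use_fts=True))
```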

What does this mean for FTS in a search box?

  • Each word, or each new set of consecutive characters, that the user starts typing is evaluated individually: if its length is at least the number of characters configured through $smwgFulltextSearchMinTokenSize (default: 3) and it is not a stopword, it will be matched against a token. If it is shorter, or is indexed as a stopword, it may be ignored. [...]
  • A side-effect of tokenisation with FTS is that by default, the order of appearance is not taken into account: notably, it is not possible to match the full string only at its beginning.
  • However, phrase matching is possible, to an extent, by putting the phrase between double quotes. Again, shorter strings not treated as tokens will be ignored.
    • The trade-off is that it does not support the use of asterisks for partial matching on a token.
  • Another difference is in the evaluation of multiple tokens. A search for "Mount Badon" has a match if the string contains either "Mount" or "Badon". To find matches only where both words are included (AND, not OR), each token must be prefixed with a boolean plus sign: "+Mount +Badon".
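The token rules above can be illustrated with a short sketch. It is written in Python for readability and does not reflect the extension's actual code: the names is_token and to_boolean_query are hypothetical, the stopword set is a placeholder rather than the list used by the database, and leaving short words unprefixed (instead of dropping them) is an assumption.

```python
MIN_TOKEN_SIZE = 3  # mirrors the default of $smwgFulltextSearchMinTokenSize
STOPWORDS = {"the", "for", "with"}  # placeholder set, not the real stopword list


def is_token(word: str) -> bool:
    """A word counts as a token if it is long enough and not a stopword."""
    return len(word) >= MIN_TOKEN_SIZE and word.lower() not in STOPWORDS


def to_boolean_query(user_input: str) -> str:
    """Prefix every real token with '+' so that all tokens must match (AND).

    Words that are too short or are stopwords are passed through without a
    prefix, since FTS may ignore them anyway.
    """
    parts = []
    for word in user_input.split():
        parts.append("+" + word if is_token(word) else word)
    return " ".join(parts)


print(to_boolean_query("Mount Badon"))      # +Mount +Badon
print(to_boolean_query("Battle of Badon"))  # +Battle of +Badon
```

Keeping the token test in its own function means the same check can decide where a boolean prefix is safe to add and where it is not.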

What does this mean for our implementation?

  • It is up to site admins to determine what is most desirable in their use case when it comes to:
    • the search operator: because the site admin is responsible for the query pattern in the profile, it is also up to the site admin to decide on a tilde or like:.
    • the 'substrpattern', which is otherwise used to determine the position of asterisks and still needs more thought:
      • stringprefix: not supported by FTS.
      • tokenprefix / allchars: there is no such distinction in FTS.
  • Because the ordinary user of a search input should not be expected to be familiar with the nitty-gritty of SMW syntax, some behaviours are handled automatically:
    • Asterisks are appended automatically where they are wanted.
    • Boolean prefixes are added automatically. Care is taken that they are prefixed only to actual tokens of the expected length. A fatal error (RuntimeException) will occur if they are added to shorter phrases.
    • Double quotes, however, are