Federated Content Search for Swiss language corpora

CLARIN offers a special search engine that lets users query many corpora at the same time. This is the Federated Content Search (FCS), a tool that uses endpoints to pool results from language corpora hosted at different CLARIN centres. The endpoints are created by the CLARIN centres themselves, and each centre decides which of its corpora can be searched through the aggregator (the search engine that compiles the results).

This system has proven to make relevant textual data easier to find, to circumvent legal issues that prevent corpora from being copied to another location, and to improve storage efficiency by keeping files decentralized across multiple locations.

Although powerful, FCS also has limitations. One of them is the lack of a ranking algorithm: because a federated search pools query results from different locations, it has no method for ranking the hits it receives against one another. Nonetheless, the strength of FCS lies in the sheer amount of data it covers, which makes it well suited to statistical analysis of language data.
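The pooling behaviour and the missing cross-endpoint ranking can be illustrated with a small sketch. It is a toy model only, not the actual FCS implementation: the `Endpoint` class, the `federated_search` function, the centre and corpus names, and the example sentences are all hypothetical, and the endpoints are simulated in memory rather than contacted over the network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    endpoint: str  # which centre returned the hit
    corpus: str    # which corpus at that centre
    text: str      # the matching sentence


class Endpoint:
    """Hypothetical stand-in for a centre's endpoint (real ones are remote services)."""

    def __init__(self, name, corpora):
        self.name = name
        # Only the corpora the centre has chosen to expose to the aggregator.
        self.corpora = corpora

    def search(self, term):
        term = term.lower()
        return [
            Hit(self.name, corpus, sentence)
            for corpus, sentences in self.corpora.items()
            for sentence in sentences
            if term in sentence.lower()
        ]


def federated_search(endpoints, term):
    """Send the same query to every endpoint and pool the hits.

    Hits are concatenated in endpoint order. No relevance score is
    computed, so hits from different endpoints are not ranked
    against each other.
    """
    with ThreadPoolExecutor() as pool:
        per_endpoint = list(pool.map(lambda e: e.search(term), endpoints))
    return [hit for hits in per_endpoint for hit in hits]


# Toy data, invented for illustration.
endpoints = [
    Endpoint("centre-a", {"corpus-1": ["The mountain path was steep.",
                                       "Snow fell overnight."]}),
    Endpoint("centre-b", {"corpus-2": ["Parliament debated the mountain railway."],
                          "corpus-3": ["Nothing relevant here."]}),
]

hits = federated_search(endpoints, "mountain")
hits_per_endpoint = Counter(hit.endpoint for hit in hits)
print(hits_per_endpoint)
```

Even without a ranking, the pooled hits can be counted per endpoint or per corpus, which is the kind of aggregate figure a statistical analysis would start from.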

In collaboration with the Digital Discourse Lab at ZHAW and the Linguistic Research Infrastructure at UZH, CLARIN-CH has developed its own federated content search engine, specifically for Swiss language corpora.

The Swiss version of FCS comprises 24 selected resources from two repositories. Eleven corpora, such as the Text + Berg corpus, are available through the LCP public corpora repository, whereas the remaining 13, including the Swiss Federal Parliament corpus, come from the Swiss-AL repository.

For a visual depiction of how a federated content search is processed, watch the video below: