What is Gagambá?
Gagambá is an automated spidering program that scours the Web for text in specific languages. With it, large corpora of written language can be built easily and in a short period of time. Language-specific word lists can then be built from those corpora.
How does it work?
Perl users, see the quick reference.
Gagambá is a Perl program that is run from the command line, with its options set in a configuration file. The user specifies a number of key search words, which serve as the starting point for a larger list of sites to visit. With those keywords, the program queries Google's index for a list of matching sites. The spider then visits each of those sites and collects the text if it meets certain conditions. More importantly, it adds all the links found on each site to the list of sites to visit.
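The link-following step can be sketched as a breadth-first crawl over a queue of URLs. This is a minimal illustration in Python, not Gagambá's own Perl code: the names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are hypothetical, the seed list stands in for the results of the keyword search, and the in-memory `FAKE_WEB` replaces real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, starting from seed_urls.

    fetch(url) is assumed to return the page's HTML as a string,
    or None if the page cannot be retrieved.
    Returns a dict mapping each visited URL to its HTML.
    """
    queue = deque(dict.fromkeys(seed_urls))  # de-duplicated, order kept
    seen = set(queue)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # Add every link not seen before to the list of sites to visit.
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Made-up pages standing in for the Web.
FAKE_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}
visited = crawl(["http://a.example/"], FAKE_WEB.get)
```

Starting from one seed, the crawl reaches all three pages because each visited page contributes its links to the queue. A real spider would also need politeness delays, robots.txt handling, and error handling, which are left out here.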
The user can specify the conditions that text must meet to be kept. The spider goes through the text of a site line by line and keeps or discards each line depending on whether it meets a threshold condition. The threshold is user-defined: it is either a maximum or a minimum on the number of words in the line that appear in one or more word lists. For example, a user could discard lines in which more than a certain percentage of the words are English.
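The English-percentage example can be sketched as follows. This is an illustrative Python sketch rather than Gagambá's actual Perl implementation: `keep_line`, `filter_text`, `min_fraction`, and `max_fraction` are hypothetical names, the threshold is assumed to be expressed as a fraction of the words in a line, and the tiny word list and sample text are made up.

```python
import re

# Runs of Unicode letters; digits, underscores, and punctuation are ignored.
WORD_RE = re.compile(r"[^\W\d_]+")


def keep_line(line, word_list, min_fraction=None, max_fraction=None):
    """Return True if the share of the line's words found in word_list
    lies within the given bounds. Lines with no words are discarded."""
    words = [w.lower() for w in WORD_RE.findall(line)]
    if not words:
        return False
    fraction = sum(w in word_list for w in words) / len(words)
    if min_fraction is not None and fraction < min_fraction:
        return False
    if max_fraction is not None and fraction > max_fraction:
        return False
    return True


def filter_text(text, word_list, **thresholds):
    """Apply keep_line to each line of text and return the kept lines."""
    return [
        line
        for line in text.splitlines()
        if keep_line(line, word_list, **thresholds)
    ]


# Made-up sample: a tiny English word list and two lines of text.
ENGLISH = {"the", "is", "on", "a", "and", "of"}
text = "The cat is on the mat\nAng pusa ay nasa banig\n"

# Discard lines in which more than half of the words are English.
kept = filter_text(text, ENGLISH, max_fraction=0.5)
```

Here the first line is dropped because four of its six words are in the English list, while the second line contains none and is kept. Passing `min_fraction` with a word list for the target language would express the opposite kind of threshold.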
We are currently using Gagambá to build corpora and word lists for several languages, including Somali, Kirundi, Tagalog, and Basque. The word lists can be downloaded. Contact us if you could use them.