A Brief History of dexonline

Abstract: This contribution introduces the monolingual dictionary dexonline, the most successful collaborative project in Romanian e-lexicography to date, described from the perspective of its own creators. We provide details about the team behind the project and the principles guiding their work, and touch upon some peculiar aspects in the making of this dictionary, such as the relationship between amateur and mainstream lexicography, the interactions with dictionary users and their consultation needs, as well as future directions in the development of this project.

Keywords: online dictionary, e-lexicography, definition, full-text search, fuzzy search

1 Introduction

dexonline is an online monolingual Romanian dictionary. Its name is partly derived from the abbreviation DEX, which stands for “Dicționar explicativ al limbii române” (Explanatory Dictionary of the Romanian Language), currently the most popular monolingual print dictionary in Romania. In dexonline, however, definitions are digitized from a variety of notable printed sources, and their format has been kept as in the original.

The project began in 2001 and has since run on custom-made software. As of May 2019, it has reached 3.3 million monthly visitors and 15 million monthly page views (12-month averages). The dictionary contains 890,000 definitions from 80 different sources (of which 28 are fully digitized). We started by digitizing the 1998 edition of DEX, the Romanian Explanatory Dictionary published by the Iorgu Iordan Institute of Linguistics of the Romanian Academy; the effort took three years. This was followed by several relational and morphological dictionaries (synonyms, antonyms, orthography). In time we expanded to various niches (music, mythology, aviation, religion, slang) and time periods (from 1929 to present-day neologisms).

Beyond DEX 1998 we did not have a clear direction for adding more sources. Due to copyright issues we focused on dictionaries made using public money. Sometimes we were able to convince authors to grant us publication rights. Other factors we took into account were the availability of a digital format (which drastically reduced the publication time in comparison to a paper format) and the volume of work required. We are currently three years into digitizing the second edition of Micul dicționar academic (The Concise Academic Dictionary), also published by the Iorgu Iordan Institute of Linguistics, and we expect this to take another year.

Our team includes software engineers, librarians, Scrabble aficionados and artists, but no trained linguists. We probably grew dexonline in a rather suboptimal way: we could have been quicker, we could have saved ourselves a lot of manual labor, we could have built a more financially lucrative site. But ultimately we were the first people (and the only people so far) who rolled up their sleeves and got to work. And the figures quoted above attest to the usefulness of our work.

Everyone in the dexonline team shares an interest in proper grammar, in etymology, in learning the fun history behind this or that word. But we have no formal training in such matters. Because of this, the website aims to rely on the authority and reputation of the authors of each dictionary in our resource pool.

dexonline has no real finances. Up until 2010 we had no income at all. Nowadays we ask for donations and we run one banner on every page. These cover some of the costs, such as website hosting and digitizing the print volumes, but they are nowhere near enough to pay salaries. That is why we rely on work done free of charge by volunteers.

2 Guiding principles

Some of the guiding principles that we have relied upon throughout the years are briefly outlined below (see also Frâncu and Borza 2010).

2.1 We fulfill a need of the Romanian-speaking community

We exist mainly because our users need dexonline and find it useful. dexonline started as a personal project with no intention and no expectation to grow to its current size. But it fulfilled a demand of the online “market”. When the demand switched to mobile devices, we did our best to meet it.

If the Romanian Academy (the Institute of Linguistics, in particular) had created an online dictionary, we would probably not exist now. To be fair, the Academy did create some internal e-dictionaries, including the electronic version of DLR, i.e. the eDTLR project, but, as it turns out, they have no intention of giving the public access to them. On several occasions, we reached out to the Academy, offering our expertise and even asking them to take over our project officially, but we have never received a positive answer.

2.2 We are the custodians, not the authors of the definitions

We copy definitions ad litteram. We almost never alter them except in cases of obvious typos, and when we do, we tag them as corrected in order to keep track of the changes. Ideally, these corrections would find their way back into the printed versions, but so far nobody has asked us for the list of the errors we amended. We did implement an annotation system, so that we can occasionally offer our own comments in boxed footnotes, as in the two examples below.

Example 1: Entry headed by the noun “pancovă” and comment
Example 2: Entry headed by the adjective “surjective” and comment

The comment in Example 1 refers to the incorrect Hungarian etymon provided in the 1998 and 2009 editions of DEX, although in other Romanian dictionaries the origin of the headword in question had been recorded correctly, e.g., Dicționarul etimologic român (1958-1966) (Romanian Etymological Dictionary), or even earlier, in 1939 in Dicționaru limbii românești (Romanian Language Dictionary). In Example 2, the author of the comment puts the faulty definition down to an incomplete word-for-word translation of its French-language counterpart in Larousse; he then goes on to suggest a clearer definition by providing a full sentence alternative.

People sometimes think of us as “the official Romanian dictionary” and ask us to change or clarify some definition. We always take the time to explain that this is not the case. There will always be missing or inaccurate definitions. Our approach is to look for dictionaries that fill the gaps, rather than to write our own definitions.

Our own contributions are of a different nature. We put a lot of time into better organizing the data, into building a blazing fast website and into giving back to the community. All of these efforts are detailed below.

2.3 We try to answer all our users’ questions and address all their concerns

We frequently receive questions via email and social media. Some questions are easy to answer by pointing the respective user to a definition. Broader questions require more research and we try to do that. Sometimes it takes hours, but the result is worthwhile. This is how some of our expanded linguistic articles were born. For example, there is a lot of confusion surrounding the meaning of bilion (English: billion). Some dictionaries say it means one thousand million, while others say it means one million million. It turns out that Romania never adopted either the long scale or the short scale for large numbers, so confusion often ensues. We collected our observations in an article and we customized our software to point to this article whenever someone searches for a relevant keyword, like bilion. We even convinced European institutions (see the EU’s IATE) to fix the problem in their translations.

People sometimes object to the contents of certain definitions. One objection that is easy to fix is an unattributed trademark for words like teflon or adidas. Once the trademark holder contacts us with proof of the trademark, we simply add a footnote to attribute the trademark. Much more sensitive are the political and social issues surrounding definitions such as țigan (Gypsy), homosexualitate (homosexuality) or penticostal (Pentecostal). These definitions can be outdated or downright offensive, especially in older dictionaries. We made a conscious decision not to censor them, but we annotated them to clarify the appropriate use of those words in modern times.

Example 3: Entry headed by the adjective “penticostal” and comment

As illustrated in Example 3 above, the definition of penticostal from DEX 2009 uses the word sectă (sect), which in Romanian has a strong negative connotation. We annotated the definition to clarify that Pentecostalism is an officially recognized religion in Romania.

2.4 We have no boss

This motto started as an in-house joke because dexonline was initially a pet project. Everyone on the team holds a different job and contributes to dexonline as time permits (even so, by our best estimates, we seem to have accrued over 100,000 man-hours of dexonline work). dexonline has no formal corporate structure. We accept help from wherever we can get it. Some of us contribute as editors, some as software engineers, some as curators for the Word of the Day feature.

But there is some deeper truth to this statement. The best part of having no boss is that nobody can tell us what to do. dexonline’s to-do list keeps growing over time because every feature that we implement (Word of the Day, hangman, etc.) opens up new possibilities. We try to work on features that our users request, because, after all, the entire dexonline enterprise is for the public benefit. But it occasionally happens that highly requested features remain unimplemented for years. People had been requesting a word-of-the-day feature for five years before we finally got around to it.

Having no boss also gives us the freedom to use our own moral compass for some thorny issues such as:

  • Keeping old definitions around, even when they are decried as derogatory. We decided that keeping a repository of past language is like holding a mirror to society’s face. Wiping out that repository makes us prone to repeating the same mistakes.
  • Refusing to censor dexonline in any way, even though a significant segment of our users are children. We decided that censorship in itself is more abhorrent than any explicit concepts that some definitions might refer to. It is unclear whether exposing children to concepts unsuitable for their age has any harmful effects. On the other hand, raising people up amidst censorship has well-documented and long-lasting effects.
  • Occasionally, voicing our concerns about political issues. We have not done this frequently, but whenever we felt that the rule of law itself was under attack in Romania, we wrote opinion pieces featured prominently on every page of our site. Sadly, we felt this to be necessary several times in the past decade.

2.5 We give free access to most of our data

The vast majority of dexonline’s data, as well as all its source code, are available for download under the GNU General Public License. Any document or program distributed under this license is called free. To clarify, it is free as in free speech, not free as in free beer (see Endnote 1). This license offers dexonline users four fundamental freedoms:

  1. the freedom to use the data for any purpose,
  2. the freedom to change the data to suit your needs,
  3. the freedom to share the data with your friends and neighbors, and
  4. the freedom to share the changes you make.

There are many reasons for adopting this license. Chief among them is that, to the best of our knowledge, dexonline has the largest data set of Romanian definitions. Attempting to curtail this data set by means of copying restrictions would be barbaric. To prove our point, for the first few years of dexonline, before we started publishing our data, we had a lot of trouble with bots (automated visitors that attempted to download every single definition). On several occasions they brought the server to its knees. It became apparent that people value our data set and want a copy, for whatever reasons. We decided to turn the sharing of culture and the dissemination of information into our allies, not our enemies.

While some freeloaders simply used our data to set up their own dexonline clone and try to monetize it, others built on it and gave it back to the community. For example, there are dexonline apps available for every kind of smartphone and operating system. Their developers are not affiliated with dexonline, yet they made better programs than we could ever have, all stemming from the availability of our data set.

3 Moving beyond

By way of conclusion, we would like to think of the future of dexonline. Having this large data set of words and definitions means that we can tinker with it, the only limits being our imagination and our free time.

One of our boldest dreams is to digitize the biggest Romanian dictionary, DLR, a process already started by the Romanian Academy. The Academy has an electronic version of DLR. Digitizing it was a long, publicly financed project, but the final result is kept away from public access (the official term is “for research only”).

We also strive to offer our users all the advantages that the digital medium has over paper dictionaries:

  1. Faster searches. Punching a word into a search box takes 1-2 seconds (and we can proudly say that our server then responds in less than one second with the results). Browsing a printed book looking for the same word can take 10 seconds.
  2. Full text searches. Cannot remember what title they gave the Spanish king’s children? Just search for copil rege Spania (child king Spain) and tick the “full-text search” checkbox.
  3. Suffix searches and other regular-expression-style (wildcard) searches. Here * stands for any run of characters and ? for exactly one character: *uar will match words ending in -uar, while zg?r* will match any word with z, g and r in the first, second and fourth positions.
  4. Simultaneous searches in all the dictionaries at once.
  5. Searching for an inflected form when the base form is unknown. For example, users can look up the plural inflected form semințelor if they are not familiar with the lexeme sămânță (seed). This also means we can let the users click on any word in a definition and take them to that word’s definition.
  6. Approximate results when the user types in an incorrect word: patinuar, instead of patinoar. When the intention is clear, we can even redirect the user outright, for example, when the user types in the incorrect form repercursiune he/she is automatically redirected to repercusiune.
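
Three of the search modes listed above (wildcard patterns such as *uar, inflected-form lookup and approximate matching) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code that runs on the site: the tiny word lists, the function names and the 0.8 similarity cutoff are all invented for the example, and the approximate matching uses Python's standard difflib module as a stand-in for whatever algorithm a production dictionary would use.

```python
import re
from difflib import get_close_matches

# Toy data invented for this example; a real dictionary would query a database.
LEXEMES = ["trotuar", "estuar", "patinoar", "zgură", "zgârci", "zgomot",
           "sămânță", "repercusiune"]
# Hypothetical map from inflected forms to their base form (lexeme).
INFLECTED = {"semințe": "sămânță", "semințelor": "sămânță"}


def wildcard_to_regex(pattern):
    """Translate '*' (any run of characters) and '?' (one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_search(pattern, words=LEXEMES):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


def lookup(query):
    """Try an exact match, then an inflected form, then approximate matches."""
    if query in LEXEMES:
        return ("exact", [query])
    if query in INFLECTED:
        return ("inflected", [INFLECTED[query]])
    # Assumed cutoff: keep only candidates at least 80% similar to the query.
    return ("approximate", get_close_matches(query, LEXEMES, n=3, cutoff=0.8))


if __name__ == "__main__":
    print(wildcard_search("*uar"))   # ['trotuar', 'estuar']
    print(wildcard_search("zg?r*"))  # ['zgură', 'zgârci']
    print(lookup("semințelor"))
    print(lookup("patinuar"))
```

Trying the cheapest strategy first (exact match, then the inflected-form table, then similarity scoring) keeps the common case fast and reserves the more expensive approximate comparison for queries that would otherwise return nothing.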

Currently our definitions are just loosely formatted blobs of text. Moving forward, we are trying to add more structure, isolating the meanings (in a hierarchical tree), usage examples, expressions, semantic relationships and etymology. Some examples are urs and jar. This is a phenomenal improvement over a plain list of definitions of the word from 10-20 sources, which inherently contains a large degree of redundancy and fails to emphasize and reconcile the differences between the definitions. As of May 2019, we estimate that we have completed approximately 20% of the information structuring task.
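
To make the idea of a hierarchical tree of meanings more concrete, here is a minimal Python sketch of how one structured entry could be represented. The class names, field names and the English glosses for urs are placeholders invented for this illustration; they do not reproduce dexonline's actual data model or any dictionary's wording.

```python
from dataclasses import dataclass, field


@dataclass
class Meaning:
    """One node in the tree: a sense, its examples and its sub-senses."""
    text: str
    examples: list = field(default_factory=list)
    children: list = field(default_factory=list)


@dataclass
class Entry:
    """A structured entry: headword, etymology and top-level meanings."""
    headword: str
    etymology: str
    meanings: list = field(default_factory=list)


def outline(entry):
    """Flatten the meaning tree into numbered lines such as '1.', '1.1.'."""
    lines = []

    def walk(nodes, prefix):
        for index, node in enumerate(nodes, start=1):
            number = f"{prefix}{index}."
            lines.append(f"{number} {node.text}")
            walk(node.children, number)

    walk(entry.meanings, "")
    return lines


# Placeholder content, written for this example only.
urs = Entry(
    headword="urs",
    etymology="Latin ursus",
    meanings=[
        Meaning(
            "large omnivorous mammal (bear)",
            children=[Meaning("the fur or meat of this animal")],
        ),
        Meaning("(figurative) gruff, unsociable person"),
    ],
)

if __name__ == "__main__":
    print("\n".join(outline(urs)))
```

Once meanings live in a tree like this, each node can carry its own usage examples and sources, so overlapping definitions from several dictionaries can be merged under one sense instead of being listed side by side.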

We are aware of the playful component of learning a language, so we provide a short list of word games (Hangman, Words Scramble, Word Mill) that we would like to expand in the future.

References

Cătălin Frâncu and Radu Borza (2010) “Inițiative lexicografice colaborative. Cazul DEX online”. Presentation delivered at the conference ConsILR Resurse lingvistice și instrumente pentru prelucrarea limbii române [Linguistic resources and instruments for processing the Romanian language], Muzeul Național al Literaturii Române, 6-7 May 2010.

Dictionaries

Dicționarul explicativ al limbii române (2nd edition, revised). 2009, author: Romanian Academy. Bucharest: Editura Univers Enciclopedic Gold.

Dicționarul explicativ al limbii române (2nd edition). 1998, author: Romanian Academy. Bucharest: Editura Univers Enciclopedic.

Dicționarul etimologic român (1958-1966), author: Alexandru Cioranescu. Tenerife: Universidad de la Laguna.

Dicționarul limbii române. Serie nouă. 1965-2010. Bucharest: Editura Academiei Române.

Dicționaru limbii românești. 1939, author: August Scriban. Iași: Institutu de Arte Grafice „Presa bună”.

Endnote

1 This overlap of meanings does not occur in Romanian or in other Romance languages, which is why we hackers sometimes refer to free software as libre software, as opposed to gratis software. In this context, free refers to freedom, not price.