Projects
Report on Multilingualism on the Internet
As one of the inputs to the World Summit on the Information Society (Geneva, 2003), the UNESCO Institute for Statistics is preparing a report on the status of multilingualism on the Internet, under Initiative B@bel. The report addresses:
- the change in the balance of languages on the internet over time,
- the potential dominance or regression of English on the internet,
- the exploration of some methodological work for assessing linguistic diversity on-line.
As a communications medium, the internet is rather complex: it is large, decentralized in structure, offers varied communications modes, and is rapidly changing in all its dimensions. The size and decentralized structure of the internet complicate the sampling procedures that one must use to comprehensively survey language use. In addition, the structure of linkage among users, sites, countries, etc. becomes a central issue, since these linkages determine what an individual can and will access. Technical differences among communications modes require different survey methods for each one investigated. While comprehensive data archiving efforts are underway for the World-Wide Web, no such efforts appear to have been undertaken for interactive chat modes of communication, where multilingualism has a markedly different character.
Thus the analytical report will adopt a two-pronged approach:
- First, it is necessary to survey what is known about the state of the world’s languages on the internet through existing sources, especially academic literature, marketing reports, news releases, and technical reports on the internet’s structure, organization and function. Since many of these sources do not directly address questions of multilingualism and linguistic dominance, a critical review must be conducted to identify the best possible current understanding of the distribution of the world’s languages on the internet. Particular attention needs to be paid to the inter-connections of sites and countries, and their potential effects on users’ experience of and exposure to different languages.
- Second, it is necessary to examine the technical challenges of a truly comprehensive survey of multilingualism on the internet using automatic means. Technologies are available for automatic language identification, but these have never been used on the scale that a comprehensive survey of languages on the internet would require. In addition, since the languages of the world number several thousand, with hundreds of them in written form, we cannot know in advance of the survey just what languages we might find. This diversity also poses other technical problems for automatic identification, as the degree of linguistic difference between any two languages varies, and any one language might have several different electronic forms in which it is regularly used. Hence, without further study, it cannot be predicted how well the existing language identification technologies will perform, or in what direction they would need to be developed for this purpose. In addition, the sources of internet communication data for this analysis need to be located or developed.
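
To make the idea of automatic language identification concrete, the following is a minimal sketch of one well-known family of techniques: comparing ranked character n-gram profiles. It is an illustration only and is not the method used in the report. The function names, the tiny training samples, and the parameter values (`n=3`, `top=300`) are all assumptions chosen for the example; a real survey would need far larger training texts and many more languages.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams of `text`, most frequent first."""
    # Normalize case and whitespace, and pad so word boundaries count as context.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the key of the language profile closest to `text`."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical, deliberately tiny training samples (assumption for illustration).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals",
    "french": "le renard brun saute par-dessus le chien paresseux et le chat est assis sur le tapis avec les autres animaux",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato se sienta en la alfombra con los otros animales",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the animals sat on the mat", PROFILES))
```

Note that `identify` always returns the closest profile it has been given, even when the text is in a language with no profile at all. This reflects the difficulty described above: since the set of languages present cannot be known in advance, a survey-scale tool would also need some way to flag text that matches none of its known languages. The sketch also assumes the text is already decoded; it does not handle the case of one language appearing in several different electronic forms.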
Contact: Diane Stukel, d.stukel@unesco.org




