Posts Tagged ‘linguistics’

Paul Payack of the Global Language Monitor is claiming the 1 millionth English word is coming soon.  He says a new English word is coined every 98 minutes, so the 1 million marker will arrive about 15 days hence.  The CBS article that tipped me off to this is pretty amusing in the quotes it selected from linguists, which resoundingly cried “bullshit.”  But the best quote came from Payack himself:

We believe words can be counted if you define them in the right way. You can count them like anything else in science. You can count how many atoms there are in the ocean.

Let’s think about counting the atoms in the ocean for a moment. What about where rivers flow into the ocean? Where is the boundary line? Salt water and fresh water mingle continuously, and finding the exact boundary is pretty much impossible. If we draw an arbitrary line, surely we will get too much in one place and too little in another. Also, what about rain and evaporation? Counting the atoms would require an instantaneous snapshot of the entire ocean at the atomic level. It can’t be done.

You run into similar problems counting words.  Compound words blend into single words, and words leave the language as well as enter it.  How do you detect this?  You’d need a snapshot of the entire English language as it is spoken, typed, and read all around the world.  What is a word in one dialect isn’t necessarily a word in another dialect.  Where do you draw the line?

I just completed my first guest blogging post over at mind x the + gap, where I talked about the mutual history of language and commerce, along with some thoughts on how that will continue into the future. Since the focus of Mil Joshi‘s blog leans more toward psychology and economics, the following is a slight adaptation more in line with my normal content.

Commerce is a human convention deeply entwined with language. Economic motivations were among the many reasons ancient (and modern) empires conquered other lands, spreading their languages beyond their natural range. Traders would travel to distant lands, encountering speakers of exotic languages. And where two languages meet, words begin to pass back and forth. In cases where bilingual speakers were few to none, pidgin languages developed. Pidgins are languages with simplified grammar and vocabulary, and are never spoken as a first language. They come about as a means of communication between speakers of different languages for the purpose of trade. When a pidgin is spoken widely enough that children in the community grow up learning it as a first language, it becomes a creole. Creoles have many fascinating characteristics, but the point here is that commerce is a driving factor in their creation. When a conquering empire brings its own language, it either supplants the native language or influences it heavily. Pidgins, on the other hand, develop because speakers are motivated to communicate in order to trade.

Groups of speakers who remain in constant contact tend to speak the same dialect of a language. When a group breaks off and becomes isolated (contact with the original group is infrequent or not widespread), their dialects begin to diverge. Mass communication is changing this landscape, allowing larger and larger groups of people to remain in constant contact. As a result, minority languages are being spoken even less in favor of popular languages. This process is called linguistic homogenization. If we follow the slippery slope to the extreme, eventually there will be a single language spoken by all people. That eventuality isn’t likely to happen in our lifetimes, and not just because it would require almost all native speakers of other languages to die out. A far more likely scenario is that a handful of commerce languages will be spoken by the vast majority of people. Commerce languages are the popular languages people do business in (English, Mandarin, etc.).

There are many factors driving linguistic homogenization, and commerce is certainly one of them. In the modern world of the internet and mass media, attention is the scarce resource people compete for. If you want to capture the attention of others, you need to maximize your reach, and doing so typically means choosing a language of commerce. Minority languages present a barrier to the widest possible dissemination of information (except when the intended audience is exclusively speakers of that language). The attention economy promotes linguistic homogenization.

Machine translation services, such as Google Translate, potentially have the power to change this. As the quality of these services improves, it becomes less and less necessary to publish exclusively in commerce languages. Linguistic homogenization may not be the inexorable force it appears to be today. Of course, the output of machine translation can be pretty abysmal. Will the quality of machine translation improve fast enough, and will the business case be strong enough, to turn the tide of linguistic homogenization?  Those betting on machine translation services surely hope so. But there is a dueling problem here: for machine translation to truly counteract linguistic homogenization, it has to be freely available (or ridiculously cheap). These systems are difficult to build and require great computational resources. The outcome will almost certainly be a matter of economics as well as science.

While the future progress of commerce and language may be uncertain, what is certain is that they will continue to heavily influence each other. And there’s nothing new about that.

I hereby declare that the word literally has not lost its meaning, despite a rash of rumors to the contrary.

What would it even mean for a word to lose its meaning? A word can change from one meaning to another, certainly.  Maybe you could argue that a word that has dropped out of usage has lost its meaning.

You hear complaints of that sort all the time, but what is being missed is the fact that language is fluid. Meanings evolve as the need arises (and there are many kinds of needs). Speakers each carry a somewhat different representation of the language in their heads, and once like-minded speakers agree on a novel usage and adopt it into their own representations, the language evolves.

The debate over literally is literally nothing new. Turning to old faithful, the American Heritage dictionary:

Usage Note: For more than a hundred years, critics have remarked on the incoherency of using literally in a way that suggests the exact opposite of its primary sense of “in a manner that accords with the literal sense of the words.” In 1926, for example, H.W. Fowler cited the example “The 300,000 Unionists … will be literally thrown to the wolves.” The practice does not stem from a change in the meaning of literally itself—if it did, the word would long since have come to mean “virtually” or “figuratively”—but from a natural tendency to use the word as a general intensive, as in They had literally no help from the government on the project, where no contrast with the figurative sense of the words is intended.

So literally has been known to be a general intensive for quite some time. Why the fuss now?

Twitter is my new linguistic data collection engine, btw, and it turned up a multitude of great results.

Reference: “literally,” in The American Heritage® Dictionary of the English Language, Fourth Edition. Houghton Mifflin Company, 2004. Accessed January 27, 2009.

This is a subject much larger than the treatment I am about to give it.  Linguistic homogenization occurs in modern states where regional dialects are marginalized and a standard dialect is advanced as the primary method for acceptable public communication.  The powerful favoring a single dialect is nothing new, but now more than ever, states are able to impose this on the wider populace.  European countries encourage one or two primary languages to be taught in school and used in public.  America does something similar with Standard American English.  Speaking a non-standard dialect is often seen as a barrier to employment and movement in higher social circles.  Basically, the snobs keep you down if you don’t talk like they do.

I was reading on Language Log earlier about the Uniformitarian Principle.  Uniformitarianism is simply the idea that things are now as they have always been, so we can learn how things were by learning how they are now.  Language Log describes how modern Europe no longer holds the key to language in prehistoric Europe thanks to the ability of modern states to impose linguistic homogenization.  Think about that for a second.  Modern states, presumably democratic, are so powerful they even tell you how to talk.  Perhaps even how you think.  Is that a paranoid leap?  Am I overstating it?  Even absolute dictators of past centuries didn’t have that kind of power.

But it’s not like one single person is doing this.  Instead they are doing it.  The ineffable they.  But if they are telling us how to think, why do we listen?  We can’t help it, we’re too young when it happens, and then we become them.

Absolute dictators of the past could not do this for many reasons.  They didn’t have the infrastructure to educate the masses, nor did they have popular media to transmit one dialect into every home on a daily basis.  A population too large for all of its parts to remain in constant contact will begin to diverge dialectally.  But educating the masses would have been looked down upon anyway since giving people too many ideas tends to make them question things like a single all-powerful leader calling all the shots.  So now that we are educated enough to know all-powerful dictators are bad news, we have replaced them with power structures more complicated and inscrutable.

A recent post by Daniel Lemire posing a simple mathematical puzzle revealed in stark contrast the bars of my mental prison.  So what are the bars like of this bigger prison we cannot see?  Philip K. Dick called it the Black Iron Prison.  I’ve always found that concept appealing.

NACLO 2009

Posted: 15 October 2008 in Uncategorized

The North American Computational Linguistics Olympiad has been announced for 2009.  It’s a great outreach program to high school students to increase interest in general and computational linguistics.  I’ve talked about it before here.  I have reproduced the announcement below the jump.

The North American Computational Linguistics Olympiad is an annual competition open to US high school students that introduces kids to computational linguistics at a much younger age than people normally hear about it. I didn’t hear about CL until I was three years into my undergrad program. The instant I did hear about it, I knew I wanted to do it. Most people I talk to about it look at me like I’ve just uttered a phrase of Klingon. I suspect most people don’t hear about it at all, or if they do, it’s sometime during their undergrad program and not at the beginning, when they might be better able to plan their educational career path. Also, CL is pretty much a graduate subject and rarely taught before then. Granted, a lot of the math involved is beyond what’s taught to high school students and early undergrads, but the linguistics is not. And thinking about linguistics computationally is not. So NACLO is doing an extremely valuable service, which I support completely. And not just because one of my professors is one of the General Chairs of its organizing committee. She can no longer affect my grade and I have no need to suck up — so this is genuine. How’s that for full disclosure?

One of my Google alerts popped up a post on a spam blog, which I tracked down to this original post about young kids doing some great things in science. In the post is an interview with last year’s winner, Adam Hesterberg. He said, “I’d never studied linguistics, and ‘computation’ sounded like boring calculation.” That reminded me that computation might mean a different thing to most people than it does to scientists. I’m no corpus linguist, so I’m not gonna try to find out right here. What I suspect is that computation has a more “hard work” connotation for people outside of science: it’s the “plugging and chugging” meaning. Inside science, it’s tacked onto the beginning of some other field’s name to mean anything in that field that can be computed. Computational linguistics deals with the computable aspects of linguistic theories. A very quick search on Wikipedia finds at least a dozen other computational fields.

Is it a good idea to use this name when approaching high school students? What about language technologies? Well, the competition isn’t about language technologies, it’s about critical problem solving in a linguistics setting. And trying to fit that into a competition name isn’t going to work, either. North American Critical Problem Solving about Linguistics Olympiad (NACPSLO)? It makes me think of narcolepsy.

So my proposal is North American Logic and Language Olympiad (NALLO). It’s easy to say (rhymes with hallow) and accurately describes the subject matter. Plus, I think it has broader appeal: a lot of kids are interested in logic, language, or both. It shakes free of the negative connotation of computation and draws kids in, where the computational side can be introduced a little more easily. The downside is that it doesn’t mention linguistics directly, which might trouble some people who are a little more traditional about their outreach.

What do you think?

I have been interested in alien (invented) languages since my first brush with Elvish in The Lord of the Rings. I checked out The Klingon Dictionary from the library in high school and currently own a copy of it and The Languages of Middle Earth.  During high school, I nerdily amused myself by attempting to develop a language for Antarians, one involving gutturals and whistles.  Speaking it myself was nearly impossible, though I would occasionally practice, trying to go from a growling sound to a whistle as quickly as my human apparatus would permit.  I imagine the average passerby might have considered calling the police to have me committed, or at least checked for rabies.

New Scientist has a brief article about the possibility of actually preparing for what alien languages might be like.  The argument Terrence Deacon of UC Berkeley makes (according to the article) is that language serves a purpose: it is a communication system for describing the world, and since the world is in some way a fixed point of reference (though perception of the world is not), abstract symbolism should be a feature common to all languages.

At one point, the study of xenolinguistics would have been a dream job for me.  A nice office at NASA, a field that will probably never be verifiable.  Could you ask for more?

I was asked recently about the motivation for Abney’s DP (determiner phrase) hypothesis. That is, the hypothesis that determiners are not part of English noun phrases but head their own phrases, of which NPs are complements. I couldn’t remember the justification I was given in my Syntax I class, so I went back to the textbook (Syntax: A Generative Introduction by Andrew Carnie). I found the following interesting excerpt:

“… for lack of a better place to put them, we put determiners … in the specifiers of NPs. This, however, violates one of the basic principles underlying X-bar theory: All non-head material must be phrasal. Notice that this principle is a theoretical rather than an empirical requirement (i.e., it is motivated by the elegance of the theory and not by any data), but it is a nice idea from a mathematical point of view, and it would be good if we could show that it has some empirical basis.”

This clashes a bit with my empirical sensibilities. It represents very much the rationalist point of view in linguistics: that we can probe our own understanding of language by judging what we perceive to be grammatical or ungrammatical. The empiricist view looks at it from another angle: does it appear in data? So the theoretical view might be “nice,” but if it is not supported by the data, it is crap.

Treebanks don’t use DPs (at least none that I’ve seen), so automatic parsers typically have no concept of them.  I wonder if they would add any value.  I’m guessing they would just run into sparsity issues, since another set of tags would have to be estimated.  But who knows; the extra structure might be helpful in complex situations.
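To make the two competing analyses concrete, here is a minimal sketch contrasting the traditional NP analysis with Abney’s DP analysis for the phrase “the dog.” The tuple encoding and the `bracket` helper are my own illustration, not part of any treebank toolkit:

```python
# Two analyses of "the dog". Trees are plain nested tuples:
# (label, child, child, ...), with strings as leaves.

# Traditional analysis: the determiner sits in the specifier of NP.
NP_ANALYSIS = ("NP", ("Det", "the"), ("N'", ("N", "dog")))

# Abney's DP analysis: the determiner heads its own phrase,
# taking the NP as its complement.
DP_ANALYSIS = ("DP", ("D", "the"), ("NP", ("N'", ("N", "dog"))))

def bracket(tree):
    """Render a tuple tree as labeled bracketing, e.g. [Det the]."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "[%s %s]" % (label, " ".join(bracket(c) for c in children))

print(bracket(NP_ANALYSIS))  # [NP [Det the] [N' [N dog]]]
print(bracket(DP_ANALYSIS))  # [DP [D the] [NP [N' [N dog]]]]
```

The bracketings show the difference at a glance: under the NP analysis the determiner is non-head material inside NP, while under the DP analysis every non-head element is phrasal, which is exactly the X-bar principle Carnie mentions.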

While watching the 2000 version of Henry James’ The Golden Bowl, I heard the once-common phrase “The deuce only knows…”  I’m always looking for vintage profanity, and this appealed to me strongly.  I’ve heard it hundreds or thousands of times before, of course, but here it was brought to the fore of my attention.  After some brief research, I found ties to 16th-century northern German, Family Guy, and playing dice.  The word deuce seems most strongly tied in meaning to “the devil,” and is used interchangeably in old-fashioned profanity (cf. What the devil and What the deuce).

There are attested uses of the phrase “Was der Daus!” in German from the 16th century, which has my money for being the real origin of the phrase.  Daus meant “devil,” though the modern German word is Teufel.  Deuce also means “two” and comes from the French deux.  Supposedly, the combination of the German phrase and the playing of dice led to the phrase entering English usage: rolling a two (the Devil’s eyes) inspired the curse, since that was the lowest score and therefore a loss.  I’m not sold on this particular coincidence; it seems too much like the folk etymology you hear in email forwards.  Lastly, while I enjoy Family Guy enormously when I catch it, I very seldom get the opportunity to watch an episode, so the tie to Stewie was lost on me until Google unearthed it.

And when OpenEphyra is given the question What is the origin of the word deuce?, the answer is “Watkins.”  It offers as evidence this page.  That page poses the question What does the word deuce mean?, but the answer has nothing to do with my information need.  Also, the word Watkins never even appears on that page, so I have no idea where it came from.

In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language, so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump.  In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task.
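As a rough illustration of the string-matching flavor of these approaches, here is a sketch of two common surface-similarity measures applied to candidate word pairs: normalized edit distance and the Dice coefficient over character bigrams. The word pairs are my own examples, and any cutoff you might choose to call a pair “loose cognates” is illustrative, not from any published system:

```python
# Two surface-similarity measures commonly used for loose-cognate
# detection on plain text: normalized edit distance and the Dice
# coefficient over character bigrams.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """1.0 for identical strings, lower for more dissimilar ones."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def dice(a, b):
    """Dice coefficient over character-bigram sets."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0

# English/German night/Nacht and hound/Hund are cognates;
# English dog and French chien are not.
for x, y in [("night", "nacht"), ("hound", "hund"), ("dog", "chien")]:
    print(x, y, round(edit_similarity(x, y), 2), round(dice(x, y), 2))
```

Both measures rank the true cognate pairs well above dog/chien, which is exactly the behavior string-matching approaches rely on, and also why they miss loose cognates whose surface forms have drifted apart.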