Fighting over the Nature of the Indus Valley Symbols

Posted on May 4, 2009 | 36 Comments

The most competent research is usually done by the scientist with the best right hook and feint.

If you are talking about research related to the Indus Valley symbols, that is.

A Brief Introduction

As most of us know, the Indus Valley civilization existed for more than half a millennium, about 4500 years ago. Its people were a rather sophisticated urban society and left behind quite a few clues about how they lived, but not enough to settle the most puzzling questions about the ancient history of the Indian subcontinent. They disappeared suddenly, almost overnight in historical terms, after nearly 700 years of city-dwelling over an extensive region stretching from modern Af-Pak to Madhya Pradesh.

Of all the toys they left us (pottery, bricks, bath complexes, drainage systems, statuettes, etc.), the most intriguing are the thousands of tablets bearing symbols and illustrations. Ever since the tablets were excavated in the late 19th century, archeologists and linguists have attempted to explain what these symbols mean.

The two most famous successes in deciphering ancient scripts have been the decipherment of the ancient Egyptian script and of Linear-B from Mycenaean Greece. The key to deciphering the Egyptian script was the discovery of the Rosetta stone, a set of inscriptions in which Egyptian-language script was written alongside its Greek translation. Linear-B was deciphered thanks to the fact that the symbols represented a pronunciation and syntax almost identical to ancient Greek. Unfortunately for those studying Indian history, the Indus equivalent of a Rosetta stone is yet to be found. It is unlikely that a Rosetta stone will be found on this side of Af-Pak, as extensive inscriptions in Indian languages don’t occur till Asoka, a full one and a half millennia after the ancient Indus people disappeared. However, archeological evidence tells us that the Indus Valley civilization conducted trade with Mesopotamia and other ancient civilizations of the Near East. Perhaps one day we will come across a tablet with the Indus symbols written alongside the languages of ancient Mesopotamia. A Linear-B type decipherment, however, can be attempted only after a few problems described in the following paragraphs are resolved.

An example of the Indus symbols.

Cutting to the Chase

The most important problem yet to be resolved concerns the fundamental nature of these symbols. Do they represent a language, as, say, Devanagari does, or are they a set of non-linguistic symbols, like, say, a set of smileys? For over a hundred years this question was debated the way most scientific issues are: through rigorous research and conjecturing. Then, in 2004, a paper was published by Steve Farmer, an independent language researcher, Richard Sproat, a Professor of linguistics and EE from UIUC, and Michael Witzel, an Indologist from Harvard. I first came across the paper last year about the same time, following an interesting discussion with Vishwesha, who had referred me to an article that had appeared in the journal Science on 6 June 2008.

What struck me when I first saw the paper by Farmer, et al. (in the Electronic Journal of Vedic Studies) was the title, which appeared rather arrogant to me: ‘The Collapse of the Indus-Script Thesis: The Myth of a Literate Harappan Civilization’. From the title, I thought that the paper was perhaps a survey of all the published work thus far, and that this work overwhelmingly pointed towards the conclusion that the Indus symbols were non-linguistic and did not represent any underlying language. As I read on, I realized that the authors were talking about their own work, and proclaiming the ‘collapse’ of an opposing point of view based on their own rigorous and extensive studies.

Now, I am not competent enough in archeology/linguistics/history to comment on the academic merit of their paper. But what disturbed me were the ad hominem attacks included in the paper. The paper employed no subtlety in downplaying opposing points of view, and in discrediting researchers who didn’t toe its line. It repeatedly linked the personal/political opinions of authors from the opposing camp to their research. In addition to presenting arguments supporting their own work, the authors alleged that researchers who thought that the Indus Valley script represents a language were motivated by their political affiliations. They repeatedly used terms such as “Hindu nationalists” and “Dravidian nationalists”. I agree that the work done by some of the researchers in that field is trash, but one should counter bad or incomplete research through logical, coherent arguments that are not adulterated by accusations about the person’s intentions. I believe that when arguing about academic work published in peer-reviewed journals, it is unprofessional to question the intentions of the author, no matter how bad their research is. Such personal attacks are not permissible in the academic literature. It surprised me that this paper was published in a peer-reviewed journal. Perhaps because Michael Witzel is its editor :O ? I wonder if it does any good for the academic merit of one’s arguments to make personal accusations against other researchers in peer-reviewed journal papers!

That said, there are some on the other side of the argument, like N. S. Rajaram and N. Jha, who make similar allegations of ‘imperialist’ research. But few people take them seriously. They are not as scholarly as Farmer, et al., and their research is far from rigorous. Add to that the fact that their work is mostly presented in self-published books. However, with a sweeping generalization, Farmer, et al. aimed their barbs at all researchers who propose that the Indus symbols represent a language. My suspicion to this effect was confirmed by the recent academic back-and-forth between the Indus-script and the Indus-non-script camps.

Conditional Entropy and the Indus Script

The reputed journal Science, on 23 April 2009, published a paper by Rajesh P.N. Rao, an Associate Professor of Computer Science at UWash, Nisha Yadav and Mayank Vahia from TIFR, Hrishikesh Joglekar, R. Adhikari from IMSc, Chennai, and the reputed linguist Iravatham Mahadevan. The paper was titled, with a humility characteristic of most academic literature, ‘Entropic Evidence for Linguistic Structure in the Indus Script’. Any EE/CS graduate student will have enough exposure to Information Theory to understand the work presented in Rao, et al. Then again, I am not sure if I can competently judge the direct link they draw between conditional entropy and the nature of the Indus symbols. I am convinced (with some skepticism, though) by their argument that the Indus symbols represent a language. However, I believe it would have been good if Rao, et al. had included an analysis of the conditional entropy of actual non-linguistic symbol systems too, rather than just abstract ones. It would have helped convince archeologists/linguists, to some extent, of the ability of conditional entropy to discriminate between linguistic and non-linguistic sets of symbols. The basic premise of Rao, et al. is that symbols representing languages, like say Devanagari or English, form a Markov process with a regularity and sequential order that is somewhere in between that of repetitive non-linguistic symbols (what they refer to as Type 2 non-linguistic symbols) and random non-linguistic symbols (Type 1). The paper goes on to show, as in the figure below, that the conditional entropy of the Indus symbols follows a trend similar to that of scripts that represent a language (like Tamil or English) rather than that of non-linguistic systems. The sequential order is captured by modelling the symbols as a first-order Markov process with various transition probabilities. Markov processes are basically characterized by the fact that the future state depends only on the present state, irrespective of the past states. Symbols of a language are often modeled as Markov processes. Language as a Markov process is the opening sequence (like the opening crawl of Star Wars 😐) of Claude E. Shannon’s seminal paper ‘A Mathematical Theory of Communication’, in the sections ‘The Series of Approximations to English’ and ‘Graphical Representation of a Markoff Process’. As I understand it, the conditional entropy used in the paper by Rao, et al. is computed over the event ‘Symbol j follows Symbol i’, or ‘Symbol j occurs given Symbol i has occurred’. The data used was the 417-symbol Indus corpus of Iravatham Mahadevan.

Plot of conditional entropy of symbols of various languages. Image borrowed from the paper by Rao, et al. Science, 23 April 2009
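
To make the idea of ‘Symbol j follows Symbol i’ concrete, here is a minimal sketch in Python. It is not the code or the data of Rao, et al.; it is a simple plug-in estimate from adjacent-pair counts, run on a made-up four-symbol alphabet. The names `conditional_entropy`, `noisy_cycle` and `SYMBOLS` are hypothetical, chosen only for this illustration. It builds a rigidly ordered sequence (in the spirit of Type 2), a uniformly random one (in the spirit of Type 1), and a first-order Markov chain that sits in between.

```python
import math
import random
from collections import Counter

SYMBOLS = "ABCD"  # hypothetical 4-symbol alphabet, purely for illustration


def conditional_entropy(sequences):
    """Plug-in estimate of H(next symbol | current symbol), in bits."""
    pair_counts = Counter()  # how often the pair (i, j) occurs, j right after i
    cur_counts = Counter()   # how often symbol i occurs with some successor
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            pair_counts[(cur, nxt)] += 1
            cur_counts[cur] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for (cur, nxt), n in pair_counts.items():
        p_pair = n / total                      # estimate of P(i, j)
        p_next_given_cur = n / cur_counts[cur]  # estimate of P(j | i)
        h -= p_pair * math.log2(p_next_given_cur)
    return h


def noisy_cycle(length, follow_prob, rng):
    """First-order Markov chain: with probability follow_prob step to the
    next symbol in the cycle A->B->C->D->A, otherwise pick one at random."""
    seq = [rng.choice(SYMBOLS)]
    for _ in range(length - 1):
        if rng.random() < follow_prob:
            idx = (SYMBOLS.index(seq[-1]) + 1) % len(SYMBOLS)
            seq.append(SYMBOLS[idx])
        else:
            seq.append(rng.choice(SYMBOLS))
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the run is repeatable
rigid = "ABCD" * 1250                                           # fully ordered
random_seq = "".join(rng.choice(SYMBOLS) for _ in range(5000))  # no order
in_between = noisy_cycle(5000, 0.7, rng)                        # partial order

h_rigid = conditional_entropy([rigid])
h_between = conditional_entropy([in_between])
h_random = conditional_entropy([random_seq])
print(f"rigid: {h_rigid:.3f}  in-between: {h_between:.3f}  random: {h_random:.3f}")
```

With four equally likely symbols the estimate cannot exceed log2(4) = 2 bits, so the rigid sequence lands at zero, the random one close to 2, and the Markov chain somewhere between. Note that this toy only shows what the measure computes; by itself it says nothing about whether a sequence that lands in the middle is linguistic.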

A footnote: Entropy, as you may know, is, simplistically speaking, a measure of disorder/uncertainty/randomness. Consider two discrete random variables X and Y. The conditional entropy is a measure of the uncertainty/randomness that remains in the value of Y once the value of X is known, i.e., given that the event ‘X = x’ has occurred (averaged over the possible values x). If Y is completely determined by X, then there is no uncertainty in ‘Y given X = x’, and the conditional entropy will be zero. On the other hand, if X contains no information about Y, then ‘Y given X = x’ is just ‘Y’, and the conditional entropy is equal to the simple entropy of Y.
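
In symbols, the footnote can be written as follows. This is the standard textbook formulation, not notation taken from Rao, et al.; p denotes the joint, marginal and conditional probabilities of X and Y, logarithms are taken to base 2 so that entropy is in bits, and 0 log 0 is taken to be 0.

```latex
\begin{align}
H(Y \mid X) &= \sum_{x} p(x)\, H(Y \mid X = x)
             = -\sum_{x}\sum_{y} p(x, y)\, \log_2 p(y \mid x), \\
Y = f(X) &\;\Rightarrow\; p(y \mid x) \in \{0, 1\}
          \;\Rightarrow\; H(Y \mid X) = 0, \\
X \perp Y &\;\Rightarrow\; p(y \mid x) = p(y)
          \;\Rightarrow\; H(Y \mid X) = -\sum_{y} p(y)\, \log_2 p(y) = H(Y), \\
0 &\le H(Y \mid X) \le H(Y) \quad \text{in general.}
\end{align}
```

The second and third lines are the two extremes described in the footnote: Y fully determined by X, and X carrying no information about Y.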

This is a common problem in cryptanalysis, where you try to determine if a sequence of symbols is a language.

So all’s well – an interesting paper with potentially groundbreaking work published in a reputed journal like Science, etc. Now Farmer, et al. have come up with their refutation of Rao, et al. The refutation is again characterized by ad hominem attacks against other researchers, and the outright dismissal of opposing viewpoints! A technicality: Farmer, Sproat and Witzel fail to note the novelty of the sequential modeling. I quote a few lines from the acerbic and rather grating refutation by Farmer, Sproat and Witzel below. Your opinions, please:

If the paper (by Rao, et al.) had been properly peer reviewed it would not have been published

Why are Farmer, Sproat and Witzel (FSW) so bellicose in defending their work? If their work has academic merit, there is absolutely no reason for them to cry this way. I am surprised and disturbed. Is this the kind of mudslinging that characterizes research on Indian history? Now, I am not alleging ‘imperialist’ suppression of our history, but seriously, why do FSW write like angry little kids who have just been told that they may be wrong?

It is a matter of concern that attempts at conducting objective research about Indian history are scuttled this way- by reputed scientists like Farmer, Sproat and Witzel writing in the language of politicians.

As an EE guy, to me, Rao, et al.’s conclusions are rather convincing. Perhaps an archeologist/linguist would think otherwise. There are many who do; for criticisms of Rao, et al., please refer to: Mark Liberman of UPenn, Steve Farmer’s condescension.

Here is Iravatham Mahadevan’s article in the Hindu (which was actually published as I was writing this blog).

Perhaps there are more blogs which support Rao, et al.’s work. I will put up links as and when I find them.

This entry was posted in Indian History, Indus Symbols, Indus Valley Civilization, Mathematics and tagged Indian History, Indus, Indus Script, Indus Symbols, Indus Valley Civilization.

36 responses to “Fighting over the Nature of the Indus Valley Symbols”

Katri | May 4, 2009 at 7:38 pm

Kickass post machi. Summed up most of what I had in mind when I was reading the works of Farmer et.al. Could not help but link it to western imperialist rhetoric.
Mahesh | May 4, 2009 at 7:47 pm

Thanks Katri! The work of Farmer, et al. are in fact behaving no different from researchers on the extremist fringe when it comes to their research!
ramanan.pg | May 5, 2009 at 5:52 pm

Consider these:
Aryan invasion theory is no longer valid.
So Vedic culture is indigenous.
The horse, which is an alien animal to ancient India.
Horses were originally from the steppes mountain range.
Rg Veda, the most ancient Aryan scripture, is full of horse references, including ‘ashwamedha yaga’.
‘Horse play’, a manipulation of facts and historical records by Rajaram and Jha to corrupt historical study.
Pastoral Aryans and city dwellers of Harappa.
Indra, the destroyer of forts and dams, and the Harappan civilization.
Pastoral landlocked Aryan tribes and the meaning of ‘samudra’.
The taboo of crossing the sea/ocean in Sanskrit.
Sea trade by non-Aryan culture.
Pottery and axe celt from South India.
Agglutinative nature of Harappan script.
Sumerian ‘Oor’ and ‘Nippur’ to places with the suffix ‘Oor’ in Dravidian.
Proof of mother goddess worship in Harappa and its equivalent in south and eastern India, by native people.
Rg Veda owes more than 30% of its word origins to DEDR (Dravidian Etymological Dictionary).
AnSVad | May 6, 2009 at 2:47 am

Very interesting post man… Nice read.
mouli | May 7, 2009 at 3:50 pm

Hi Mahesh
It is interesting reading, but you should consider it a clash of computer scientists deciphering vs. linguists. It is typical turf-guarding. However, the deeper malice is Indian history as perceived by Britishers, Congressmen, Commies and now overzealous Hindus. I will write more about it later.
Shivakumar Jolad | May 8, 2009 at 1:12 am

Super analysis and criticism. Arguments by Rao et al. seem very convincing. It is quite disgusting to see the acerbic remarks of Farmer et al. Ego supersedes merit for some academicians.
Vijay Shankar | May 8, 2009 at 6:32 am

Superb Analysis Macchi! Michael Witzel’s main tactic over the years has been the use of bellicose language against opposing points of view. If you read Koenraad Elst’s many hypotheses on various facets of Indian history, you’ll know what I am talking about! I wonder why they quote him so extensively and why they don’t quote people such as Koenraad Elst!
Arun | May 11, 2009 at 4:16 pm

Please read this too:
http://horadecubitus.blogspot.com/2009/04/indus-what-did-rao-et-al-really-do.html
Mahesh | May 11, 2009 at 8:33 pm

@Arun: Thank you very much for the link. The comments section of that blog is a very interesting debate between the author, Mark Liberman, Richard Sproat and others.
Richard Sproat | May 25, 2009 at 5:45 am

I guess my fundamental question about the Rao et al paper all along has been: why did they not conduct a proper study comparing lots of languages and several non-linguistic symbol systems?

I am of course skeptical that conditional entropy alone could tell us much (as are Mark Liberman and Fernando Pereira, two other computational linguists who, unlike me, certainly have no axe to grind on this issue).

But even if it were, how can one make any assertions about the similarity of an unknown x to two populations Y and Z, without sufficient samples of Y or Z? You can’t, obviously, and this would seem to come down to basic science, the kind we all learned in high school.

Presumably Science (and here I am referring to the journal), would not have published a paper from another field, if it was this thin on substance. But for some reason they seem not to apply the same standards to papers having to do with linguistics as they do to other fields. This was also Pereira’s complaint. There are other examples besides the paper by Rao.

I mean: would it not have been reasonable to have maybe, like, just one computational linguist review it? I would hope that they would not accept a paper in endocrinology without having a qualified endocrinologist review the paper. My guess is that they probably had a few archaeologists (experts on the Indus civilization) review it: but what would they know about computational methods like these?
Richard Sproat | May 25, 2009 at 6:04 am

By the way, I’m not sure how one should interpret this statement: “The work of Farmer, et al. are in fact behaving no different from researchers on the extremist fringe ”

When I think of the “extremist fringe” I think of the work of Jha and Rajaram, and the Harappan horse. That was a clear case of fraud. I hope there is not an implication that we have been engaging in fraud.
Richard Sproat | May 25, 2009 at 6:34 am

I particularly do not understand this comment:

“Summed up most of what I had in mind when I was reading the works of Farmer et.al. Could not help but link it to western imperialist rhetoric.”

Western imperialist rhetoric? Where is that coming from? Actually most of the scholars we criticized in our 2004 paper were also Westerners. And wasn’t it Marshall, a part of the British imperial establishment, who first proposed the idea that the Indus was a massively literate society? If we were attacking Marshall, which we were, couldn’t it be argued we were anti-imperialist?

Or is it imperialist to suggest that the earliest Indian civilization may not have been literate after all?

I’m thoroughly puzzled. I have no sympathy whatever with imperialism. But I also have zero tolerance for post-colonial rhetoric that brands as imperialist anyone who doesn’t toe a strictly defined party line.

Nor do I have any sympathy with the nationalist sentiments that seem to come to the fore here. Not that I believe for a minute that, for example, Rao et al were motivated by such considerations. I think Steve tends to jump to conclusions too quickly on that front. But as this blog acknowledges, nationalism has been a factor. But whatever: it’s silly.

Actually why this is even a nationalist issue for some people is beyond me. Yes, I know what the extreme Hindu right wing would like to claim. But when it comes down to it, you have a civilization that lived thousands of years ago, disappeared for reasons unknown, and their only known link with modern Indians is the fact that they happened to occupy part of the same territory. For all we know, maybe they all upped and left and moved to northern Europe, and in fact they are my ancestors.

Okay, I’m being facetious, but in fact given what we know about such things as migrations or genocide in historical times, it is possible to contemplate many fates for the Indus people.

Of course maybe there’s some DNA evidence that bears on this issue.
Arun | May 25, 2009 at 2:30 pm

If Witzel, Farmer and Sproat could have addressed the content of Rao et al. without bringing in nationalism, “garbage in, garbage out” and other such nonsense, no one would have a bone to pick with you. Having observed Witzel on the web since the mid-90s, and now his hangers-on, I am profoundly unimpressed by them; they lack class.
Arun | May 25, 2009 at 2:32 pm

Rao & co have a response to W,F & S here (PDF file)

http://www.cs.washington.edu/homes/rao/IndusResponse.pdf
Mahesh | May 25, 2009 at 2:37 pm

Professor Sproat,

Thank you for the clarification that you do not believe that the academic work of Rao, et al. is motivated by political considerations. I formed my opinions entirely based on the three papers – ‘The Collapse of the Indus Script Thesis…,’ ‘Entropic Evidence of…’ and ‘Refutation of the Claimed Refutation of…’.

I want to clarify that, when I say “behave like the extremist fringe,” I am only referring to the aggressive defense that is often characteristic of the work of the extremists.

That said, my only concern is about accusing a fellow academician’s peer-reviewed work (irrespective of its merit) of being driven by ideology. Shouldn’t scientific work be countered based on its merit and ideas, rather than the alleged political affiliations of the authors? Such ad hominem attacks and bellicose rhetoric are characteristic of the ‘publications’ of the extremist fringe. I want to state in no uncertain terms that I am *not* in any way implying that your work is fraud. Far from it, I believe beyond any doubt that it is honest and rigorous. I am only commenting about the ‘behavior’- the personal attacks, and the discrediting of other researchers.

I don’t understand the need to accuse other researchers of having political agendas.

These allegations were actually made in a peer-reviewed journal, and effectively taint all research done by the authors Rao, et al. You accuse Iravatham Mahadevan of being an ‘Indian nationalist’. How does it strengthen the argument in any way, if one points out the political affiliation of Mahadevan? Is it essential to refuting their work?

Iravatham Mahadevan is the sixth author, and in all likelihood, his contribution would have been minimal, considering that his prior publications indicate that he perhaps has no expertise in information theory. To me, none of the earlier papers published by the first five authors (Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar and R. Adhikari) seems to indicate that they have any political affiliations at all. So why accuse them of guilt by association (with Mahadevan)?

I would like to say that I would have had absolutely no issues, *if* the Indus-symbol-publications of FSW had been purely academic, without any references to the political affiliations of other authors. What *is* wrong with a researcher being an Indian nationalist? More importantly, what is the definition of the term ‘Indian nationalist’? I would understand terms like ‘Republican’ or ‘Democrat’ because there is a Republican Party, a Democratic Party, and there are registered affiliates to each party, giving you a rigorous definition of the two words. But what does the term ‘Indian nationalist’ mean? If it is being used in academic work, shouldn’t it have a rigorous definition? Shouldn’t one present rigorous arguments, admissible in academic research, that establish beyond doubt that any article with Iravatham Mahadevan as an author is motivated by his political affiliations?

The linking of academic research with political affiliations reminds me of the Soviets- who routinely discredited scientific research from the west by associating it with political ideologies. A startling example is how the chromosomal theory of inheritance was trashed for being anti-Marxist.

Why can’t researchers of Indian history just set aside all this talk of political affiliations? That is one of the things that I like about the paper by Rao, et al.- there are no personal attacks or allegations, just a continuous stream of hypotheses, simulations and conclusions. Whether their arguments are right or wrong will be decided over the course of time.
Mahesh | May 25, 2009 at 2:48 pm

I have mentioned in this post that I *do not* think that the papers by FSW are “imperialist”.
Richard Sproat | May 25, 2009 at 4:02 pm

Okay thanks Mahesh, that clarifies things a lot. I appreciate your feedback. And I agree with you completely that politics has no place in science. I also appreciate your clarification about the fraud issue: I am sorry for thinking that is what you might have meant, but in the context it was not completely clear to me.

I am myself not interested in the political issues in this arena. I do understand where Steve and Michael are coming from, insofar as they were roundly attacked — even verbally threatened — by right-wingers after their “Horseplay” paper appeared in Frontline. But that is their baby, not mine.

You could argue that as a coauthor I have some responsibility, and you would be right. All I can say is that I do sometimes try to moderate what Steve writes when it comes to the political stuff, but I am not usually successful.

As to what was said to the press Arun, I wonder if any of you noticed that the press interviewed Steve and they interviewed Michael, but hardly anyone from the press bothered to contact me — which is odd since of the three of us, I am the one most qualified to comment on the Rao et al paper. In any event, I take no responsibility for how Michael or Steve phrased their responses to reporters. Had they interviewed me, I would have told them what I wrote here, namely that I think that the Rao et al paper fails the test of basic scientific rigor, for the reasons I already gave.

Finally, since Arun also notes the link to Rao et al’s response, I will just point out a couple of things. A lot of the issues that he raises in that response are discussed elsewhere: for example, the supposed 26-glyph inscription that figures in one of the plots they use is based on an inscription that is on multiple sides of a polygonal piece, which nobody can argue constitutes a single text.

But the most bizarre aspect of that response for me is the restatement of the purpose of Types 1 and 2. In the original paper, these were billed as representative of two important classes of non-linguistic symbols — though the evidence they give that there ever were non-linguistic symbols that were completely rigid or completely random, is suspect.

Now he’s saying these were just done to set the bounds of the distribution.

So which is it?

In any case, under either interpretation, it tells us nothing that the Indus symbols fall between these two extremes and in the “linguistic” range, for the simple reason that without sampling the space of non-linguistic symbol systems, we don’t know whether there are other non-linguistic symbol systems that also fall in that space. And we don’t know how broad the “linguistic” range is because they have picked such a small number of languages.

–R
Richard Sproat | May 25, 2009 at 5:01 pm

I’d like to also point out some of the things that are not being discussed much or at all in these debates. Their omission from almost any discussion — not just on this particular forum, but anywhere — is odd, and may say something about the sociology of this whole area. I happen to think they are the real questions that people should be asking.

First, as Mahesh says, time will tell who is right. But I think that it’s crucial to understand that while it could in principle turn out that Rao et al are right (and we are wrong) about the script hypothesis, it is doubtful that their line of argument could itself be right. Obviously I need to explain that: simply put, as I think has been said already elsewhere, statistical arguments are never going to decide this issue. The only things that would decide it are:

1) A very long text, which would make the non-script hypothesis look very implausible.

2) A clear example of a Rosetta stone.

3) A verifiable decipherment.

As far as purely statistical arguments, or the kinds of structural analyses that Rao and the Tata Institute team plan: well these have been done before. The Finnish team did some very detailed structural analyses 40 years ago. They discovered that the texts do indeed have a structure. They even inferred a grammatical analysis over the texts. So far so good. But their mistake was assuming that something that has a grammar must ipso facto be a (natural) language. In fact lots of things have grammars: mathematical equations, music notation, heraldic symbology… All of these can be described using grammars, and none of them are themselves linguistic in the sense Rao et al intend. So my guess is that Rao and his team will find lots of structure. But what will that tell us?

Second, as long as the texts we continue to dig up are the short cryptic texts that so far comprise the entire corpus, people need to understand that the chances of decipherment for such a corpus are slim. Even if we knew the underlying language, there are so many possibilities for the mapping that it is virtually impossible that one will come up with a decipherment that will convince anyone else. What is not discussed in these fora is the fact that 40 years ago Parpola and his team announced that the Indus code had been “cracked”. Since then there has been very little follow up, and what there has been seems quite implausible, e.g. his attempts to link some bangle-like symbols to Murukan, across a gulf of a few thousand years. In any case, his most famous proposal, the astronomical fish series, fails to convince for the simple reason that apart from his own claims about what these symbols encode, there is zero independent evidence that the texts involved had anything to do with astronomy. So one thing lacking in this whole discussion is any sober assessment of the likelihood that, even if it were a script, we’d have a chance of succeeding in decipherment. The outcome of what is usually considered the most “plausible” of such attempts should give us pause.

Finally, turning the tables, the oddest thing for me is that few (or maybe no-one, because I haven’t actually seen this question asked) have bothered to ask: okay Farmer et al, so if it isn’t a script, then what is it? Can you provide a *testable* theory of what it was? Everyone seems to be so focussed on the nature of the debate, and the script issue, that nobody seems to be asking the obvious question. Well, Farmer thinks he knows what it is (an agricultural religious symbology — he’s been working with some paleobotanists on this), but how would you come up with a falsifiable model? I don’t have a good answer myself (yet), but what’s interesting for me is that nobody is asking this question.

Anyway, these are some of the issues that should be figuring in this discussion, but aren’t because people have been focusing on the rhetoric and not on the scientific aspects of this issue.
Gowtham | May 25, 2009 at 10:03 pm

Holy cow! This is some post – given that I don’t have much of a background in this topic [apart from reading this as well as an in-person discussion some weeks ago with the blog post’s author], it made for a good review.

Seeing that Professor Sproat himself has taken time to comment on [and post follow up comments] tells me I should probably read all the associated/linked papers/publications before coming up with some thoughts of my own.

From little personal experience that life has given me, I have had to face some attacks during manuscript review process (during the good ole grad school days) that linked my writing & research to my ethnic background.

While it is naive to assume that the world will be without ego and prejudice, maybe the Editor-in-Chief of (reputed) journals owes it to him(her)self as well as to the rest of the reading community to make sure personal conflicts don’t show up in print.
Arun | May 27, 2009 at 12:20 pm

I think Rao et al have opened an interesting line of investigation.

Consider entries in a telephone directory, each entry considered to be equivalent of one inscription. They are of limited length, and differ from language in general in that they are mostly names, and addresses. Can conditional entropy distinguish these from a regular text (say from a novel)?
Arun | May 27, 2009 at 12:29 pm

One might expect that the statistics of written language ultimately derive from the constraints the human vocal system puts on what can be said, and thus on what is written. In this context, it would be interesting to repeat Rao & co’s analysis on –

– birdsong represented in symbols
– written representations of music
– digitized audio streams (e.g., every 8 bits defines a symbol)
– written representation of Bushman click language

Other things to look at
– messages in a text-based protocol such as SIP
– (previously mentioned) telephone directory
Richard Sproat | May 27, 2009 at 4:05 pm

Yes you will get a different profile for text from telephone directories. You will also get a different profile (I have actually done this) if you look at, say, Chinese newspaper headlines versus general Chinese newspaper text.

This is an important point: it underscores the fact that the notion that a single measure like conditional entropy can capture what is essential about “language”. It depends heavily on many factors: the granularity of what you are measuring (Rao et al. show this by comparing English words versus English letters, but of course this has been known since Shannon), the genre of text, and so forth. Given all those considerations, and understanding them, one cannot help conclude that the similarity between the curves of Indus and of Old Tamil that they seem rather happy about must simply be coincidence. (Actually it’s not clear they are really so similar.)

On Arun’s second message: yes it would be informative to try a whole range of things. I’d throw in a few other non-trivial man-made non-linguistic systems: dance notation, math, heraldry. I think at the end of the day you will find that this doesn’t help us distinguish linguistic from non-linguistic things, nor will linguistic texts fall into a nice narrow band as Rao et al. imply.

One correction: Bushman (more properly !Kung San) is not going to be structurally unusual as a human language. It will of course show its own profile, depending of course very heavily on what you choose to sample. But your inclusion of it in this list suggests that you might think it is radically different from human languages in general and it isn’t.
Richard Sproat | May 27, 2009 at 4:07 pm | Reply

Oops, incautious editing:

This is an important point: it underscores the fact that the notion that a single measure like conditional entropy can capture what is essential about “language”

should read:

This is an important point: it underscores the fact that the notion that a single measure like conditional entropy can capture what is essential about “language” is ill-conceived.
Mahesh | May 27, 2009 at 4:29 pm | Reply

Professor Sproat,

1. I agree with you that arguments about this paper often fail to discuss the issues that you talk about. In my view, the reason for all the rhetoric in discussions about the Indus script is a result of the aggressive language and unnecessary allegations that occur in “Collapse of…” Were it the case that “Collapse of…” and “Refutation of…” *only* argued on the academic merits of other papers, there would have been little, or no rhetoric in academic discussions related to this topic. The personal allegations are distractions to the academic issues that have been discussed.

2. You mention prior efforts of applying statistical measures to establish the nature of a set of symbols… As far as I am aware, this is the first attempt which works with a sequential/Bayesian model in assuming an underlying Markov process. Again, I do not claim that this statistical measure is effective. But when it is repeatedly claimed that a single statistical measure cannot represent language, shouldn’t one consider the fact that the approach by Rao, et al. uses a Bayesian measure, while earlier work mostly used arguments based on character frequency? It is a significant departure. As most people in EE/CS would agree, Bayesian approaches are often very powerful compared to non-Bayesian approaches. This is perhaps the novel contribution of Rao, et al. Given the nature of spoken language, one would naturally expect that sequential models may accurately represent written representations that encode speech.

My question is, considering the obvious sequential structure in symbols that encode language, have conditional statistics ever been used in this field? Since Rao, et al. and other articles that talk about their work do not cite any references to past literature in conditional statistics applied to linguistics, I am assuming that Rao, et al is the first paper to use conditional probability based methods to characterize written symbols.

If conditional statistics haven’t already been applied to analyzing written symbols, then when applied they are indeed expected to yield more information than non-Bayesian statistical measures such as character/bi-gram frequency. Whether this information is

These are the two conclusions that we have:

1. As mentioned in “Collapse of…,” non-Bayesian analysis of symbol frequencies does not necessarily indicate the linguistic nature of a set of symbols.

2. The paper proposing Bayesian analysis of symbols is not entirely convincing, but does indeed make a strong case for using such methods to study symbols and their underlying meanings.

Based on these two conclusions, can we dismiss Bayesian methods as being ‘yet another statistical measure’? I feel what is needed to counter Rao, et al. is a contradiction, where a set of non-linguistic symbols falls in the so-called ‘linguistic region’ of the conditional entropy plots. There is no point in fighting over straw men constructed by both sides. One can pretty much apply the same arguments against the Type-1-Type-2 results in Rao, et al. and the simulations by Sproat, Liberman, Shalizi and others. Shouldn’t we accept the results presented in Rao et al. with skepticism, rather than outright dismissal?
Mahesh | May 27, 2009 at 4:33 pm | Reply

Correction, I forgot to complete the line:

“Whether the extra information we gain from Bayesian methods is useful or not, is an important issue to ponder.”
Arun | May 27, 2009 at 6:54 pm | Reply

Re: Bushmen language, and other examples – since my speculation was that the nature and sequence of sounds that the human vocal apparatus can emit is what ultimately underlies the statistical regularities in the written language (isn’t the written a representation of the spoken?), it is of interest to look at other systems of sounds. !Kung San may not differ structurally from other languages, but I thought that the vocalizations are certainly different. That is also why bird song, music, and digital sound are included there.

The telephone directory and text-based protocols go in a different direction, where I am speculating that the short Indus inscriptions bear the same relationship to the Indus language as these short strings do to English (say).
Richard Sproat | May 27, 2009 at 7:34 pm | Reply

I don’t have much time right now so I’ll keep my responses brief, and possibly not answer all the questions now.

Conditional entropy specifically has not been used before, but the Finnish work certainly used sequence information. That’s how they were able to induce structure. They used techniques due to Zellig Harris, which can be thought of as early instances of language modeling.

Note for the record that we would have done the same in our paper, but we did not have access to an Indus corpus, just the frequency distributions (from Mahadevan and Wells). Actually you should ponder the fact that the Indus researchers have kept a tight lid on their electronic corpora, in the case of Parpola for 40 years. One wonders why they have not been more open in allowing other researchers to use them. This is completely different from, say, the Phaistos Disk, where you can download the texts from Wikipedia, or the Easter Island symbology, where http://www.rongorongo.org has made Barthel’s encoding of the corpora available for some years. You can also find corpora of Linear A online. Only the Indus stuff has been so tightly controlled. Anyway, we probably would have done something similar to what Rao et al did had we had the corpora. But the basic point anyway is that the Finnish team certainly used more than unigram statistics in their work.

As for most people in EE/CS agreeing that Bayesian approaches are a significant departure … Maybe, but the question here is whether they would agree that it’s a useful measure for determining whether something is linguistic or not. Pereira, who is about as good a computer scientist as you could get, clearly doesn’t think so. But I admit to not having taken a poll.

On this: “I feel what is needed to counter Rao, et al. is a contradiction, where a set of non-linguistic symbols falls in the so-called ‘linguistic region’ of the conditional entropy plots.”

Yes well actually I have such a plot that shows this for European Heraldic symbology. I’ll publish it soon. My only caveat is that it is based on blazon, which is a formal language used to describe the symbols and their arrangement. It *looks* like English (with lots of weird words), but it is in fact a formal language that is unambiguously convertible back and forth to the original herald. The plot also shows that the linguistic range is rather wide.

On Arun’s comment on Bushman. I see, yes the sounds are quite different, though lots of languages (not related to !Kung San) in the region also have these clicks. For example the Bantu language Xhosa. (That “Xh” in the name is in fact a click.) But the real issue would be the distribution of the sounds, and while I don’t know this for sure since I have never seen a study on this, I would seriously doubt they differ radically in their distribution and combinatorics from any other language that has roughly the same sized phoneme inventory and roughly the same syllable complexity as !Kung San.

But these are good points to be raising since if nothing else they serve to illustrate just how complex the issues are, and how unlikely it is that a single measure is going to answer the question.
Richard Sproat | May 28, 2009 at 12:12 am | Reply

Okay a little more time now.

Re: “I agree with you that arguments about this paper often fail to discuss the issues that you talk about. In my view, the reason for all the rhetoric in discussions about the Indus script is a result of the aggressive language and unnecessary allegations that occur in “Collapse of…” Were it the case that “Collapse of…” and “Refutation of…” *only* argued on the academic merits of other papers, there would have been little, or no rhetoric in academic discussions related to this topic. The personal allegations are distractions to the academic issues that have been discussed.”

Actually I don’t think that’s it, or at least it’s not a complete explanation. People have been discussing lots of putatively empirical issues. Unfortunately, in the main, these have been rehashes of stuff that has been dealt with before: Brahui (a Dravidian language) in Southwest Pakistan; the possibility of some massive literature on perishable materials; etc. All of these are old issues.

So it’s not that people have been so distracted by the rhetoric that they fail to discuss substantive issues. The problem is they don’t seem to be discussing the right issues.
Richard Sproat | June 2, 2009 at 4:29 am | Reply

As promised:
Arun | June 3, 2009 at 11:58 am | Reply

That is a great graph!
Richard Sproat | June 3, 2009 at 12:23 pm | Reply

Thank you Arun.
Mahesh | June 3, 2009 at 12:36 pm | Reply

Thanks Professor Sproat. The graph is very interesting. It will be interesting to see how Rao, et al respond to this.
Richard Sproat | June 3, 2009 at 12:39 pm | Reply

Yes. Perhaps we will find out.
Arvind Saibaba | April 23, 2010 at 9:02 pm | Reply

No offense to anybody involved but I think this is relevant here

http://xkcd.com/114/