[Elsnet-list] Re: SMS text message project
jeff.allen at free.fr
Sat Jan 23 23:17:29 CET 2010
On Sat, Jan 23, 2010 at 9:53 PM, christopher taylor
<christopher.paul.taylor at gmail.com> wrote:
> there is an organization that is offering a service in which you can
> translate SMS/text messages - the data is public domain - it is a
> crowd source effort and it makes your ability to assist in haiti
> something that you can do from home.
> the data is reviewed by responders and b/c geospatial data is attached
> to the message, you can help gov'ts, NGO's, and aid groups provide
> appropriate responses to areas in need of specific types of aid.
Yes, I was made aware of this SMS project yesterday or the day before, and took
a look at the data.
I'll give my feedback on this from a different angle.
Imagine you were offered the content from thousands of unedited chat IM messages
that were rapidly translated in urgent mode, without any opportunity to ask
questions or get context, just to provide a rough gist of each message.
This is exactly what all human translators complain about in their
discussion forums. Go read the discussion forums at Proz and
Translatorscafe.com. Lack of context and understanding = inability to
translate the content, even for a living human being with social knowledge and
the ability to make 2 + 2 = 5.
Would you trust this IM chat content as training corpus for your baseline MT
On the other hand, I just got off the phone with a Haitian Creole content
provider and publisher who clearly agreed with what I wrote in a previous post
about the state of Haitian Creole texts found on the web, and the massive amount
of clean-up and editing work that is necessary to produce publishable content.
If a Haitian-born expert in the Haitian Creole publications field confirms that
unedited text found on the web is questionable as-is, then would you still use
It might instead be better to use SMS content later as a fine-tuning mechanism
to create spell-checkers, spelling normalizers, variable expression indicators
and a number of derivative scripts and applications. But not as a baseline
training corpus, especially when there are only 13,000 other translated sentences
to start with (Still trying to see how much more can be made available).
And please do not try to create an MT system based on all that unknown content
and give it as-is to human Haitian Creole translators to start helping you
improve the engine.
This will simply reinforce the already very negative attitude of human
translators toward MT and maintain MT's bad reputation. The users here are not
grad student guinea pigs for research projects. They are people working on
overstressed Disaster Relief projects in which the number of translation
requests is sky-rocketing above the human translator bandwidth. I just saw a
call for Creole human translators in that there are 200 messages for emergency
services sitting in the queue with no one to translate them.
It's like cyclists who are on the Tour de France climbing through the mountains.
The sponsor promises a 27-speed bicycle with the gear ratio of cogs that can
walk up a wall. And then delivers a 3-speed bicycle and says "Get ready, get
set, go" with no info that it is a basic 3-speed.
Or, closer to our situation, take the analogy of the tsunami in 2004. Would
any of you have thrown together a Stat-MT system in a couple of days based upon
the language resource databases for Indonesian that were available at that point
Kirti and Dion, you guys are the experts on Asian-language area MT needs. Given
your expertise on this in real-world translation projects, what would you have
done at that point in time with the content that was available back then? Sure,
the Stat-MT methods have improved a bit, but the question here is content, type
of content, quality of content, and whether it would be appropriate to use it in such
critical relief contexts based on a poor data set.
I know what I think about it, based on experience. How about you guys and
Let's provide a Haitian Creole system to real users based on a good training
This is about meeting critical communication needs.
Within a few extra days, maybe a week or two of time, there could be a high
enough volume of clean, quality content that could create a very good SMT based
And now I'm explaining to this Creole content provider the types of licensing
options that correspond with their massive amount of content that could be made
available. It sure helps to have previously worked at the European Language
Resources Association / Distribution Agency (ELRA/ELDA) and to have given a
couple of dozen presentations and talks at conferences on language data
distribution issues and licensing schemes (talks/papers downloadable at
my LinkedIn profile). This assists in providing the necessary advice for the
kinds of content that are being considered for distribution.