Tech tip: Did I miss the Amara memo? Easy subtitling!

This article originally appeared on Hoyden About Town.

Cross-posted from my group tech blog, But Grace.

Amara (Universal Subtitles) is great stuff! Apologies to everyone for whom it is old news: I had heard of it before but not bothered to check it out, assuming it would be super-hard and fiddly. I really didn’t find it so.

How it works: you find a video on a popular video website (YouTube, Vimeo, Dailymotion, or, for that matter, a direct link to a video file in one of several codecs) that doesn’t have subtitles. You submit the URL to the Amara website and a tool opens up that lets you enter subtitles for the video, in three steps:

  1. type in all the subtitles a line at a time as you pause and restart the video (assuming you need to; professional closed captioners may not)
  2. sync the subtitles with the speech (by pressing a single key every time it’s time to start a new subtitle)
  3. review and publish

I am especially amazed at how easy it is to get a good-enough (I think?) sync of subtitle and speech when playing the video at full speed and just hitting the down arrow to advance to the next subtitle. Amara also provides embed codes that allow you to embed their subtitles with the original video in another webpage, which is crucial because I want to embed videos more often than I want to link to them. Finally, you can pull your subtitles out afterwards in text format, which means you can create a more complete transcript for separate publication.

Last of all, it is not a for-profit enterprise: it is a product of the Participatory Culture Foundation, and the Amara code is itself open source. So it is not hostage to a commercial motive but is genuinely created with the central motive of providing more subtitled video on the web.

It does have some limitations: most noticeably for me, the controls over rewinding are a bit coarse-grained (go back 4 seconds and… that’s about it) and they don’t seem to have a facility for slowing the video down, which can help me transcribe fast speech.

They have a short introduction video about themselves (subtitled!):

Video: http://vimeo.com/39734142

As a demonstration of what user-subtitled content looks like, here’s a subtitled version (not by me) of Karen Sandler’s keynote at linux.conf.au 2012, about medical devices and source code (in her case, trying to get the source code of her pacemaker):

Video: http://www.youtube.com/watch?v=5XDTQLa3NjE

The text version of the subtitles is also available.

Alternative

dotSUB (which I’ve never used either) is an alternative, reviewed positively in comments at FWD, from a for-profit company.


Why subtitle stuff? You can provide a translation into other languages, as most people are familiar with. But subtitling things into the written form of the language they’re spoken in is also very useful. Several reasons:

  • it makes the video accessible to hearing-impaired people;
  • it makes the video accessible to anyone who can’t listen to the sound right at that second; and
  • the existence of the text version of the subtitles makes the video at least more accessible to readers who can’t watch video or don’t have time to.

Sunday Spam: bagels, lox and smoked salmon

In belated honour of my breakfast in New York, Sunday July 8.

Baby Loss and the Pain Olympics
Warning for baby loss discussion.

I really have to question why seeing someone else processing their emotions is her pet peeve.

Do I believe a miscarriage and neonatal death is the same thing — of course not. If they were the same thing, they would share the same term. But just because I see them as apples and oranges doesn’t mean that I don’t also see them as fruit. They are both loss.

The deadly scandal in the building trade

Readers would not guess from the “national conversation” that the construction industry is sitting on a story as grave in its implications as the phone-hacking affair – graver I will argue. You are unlikely to have heard mention of it for a simple and disreputable reason: the victims are working-class men rather than celebrities… The construction companies could not be clearer that men who try to enforce minimum safety standards are their enemies. The files included formal letters notifying a company that a worker was the official safety rep on a site as evidence against him.

On Technical Entitlement

By most measures, I should have technical entitlement in spades… [and yet] I am very intimidated by the technically entitled.

You know the type. The one who was soldering when she was 6. The one who raises his hand to answer every question–and occasionally try to correct the professor. The one who scoffs at anyone who had a score below the median on that data structures exam (“idiots!”). The one who introduces himself by sharing his StackOverflow score.

Puzzling outcomes in A/B testing

A fun upcoming KDD 2012 paper out of Microsoft, “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” (PDF), has a lot of great insights into A/B testing and real issues you hit with A/B testing. It’s a light and easy read, definitely worthwhile.

Selected excerpts:

We present … puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain … [requiring] months to properly analyze and get to the often surprising root cause … It [was] not uncommon to see experiments that impact annual revenue by millions of dollars … Reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts.

When Bing had a bug in an experiment, which resulted in very poor results being shown to users, two key organizational metrics improved significantly: distinct queries per user went up over 10%, and revenue per user went up over 30%! …. Degrading algorithmic results shown on a search engine result page gives users an obviously worse search experience but causes users to click more on ads, whose relative relevance increases, which increases short-term revenue … [This shows] it’s critical to understand that long-term goals do not always align with short-term metrics.

Angels & Demons

One of the various Longform collections, and like many of them, a crime piece:

On June 4, 1989, the bodies of Jo, Michelle and Christe were found floating in Tampa Bay. This is the story of the murders, their aftermath, and the handful of people who kept faith amid the unthinkable.

On Leaving Academia

As almost everybody knows at this point, I have resigned my position at the University of New Mexico. Effective this July, I am working for Google, in their Cambridge (MA) offices.

Countless people, from my friends to my (former) dean have asked “Why? Why give up an excellent [some say ‘cushy’] tenured faculty position for the grind of corporate life?”

Honestly, the reasons are myriad and complex, and some of them are purely personal. But I wanted to lay out some of them that speak to larger trends at UNM, in New Mexico, in academia, and in the US in general. I haven’t made this move lightly, and I think it’s an important cautionary note to make: the factors that have made academia less appealing to me recently will also impact other professors.

Ethics, Culture, & Policy: Commercial surrogacy in India: A $2 billion industry

Since its legalization in 2002, commercial surrogacy in India has grown into a multimillion-dollar industry, drawing couples from around the world. IVF procedures in the unregulated Indian clinics generally cost a fraction of what they would in Europe or the U.S., with surrogacy as little as one-tenth the price. Mainstream press reports in English-language publications occasionally devote a line or two to the ethical implications of using poor women as surrogates, but with few exceptions, these women’s voices have not been heard.

Sociologist Amrita Pande of the University of Cape Town set out to speak directly with the “workers” to see how they are affected by such “work.”

More falsehoods programmers believe about time

Noah Sussman has Falsehoods programmers believe about time, including:

All of these assumptions are wrong

  1. There are always 24 hours in a day.
  2. Months have either 30 or 31 days.
  3. Years have 365 days.
  4. February is always 28 days long.
  5. Any 24-hour period will always begin and end in the same day (or week, or month).

As is usual with these kinds of things, he’s only scratching the surface (even though there’s a lot more than in that excerpt). Andrew and I came up with several more already, on the subject of timezones:

  1. All timezones are vertical lines around the globe, evenly spaced at 15-degree intervals.
  2. All timezones are a whole number of hours offset from UTC.
  3. All timezones are no more than 12 hours offset from UTC.
  4. Two cities within some sufficiently small distance must be in the same timezone.
  5. Two cities with the same longitude must be in the same timezone.
  6. A city further to the east of another city must have a time ahead of or equal to the more western city.
  7. There will only be one timezone within any political boundary.
  8. Within a sufficiently large political boundary, there will be different timezones.
  9. Timezone designations like ‘EST’ are unambiguous.*
  10. Daylight savings shifts occur on the same day around the globe.
  11. Or at least within a hemisphere.
  12. Or at least within a continent.
  13. Or at least within a nation.
  14. Daylight savings shifts occur on predictable dates announced ‘sufficiently far’ in advance that there can be an exhaustive listing of them accurate for the next couple of decades.
  15. Well, at least the next few years.
  16. OK, surely at least this month?

* Both Australia and the United States call their east coast timezone this in winter, and guess what: it’s never the same time in New York as it is in Sydney, and the daylight savings status is seldom the same either. (If you’ve seen Australians call it ‘AEST’, well, yes, we do. Sometimes.)

Useful LaTeX packages: linguistic examples

This is the conclusion of a short series of entries on LaTeX packages I found useful while preparing the examination copy of my PhD thesis.

Today’s entry is a package for displaying linguistic examples (ie, samples of text which you then want to discuss and analyse).  The LaTeX for Linguists Home Page is a good general resource for linguists and computational linguists using LaTeX. I discuss gb4e here because I had to do some messing around to get it to display example numbers the way I want (and the way my supervisor wanted: he likes in-text references to look like “example (4.1)” rather than “example 4.1”), and to get it to work with cleveref, and no one seems to have written that up to my knowledge.

gb4e

gb4e is a linguistic examples package.

\usepackage{gb4e}

Input looks like:

\begin{exe}
\ex This is an example sentence\label{example}
\ex This is another example sentence.
\end{exe}

This is a cleveref reference to \cref{example}.
This is a normal reference to example (\ref{example}).

You can mark sentences with * and ? and so on:

\begin{exe}
\ex[*] {This is an sentence ungrammatical.}
\ex[?] {This is an questionably grammatical sentence.}
\end{exe}

You can do sub-examples:

\begin{exe}
\ex This is an example.
\ex
\begin{xlist}
\ex This is a sub-example.
\ex This is another sub-example.
\end{xlist}
\end{exe}

A few things to do to make gb4e play really nicely. First, some cleveref config. gb4e doesn’t yet automatically tell cleveref how to refer to examples, so you need to tell it that the term is “example”. Second, if you want brackets around the number (“example (1.1)” rather than “example 1.1”), you need to tell it to use them:

% tell cleveref to use the word "example" to refer to examples,
% and to put example numbers in brackets
\crefname{xnumi}{example}{examples}
\creflabelformat{xnumi}{(#2#1#3)}
\crefname{xnumii}{example}{examples}
\creflabelformat{xnumii}{(#2#1#3)}
\crefname{xnumiii}{example}{examples}
\creflabelformat{xnumiii}{(#2#1#3)}
\crefname{xnumiv}{example}{examples}
\creflabelformat{xnumiv}{(#2#1#3)}

Also, by default, the gb4e numbering does not reset in chapters. That is, your examples will be numbered (1), (2), (3) etc right through a thesis. You probably want more like (1.1), (1.2), (2.1), (2.2), ie chapter.number. Change to this with the following in your preamble:

% Store the old chapter command so that
% our redefinition can still refer to it
\let\oldchapter\chapter
% Redefine the chapter command so that it resets the
% 'exx' counter that gb4e uses on every new chapter.
\renewcommand{\chapter}{\setcounter{exx}{0}\oldchapter}

% Redefine how example numbers are shown so that they are
% chapter number dot example number
\renewcommand{\thexnumi}{\thechapter.\arabic{xnumi}}

You could also get it to reset in sections by replacing \chapter and \thechapter with \section and \thesection in the above.
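
As a sketch of that substitution (only the swap described above, not tested against every document class or package combination; \oldsection is just a name chosen here for the saved command):

```latex
% Keep a copy of the original \section command so that
% the redefinition can still call it
\let\oldsection\section
% Reset gb4e's 'exx' counter at the start of every section
\renewcommand{\section}{\setcounter{exx}{0}\oldsection}

% Show example numbers as section number dot example number
\renewcommand{\thexnumi}{\thesection.\arabic{xnumi}}
```

If you go this way, the cleveref settings above should not need to change, since they only affect how the number is wrapped, not how it is built.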

Thanks to the TeX Stack Exchange community for their help with this. See Section based linguistic example numbering with brackets for more information.

Useful LaTeX packages: within document references

This is part of a short series of entries on LaTeX packages I found useful while preparing the examination copy of my PhD thesis.

Today’s entry is packages relevant to preparing within-document references. These are both fairly new to me, although not absolutely new.

hyperref

This package turns cross-references and bibliography references into clickable links in your output PDF (at least if you generate it with xelatex or pdflatex), without you having to do anything other than use the usual \ref (or cleveref’s \cref), \cite and similar commands.

\usepackage{hyperref}

You will probably want to modify its choice of colours to something more subtle:

\usepackage[citecolor=blue,%
    filecolor=black,%
    linkcolor=blue,%
    % Generates page numbers in your bibliography, ie will
    % list all the pages where you referred to that entry.
    pagebackref=true,%
    colorlinks=true,%
    urlcolor=blue]{hyperref}

Use black if you want the links the same colour as your text.

One note with hyperref: generally it should be the last package you load. There are occasional exceptions; see Which packages should be loaded after hyperref instead of before?

cleveref

cleveref is a LaTeX package that automatically remembers how you refer to things. So instead of:

see chapter \ref{chapref}

you use the \cref command:

see \cref{chapref}

It handles multiple references nicely too:

see \cref{chapref,anotherchapref}

will generate output along the lines of “see chapters 1 and 2”.

Use

\Cref{refname}

to generate capitalised text, eg “Chapter 1” rather than “chapter 1”.

To use it:

\usepackage{cleveref}

It shortens the word “equation” to “eq.” by default. If you don’t like that, then:

\usepackage[noabbrev]{cleveref}

For some packages that don’t yet tell cleveref how to refer to their counters, you will get output like “see ?? 1” rather than “see example 1”. You use the \crefname command in the preamble to tell it what word to use for each unknown counter. Examples of \crefname will be shown tomorrow for gb4e.
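
For reference, here is a minimal sketch of a complete document using both packages from this entry. The labels chap:intro and chap:method are made up for illustration, and cleveref is loaded after hyperref because, per its documentation, it is one of the exceptions to the hyperref-goes-last rule.

```latex
\documentclass{report}

\usepackage[colorlinks=true,linkcolor=blue]{hyperref}
% cleveref is an exception to "hyperref goes last":
% it needs to be loaded after hyperref
\usepackage[noabbrev]{cleveref}

\begin{document}

\chapter{Introduction}\label{chap:intro}
Some text.

\chapter{Method}\label{chap:method}
% Lower-case form, with two references combined into one phrase
See \cref{chap:intro,chap:method}.
% Capitalised form for the start of a sentence
\Cref{chap:intro} sets the scene.

\end{document}
```

You will need to run LaTeX at least twice for the references to resolve, as with plain \ref.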

Useful LaTeX packages: tables and figures

This is part of a short series of entries on LaTeX packages I found useful while preparing the examination copy of my PhD thesis.

Today’s entry is packages relevant to preparing tables or figures. Again, some are pretty widely known and some aren’t.

rotating

If you have a big table or figure that should be rotated sideways onto its own page:

\usepackage{rotating}

And then you can replace the table and figure environments with:

\begin{sidewaystable}
%Giant table goes here
\end{sidewaystable}
\begin{sidewaysfigure}
%Giant figure goes here
\end{sidewaysfigure}

dcolumn

The dcolumn package produces tabular columns that are perfectly aligned on a decimal point (ie all the decimal points in that column are exactly underneath each other), which is usually how you want to display decimal numbers.

\usepackage{dcolumn}

% create a new column type, d, which takes the . out of numbers, replacing the .
% with a \cdot and aligning on it.
\newcolumntype{d}[1]{D{.}{\cdot}{#1}}

Now that you have defined the column type, you can use d in the tabular environment, where the numeric argument is the number of figures to expect after the decimal point. You don’t have to use exactly that number of figures in every entry; that’s just how much room it will leave.

% a tabular environment with a 1 and 3 figures after the decimal point column
\begin{tabular}{d{1}d{3}}
1.6 & 1.657 \\
2.0 & 6.563 \\
7 & 6.26 \\
\end{tabular}

One annoying aspect of this package is that for the headers of that column, which probably aren’t numbers, you will need to use \multicolumn to get them to display nicely.

% a tabular environment with a 1 and 3 figures after the decimal point column
\begin{tabular}{d{1}d{3}}
\multicolumn{1}{c}{Heading 1} & \multicolumn{1}{c}{Heading 2} \\
1.6 & 1.657 \\
2.0 & 6.563 \\
7 & 6.26 \\
\end{tabular}

You can mix the d column type with the usual l, r and p column types.
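
A small sketch of such a mix, assuming the d column type from the \newcolumntype line above is already in your preamble. The row labels and numbers are placeholders made up for illustration.

```latex
% l: left-aligned text, d{2}: decimal-aligned with room for
% two figures after the point, r: right-aligned
\begin{tabular}{l d{2} r}
Item & \multicolumn{1}{c}{Score} & Count \\
Alpha & 1.25 & 10 \\
Beta & 12.5 & 3 \\
\end{tabular}
```

Only the header of the d column needs the \multicolumn treatment; the l and r headers can be typed as normal.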

threeparttable

You can’t use \footnote in a floating table. This is one of several packages that allow table footnotes in various ways.

\usepackage{threeparttable}

threeparttable doesn’t cause tables to float on its own, so you usually want to wrap it in a table environment:

\begin{table}

\begin{threeparttable}

% Normal bits of your table go here, and use \tnote{a} and
% \tnote{b} and so on to generate a note mark

\begin{tablenotes}
\item General note
\item General note 2
\item[a] Note for mark a
\item[b] Note for mark b
\end{tablenotes}

\end{threeparttable}

\caption{Caption goes here}
\end{table}

Unfortunately you need to generate the a, b, c (or whatever) numbering manually.

The general (unlabelled) notes are useful for things like “Bold entries are highest in the column”, so that they don’t need to go in the caption.
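
A filled-in sketch of the skeleton above, with placeholder rows and values made up purely for illustration. It follows the threeparttable documentation in writing the notes as \item entries inside tablenotes, with \tnote used only for the marks in the table body.

```latex
\begin{table}
\begin{threeparttable}

\begin{tabular}{lr}
Row & Value \\
% \tnote puts a superscript mark on the entry
First row\tnote{a} & 1.0 \\
Second row\tnote{b} & \textbf{2.0} \\
\end{tabular}

\begin{tablenotes}
% An \item with no label is a general note
\item Bold entries are highest in the column.
% Labelled items match the marks typed by hand above
\item[a] Note for mark a
\item[b] Note for mark b
\end{tablenotes}

\end{threeparttable}
\caption{Caption goes here}
\end{table}
```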

Useful LaTeX packages: bibliography

I’m going to post a short series of entries on LaTeX packages I found useful while preparing the examination copy of my PhD thesis. Largely this is just so that there’s a reference if my wiki page goes away, but also because I think many people use LaTeX the way I use it, that is, I got wedded to a bunch of packages 10 years ago and never really looked around for more recent stuff.

Today’s entry is a pretty slow start: the bibliography packages I used are pretty standard.

natbib

This is one of the most sophisticated and widely used packages for Harvard-style references (ie, “(Surname, Year)” rather than “[1]” style references).

\usepackage[round]{natbib}
\bibliographystyle{plainnat}

Inside your text use \citep for a reference in parentheses “(Surname, Year)”, and \citet for an in-text reference “Surname (Year)”. It’s important to note that the plain \cite command is equivalent to \citet, which you may not expect.

You can use \citeauthor to get just “Surname” and \citeyear to get just “Year”.
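
A short sketch of the four commands in running text. The key smith2010 is hypothetical: it stands in for whatever entry you have in your own .bib file, here imagined as a 2010 publication by an author named Smith.

```latex
% Parenthetical reference, should render as (Smith, 2010)
This effect has been reported before \citep{smith2010}.

% In-text reference, should render as Smith (2010)
\citet{smith2010} reported this effect.

% Just the surname, and just the year
\citeauthor{smith2010} returned to the topic after \citeyear{smith2010}.
```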

bibentry

This is a useful add-on to natbib, which allows you to insert full bibliography entries into the body of your text. This is useful in the declaration portion of a thesis (where you say something like “this thesis incorporates revised versions of the following published articles”).

\usepackage{bibentry}
\nobibliography*

Then later on when you want to insert a full bibliography entry into the middle of your text:

\bibentry{citationkey}

Product review: Shoeboxed

Update February 2017: this service is now known as Squirrel Street, and their smallest monthly pricing is significantly higher than it was in 2012. However much of the review still applies.

Original review:

I’ve been using Shoeboxed now for long enough to review it, I think.

Problem: as with every adult household, we have lots of incoming documents like bills and super statements and similar, and the high initial overhead on deciding whether and where to store them, plus re-sorting them later and so on has never been something we’ve been on top of. Come tax time, in particular, we were usually opening piles of envelopes and hoping for the best.

In 2007 or 2008 we started scanning and shredding a lot of things, but that still left going through and labelling the scans as a problem, plus when I went on maternity leave in 2010 we didn’t have access to a sheet-feed scanner anymore and got behind and never caught up. Back to the “giant unsorted pile of paper” solution.

There are a few services that accept mail on behalf of people and send scans (Pass the Post, Keeping You Posted) but these tend to be quite expensive if you want them to handle all your mail, and also there’s still a time-critical decision step (scan it or send it to me). It tends to be aimed at travellers or businesses. It was annoying enough though that every few months I hit the search engines and eventually lit on Shoeboxed.

What Shoeboxed does:

  1. accepts documents either sent by mail (not one at a time, many in a big envelope) to a US or AU postal address, or uploaded
  2. scans the physical document if any
  3. does data entry for the major data within (for bills, say, the sender and the total)
  4. makes them available after logging in on their website
  5. makes them available over an API to other services like bookkeeping websites

What Shoeboxed doesn’t do:

  1. directly accept individual physical mail on your behalf (they do have a service where you can get online receipts sent to them, I haven’t used it)
  2. full OCR of the scanned documents

There’s a very very limited Free plan involving uploading (not mailing) up to 5 documents a month for OCR plus unlimited uploads if you do your own data entry. The next plan up in Australia, which we’re on, is $20 a month, and includes all the features I listed.

Impressions:

  1. overall, it pretty much does what we want: gets paper out of our house and into an easily searchable online form with scans available
  2. because it isn’t fully OCRed I still have to go through non-bills in order to note what they are, eg, a mail from childcare could be a fee change or a newsletter or a note about illness and if I need to find it in a year I’d have to search on the name and look through them all
  3. the processing speed on the Lite plan (contents of envelopes appear on the website in 3–5 days) has been a bit annoying on occasion; I’ve found myself scanning really time-critical documents and uploading them
  4. the processing speed on uploaded scans is great; the data entry is usually done within the hour
  5. the usage reporting doesn’t incorporate the bonus scans one gets by doing things like signing up for an annual plan, or answering demographic surveys. Very annoying!

For our needs, it’s definitely an improvement over our home-rolled solution. We’re scrambling to get 250 documents to them before our annual purchase bonus expires.

Connecting a Debian/Ubuntu server to the Macquarie University OneNetAnywhere VPN

I realise that this is a rather specific problem, but hopefully the links I provide here will be useful for anyone wanting to access a PPTP VPN themselves.

I have to say that this is one of those entries more likely to be useful if you ever have this specific problem (eg, you came here via a search engine query for “argh pptp mppe errors argh argh argh”) and less for a casual reader. Apologies, loyal fans!
