Filter segments in DVX by length using SQL

When working with a CAT tool, it is sometimes helpful to filter segments of a certain length. DVX is uniquely flexible here: it lets you apply SQL WHERE clauses to any part of the segment, including the source text, the target text or the segment status. To enter a filter, click the Segment filter combo box above the translation grid and select “SQL Statement”; then enter the desired SQL filtering clause. Read on to learn how to filter by text length (characters or words).

Filtering by character count is easy:
Len(Source) >= 100 filters segments with at least 100 source characters
Len(Target_XXXX) >= 100 filters segments with at least 100 target characters

Note that this includes tags (like {123}) in the count, which is probably not exactly what you want. For the source text, you can use Source2Index (see below for an explanation) instead of Source. For the target text, I know of no easy way of excluding codes.
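To see how much the tags distort the count, you can prototype the comparison outside DVX. A minimal Python sketch, assuming DVX-style inline tags always look like {123} (adjust the pattern if yours differ):

```python
import re

def char_count_without_tags(segment: str) -> int:
    """Length of a segment after stripping DVX-style inline tags like {123}.
    The tag pattern is an assumption based on the examples in this article."""
    return len(re.sub(r"\{\d+\}", "", segment))

segment = "{5578} Define {5579}/{5580}"
print(len(segment))                      # 27 - raw length, tags included
print(char_count_without_tags(segment))  # 9  - tags stripped
```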

Filtering by word count is less obvious as SQL doesn’t have a concept of words, so you either have to define in SQL language (the MS JET variant used by DVX) what a word is, or you have to use a trick I learned some time ago, probably from the DVX mailing list (sorry I cannot remember who to credit).

So how do you define a word in SQL? Unfortunately, SQL is rather limited compared to regular expressions when it comes to manipulating text. If you define a word as a sequence of non-space characters that contains a letter (to exclude tags like {123}), you end up with this simplistic definition of a word:
'*[a-z]*'

One thing to keep in mind about [a-z] is that it doesn’t include non-ASCII characters. So if you have words that consist exclusively of non-ASCII characters, you’ll have to include them in the character class in the square brackets or they will not count as separate words. For instance (“ő” is a word in its own right in Hungarian, meaning “he” or “she”):
'*[a-zŐő]*'

You can chain as many of these word definitions as you wish. So to filter segments with at least three words in the source, you’d use this:
Source Like '*[a-z]* *[a-z]* *[a-z]*'
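If you want to sanity-check such a pattern before typing it into DVX, note that JET’s Like maps cleanly onto regular expressions: * becomes .*, the [a-z] class carries over unchanged, and Like matches the whole string (hence fullmatch below). A Python sketch with made-up segment strings; IGNORECASE mimics JET’s case-insensitive text comparison:

```python
import re

def matches_word_filter(segment: str, n: int) -> bool:
    """Emulate the JET clause  Source Like '*[a-z]* *[a-z]* ... *[a-z]*'
    with n word patterns: n letter-containing chunks separated by spaces."""
    pattern = " ".join([".*[a-z].*"] * n)
    return re.fullmatch(pattern, segment, re.IGNORECASE) is not None

print(matches_word_filter("{123} Define the scroll text", 3))  # True
print(matches_word_filter("{123} Define", 3))                  # False
```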

Now on to the trick I promised you above. DVX stores a stripped-down version of each source segment in a separate column called Source2Index, used for operations like TM matching, sorting or filtering. This index field contains all words, but no tags and no leading and trailing non-alphabetical characters. So if you have this source:
[Scroll text] {5578} Define {5579}/{5580}
…then you get this in the Source2Index column:
Scroll Text] Define

Most of the time, this stripped version of the source is better suited for counting the words in the text than the raw text in the Source column. So you can either apply the WHERE clause above to the Source2Index column rather than the Source column, or you can take advantage of the fact that in Source2Index, DVX normalizes white space (replaces subsequent occurrences of white space with a single space). This means that the number of words in a segment is one higher than the number of spaces. This gives you the following SQL WHERE clause to filter segments based on word count:
To filter all segments with fewer than three words in the source (i.e. at most one space):
Len(Pairs.Source2Index)-Len(Replace(Pairs.Source2Index," ", "")) < 2
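The space-counting idea is easy to verify outside DVX. A small sketch, assuming the input already has normalized white space, as DVX guarantees for Source2Index:

```python
def word_count(normalized: str) -> int:
    """Words = spaces + 1. Only valid for text whose runs of white space
    have been collapsed to single spaces, as in Source2Index."""
    return normalized.count(" ") + 1

# The JET equivalent of normalized.count(" ") is
# Len(Source2Index) - Len(Replace(Source2Index, " ", ""))
print(word_count("Scroll text] Define"))  # 3
print(word_count("Define"))               # 1
```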

Update 7 Aug 2015:
Unfortunately, recent DVX versions restrict the use of "unsafe" functions, including Replace, so the clause above that uses Replace no longer works. But you can still apply your word-pattern definition directly to the Source2Index column:
Source2Index Like '*[a-z]* *[a-z]* *[a-z]*'

How to abbreviate “Mehrwertsteuer” correctly

Mehrwertsteuer (the German word for VAT) is a long compound with rather negative emotional connotations1, and as such it practically begs to be abbreviated. But how? There is no shortage of options:

MwSt. / Mw.-St. / Mwst. / MWST. / MWSt. / MWS.

Between us: it hardly matters which variant you pick, as long as you stick to it consistently and, should a client ever ask critically, have a reasonably plausible justification for your choice at hand.

One such justification might be that the Duden spelling dictionary recommends the first variant, MwSt., and at least mentions the second, Mw.-St. The third and fourth variants, Mwst. and MWST., have the advantage of doing without medial capitals. The last two variants date from the good old days before the German spelling reform and look increasingly antiquated.

What matters is not to drop the period after the abbreviation, unless you are working on a text about tax law: there, abbreviations with medial capitals and without a period have become the convention: MwSt, USt, ESt.

Speaking of tax law: in Germany and Austria, tax law does not even know the term Mehrwertsteuer. What is levied there, under the German and Austrian Umsatzsteuergesetz (turnover tax acts), is, you guessed it, the Umsatzsteuer (USt.).

The “gesetzliche Mehrwertsteuer” (“statutory VAT”) is therefore a contradictio in adiecto and should not go unremarked during editing, even though you can read it even in the web shop of the Duden publishing house itself.

So the correct answer to the question posed at the outset, at least in Germany and Austria, is: USt.


  1. Unless you are the German finance minister: in 2013, Mehrwertsteuer contributed 23.9% of Germany’s total tax revenue, making it the second most lucrative type of tax, just barely behind wage tax (Lohnsteuer, 25.5%). Income tax (Einkommensteuer, 6.8%) and corporate tax (Körperschaftsteuer, 3.1%), incidentally, trail far behind.

Higher word counts, lower rates?

Occasionally, I receive enquiries that request discounts based on job size. Is this justified?

Among my freelancing colleagues, opinions go both ways. Some accept the premise that securing a large job is worth a discount, while others point out that regardless of the size of a job, they still have to process every single word with the same care, and word count therefore has no bearing on their rates.

There are valid arguments for both sides, and to me, the answer boils down to a difference between theory and practice.

In theory, one big job is in many ways better than several smaller ones. You save on various overhead activities like administration, invoicing or marketing. You can plan ahead better than with several small jobs arriving in succession. If the text is homogeneous enough, you might save on research compared to several jobs from different domains.

In practice, however, this is just one aspect of overall pricing, and even in the ideal case when the advantages mentioned above come to bear, their impact is generally dwarfed by more important considerations like subject matter, availability of reference material, the writing quality of the source text and the possibility of approaching the client for clarifications. This is particularly true for well-established freelancers with streamlined internal processes and a regular stream of work. For them, the administrative overhead of individual jobs is minimal, and a number of practical drawbacks often negate any upsides big jobs might have.

Clients’ desired turnaround times usually do not increase proportionally with word count. So you usually have less (sometimes a lot less) time for one big job than for several smaller jobs with an equivalent word count. This can seriously impinge on the “room for manoeuvre” a freelancer needs to accommodate jobs coming in at short notice from regular customers.

Big assignments are also often the ones that need to be split up among several translators, incurring significant management and communication overhead to ensure an end result that is consistent both in terminology and style.

In consequence, the simple equation big job = big discount doesn’t really hold up; there are several factors that have a much larger impact on a freelancer’s rate calculations. So next time you receive this request, be sure to take them into consideration before agreeing to a discount you may come to regret – either that same day when another client comes knocking with an urgent project, or later on when you find out that translating 10k words isn’t really that much quicker than translating 5x 2000 words.

And if as a client, budget constraints force you to look for every penny that can be saved, next time maybe you can find a way to offer your trusted translator a few additional days in lead time – they will most certainly appreciate it and may even offer you a discount in exchange.

Excel spellchecking – watch out for false negatives

When doing a spell check in Excel1, you should be aware that it accepts any word as correct as long as it contains a character that is outside the code page of the language checked.

So if you happen to misspell nämlich as namlich in German, Excel correctly flags your typo. However, if you are unlucky enough to spell it as nămlich, Excel happily absolves you. If this example seems contrived to you, think about foreign names. Excel leaves you on your own there, even if the correct form is in your spelling dictionary.

As long as you are aware of the issue, it shouldn’t be a major problem – of course you never rely on spell-checking only – or do you?

While I do use spell-checking as a safety net (a double one, in fact, as I run my output through two spell-checking engines whenever possible), my primary means of living up to my own zero-error tolerance is careful proofreading after completion, paced by a text-to-speech program that reads the text out to me. This provides two major benefits: it prevents me from reading too fast, and it alerts me to particularly surreptitious typos (long words, many consonants) by tripping over their pronunciation.


  1. Checked with Excel 2010. Let me know in the comments if you have seen the issue in later versions.

How do you check your final output for errors?

Over the years, I have had the privilege of reviewing the output of hundreds of fellow technical writers and translators. Many of them were gifted professionals; some had a level of technical knowledge or linguistic skill that I can only hope to attain some day.

But one area where I had to adjust my expectations with growing experience was the incidence of typos, or more generally gaffes of any kind that are so immediately obvious that they can only be attributed to insufficient final QA before delivery rather than to a lack of knowledge.

After doing some scientifically woefully inadequate sampling, I now tip my hat in respect when I have to correct fewer than one such error per 1000 words. I am quite satisfied when I see fewer than one per 500 words, and grow increasingly disappointed only when the error rate rises above that mark.

So what do I do myself to keep my self-respect in that regard – and to earn the respect of anyone checking my work?

I always re-read a sentence as soon as I have completed it; in fact, I have it read out to me by a text-to-speech (TTS) application while following along on screen.

Text-to-speech has progressed amazingly over the last decade, and there are now remarkably natural-sounding TTS voices available for almost any language, along with fairly capable utilities to harness them. And reading a text while simultaneously listening to it does wonders for my proofreading efficiency, as it engages not one but two senses in the QA process.

After finishing a job, I re-read the full text, again with TTS assistance. If I make changes, I make sure to fully re-read (and listen to) every sentence I have changed.

Finally, I run the text through at least one, but in most cases two spellcheckers, usually Microsoft Office and the best available dictionary for the Hunspell engine (there are usually several available for any language). Hunspell is used, among others, in LibreOffice and newer incarnations of Adobe’s Creative Suite. When I once compared their performance in checking a word list of around one million word forms, I was amazed by how little overlap there was between the two engines in terms of false positives and, worse, false negatives.

I also take every opportunity to have my work proofread by a peer, as it is usually more efficient to proofread someone else’s work than one’s own (though the gap can be closed with training and letting one’s own text “lie” for some time before re-reading it).

For all the trouble I am going to, I am afraid I cannot yet report that my error rate has plummeted to zero. The last time I had it checked, I was averaging between one error per 2000 words (on bad days) and one error per 5000 words (on good days). But I’m working on it.

What QA steps do you take to make your clients happy and to avoid those embarrassing moments when you open up an old translation for reference (or retrieve a TM match that definitely cannot be blamed on anyone else) and the first thing that jumps out at you is a glaring typo?