Category Archives: Workflow

Filter segments in DVX by length using SQL

When working with a CAT tool, it is sometimes helpful to filter segments of a certain length. DVX is uniquely flexible when it comes to filtering, because it lets you apply SQL WHERE clauses to any part of the segment: source text, target text or segment status. To enter a filter, click the Segment filter combo box above the translation grid and select “SQL Statement”. Then you can enter the desired SQL filtering clause. Read on to learn how to filter by text length (characters or words).

Filtering by character count is easy:
Len(Source) >= 100 filters segments with at least 100 source characters
Len(Target_XXXX) >= 100 filters segments with at least 100 target characters

Note that this includes tags (like {123}) in the count, which is probably not exactly what you want. For the source text, you can use Source2Index (see below for an explanation) instead of Source. For the target text, I know of no easy way of excluding codes.
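To see how much tags can distort the count, here is a small Python sketch. The tag pattern {digits} and the helper name are my own illustration of the idea, not anything built into DVX:

```python
import re

def char_count(segment: str, ignore_tags: bool = False) -> int:
    """Count characters, optionally excluding inline tags.

    Assumption: inline tags always have the form {digits}, like {123}.
    """
    if ignore_tags:
        segment = re.sub(r"\{\d+\}", "", segment)
    return len(segment)

seg = "{123}Hello world{124}"
print(char_count(seg))                     # 21, what Len(Source) would see
print(char_count(seg, ignore_tags=True))   # 11, the visible text alone
```

With two five-character tags around an eleven-character sentence, the raw count is almost double the visible text, so a `Len(Source) >= 100` filter can easily catch short but tag-heavy segments.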

Filtering by word count is less obvious, as SQL has no concept of words, so you either have to define what a word is in SQL (the MS JET variant used by DVX), or you have to use a trick I learned some time ago, probably from the DVX mailing list (sorry, I cannot remember whom to credit).

So how do you define a word in SQL? Unfortunately, SQL is rather limited compared to regular expressions when it comes to manipulating text. If you define a word as a sequence of non-space characters that contains a letter (to exclude tags like {123}), you end up with this simplistic definition of a word, expressed as a Like pattern building block:
*[a-z]*

One thing to keep in mind about [a-z] is that it doesn’t include non-ASCII characters. So if you have words that consist exclusively of non-ASCII characters, you’ll have to include them in the character class in the square brackets or they will not count as separate words. For instance, “ő” is a word in its own right in Hungarian, meaning “he” or “she”, so for Hungarian you would extend the class:
*[a-ző]*

You can add as many of these word definitions as you wish. So to filter segments with three or more words in the source, you’d use this:
Source Like '*[a-z]* *[a-z]* *[a-z]*'
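If you want to sanity-check what such a pattern matches, here is a rough Python equivalent. This is my own approximation: JET’s `*` wildcard becomes `.*`, the `[a-z]` class stays as-is, and the match is made case-insensitive, which JET’s Like is by default:

```python
import re

def has_n_source_words(text: str, n: int = 3) -> bool:
    """Rough regex equivalent of the JET Like pattern
    '*[a-z]* *[a-z]* *[a-z]*', generalized to n words:
    n letter-containing chunks separated by literal spaces."""
    pattern = " ".join([".*[a-z].*"] * n)
    return re.fullmatch(pattern, text, flags=re.IGNORECASE | re.DOTALL) is not None

print(has_n_source_words("One two three"))   # True
print(has_n_source_words("{1} {2} {3}"))     # False: tags contain no letters
print(has_n_source_words("Only two"))        # False: only two words
```

Note how the letter requirement in each chunk keeps space-separated tags from being counted as words.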

Now on to the trick I promised you above. DVX stores a stripped-down version of each source segment in a separate column called Source2Index, which it uses for operations like TM matching, sorting and filtering. This index field contains all the words, but no tags and no leading or trailing non-alphabetical characters. So if you have this source:
[Scroll text] {5578} Define {5579}/{5580}
…then you get this in the Source2Index column:
Scroll text] Define
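As far as I can tell, the stripping can be approximated in Python like this. This is my own reconstruction from the example above, not DVX’s actual code, and it assumes ASCII letters and {digits}-style tags; DVX’s real rules may differ in details:

```python
import re

def approx_source2index(source: str) -> str:
    """Rough approximation of DVX's Source2Index derivation."""
    text = re.sub(r"\{\d+\}", "", source)                # drop tags like {5578}
    text = re.sub(r"^[^A-Za-z]+|[^A-Za-z]+$", "", text)  # strip leading/trailing non-letters
    return re.sub(r"\s+", " ", text)                     # normalize white space

print(approx_source2index("[Scroll text] {5578} Define {5579}/{5580}"))
# → Scroll text] Define
```

Note that the “]” survives because only leading and trailing non-letters are stripped, not punctuation inside the text.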

Most of the time, this stripped version of the source is better suited for counting the words in the text than the raw text in the Source column. So you can either apply the WHERE clause above to the Source2Index column rather than the Source column, or you can take advantage of the fact that in Source2Index, DVX normalizes white space (replaces each run of consecutive white-space characters with a single space). This means that the number of words in a segment is one higher than the number of spaces, which gives you the following SQL WHERE clause to filter segments based on word count.
To filter all segments with fewer than three words in the source (fewer than three words means at most one space):
Len(Pairs.Source2Index)-Len(Replace(Pairs.Source2Index, " ", "")) <= 1
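The same counting trick is easy to verify outside DVX; here is a minimal Python sketch (the function name is my own):

```python
def word_count_via_spaces(normalized: str) -> int:
    """Word count of a whitespace-normalized, non-empty string:
    the number of spaces plus one.
    (An empty string would need special-casing.)"""
    spaces = len(normalized) - len(normalized.replace(" ", ""))
    return spaces + 1

print(word_count_via_spaces("Scroll text] Define"))  # 3: two spaces plus one
```

Subtracting the length of the space-free string from the full length is exactly what the nested Len/Replace calls do in the SQL clause above.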

Update 7 Aug 2015:
Unfortunately, recent DVX versions restrict the use of “unsafe” functions, including Replace, so the clause above that uses Replace no longer works. But you can still apply the Like-based word definition to the Source2Index column, which has the added advantage that its tags have already been stripped out. To filter segments with at least three source words, for instance:
Source2Index Like '*[a-z]* *[a-z]* *[a-z]*'

How do you check your final output for errors?

Over the years, I have had the privilege of reviewing the output of hundreds of fellow technical writers and translators. Many of them were gifted professionals; some had a level of technical knowledge or linguistic skill that I can only hope to attain some day.

But one area where I had to adjust my expectations as I gained experience was the incidence of typos, or more generally of gaffes so immediately obvious that they can only be attributed to insufficient final QA before delivery rather than to a lack of knowledge.

After doing some scientifically woefully inadequate sampling, I now tip my hat in respect when I have to correct fewer than one such error per 1,000 words. I am quite satisfied when I see fewer than one per 500 words, and grow increasingly disappointed only when the error rate climbs above that mark.

So what do I do myself to keep my self-respect in that regard – and to earn the respect of anyone checking my work?

I always re-read a sentence as soon as I have completed it; in fact, I have it read out to me by a text-to-speech (TTS) application while following along on screen.

Text-to-speech has progressed amazingly over the last decade, and there are now remarkably natural-sounding TTS voices available for almost any language, along with fairly capable utilities to harness them. And reading a text while simultaneously listening to it does wonders for my proofreading efficiency, as it engages not one but two senses in the QA process.

After finishing a job, I re-read the full text, again with TTS assistance. If I make changes, I make sure to fully re-read (and listen to) every sentence I have changed.

Finally, I run the text through at least one, but in most cases two spellcheckers, usually Microsoft Office and the best available dictionary for the Hunspell engine (there are usually several available for any language). Hunspell is used, among others, in LibreOffice and newer incarnations of Adobe’s Creative Suite. When I once compared their performance in checking a word list of around one million word forms, I was amazed by how little overlap there was between the two engines in terms of false positives and, worse, false negatives.

I also take every opportunity to have my work proofread by a peer, as it is usually more efficient to proofread someone else’s work than one’s own (though the gap can be closed with training and letting one’s own text “lie” for some time before re-reading it).

For all the trouble I am going to, I am afraid I cannot yet report that my error rate has plummeted to zero. The last time I had it checked, I was averaging between one error per 2000 words (on bad days) and one error per 5000 words (on good days). But I’m working on it.

What QA steps do you take to make your clients happy and to avoid those embarrassing moments when you open up an old translation for reference (or retrieve a TM match that definitely cannot be blamed on anyone else) and the first thing that jumps out at you is a glaring typo?