Filter segments in DVX by length using SQL

When working with a CAT tool, sometimes it is helpful to filter segments of a certain length. DVX is uniquely flexible when it comes to filtering by allowing you to apply SQL WHERE clauses to any part of the segment, including source text, target text or segment status. To enter a filter, you click on the Segment filter combo box above the translation grid and select “SQL Statement”. Then you can enter the desired SQL filtering clause. Read on to learn how to filter by text length (characters or words).

Filtering by character count is easy:
Len(Source) >= 100 filters segments with at least 100 source characters
Len(Target_XXXX) >= 100 filters segments with at least 100 target characters

Note that this includes tags (like {123}) in the count, which is probably not exactly what you want. For the source text, you can use Source2Index (see below for an explanation) instead of Source. For the target text, I know of no easy way of excluding codes.

Filtering by word count is less obvious as SQL doesn’t have a concept of words, so you either have to define in SQL language (the MS JET variant used by DVX) what a word is, or you have to use a trick I learned some time ago, probably from the DVX mailing list (sorry I cannot remember who to credit).

So how do you define a word in SQL? Unfortunately SQL is rather limited compared to regular expressions when it comes to manipulating text. If you define a word as a sequence of non-space characters that contain a letter (to exclude tags like {123}, you end up with this simplistic definition of a word:
'*[a-z]*'

One thing to keep in mind about [a-z] is that it doesn’t include non-ASCII characters. So if you have words that consist exclusively of non-ASCII characters, you’ll have to include them in the character class in the square brackets or they will not count as separate words. For instance (“ő” is a word in its own right in Hungarian, meaning “he” or “she”):
'*[a-zŐő]*'

You can add as many of these word definitions as you wish. So to filter segments with 3 words in the source or longer, you’d use this:
Source Like '*[a-z]* *[a-z]* *[a-z]*'

Now on to the trick I promised you above. DVX stores a stripped-down version of each source segment in a separate column called Source2Index, used for operations like TM matching, sorting or filtering. This index field contains all words, but no tags and no leading and trailing non-alphabetical characters. So If you have this source:
[Scroll text] {5578} Define {5579}/{5580}
…then you get this in the Source_Index column:
Scroll Text] Define

Most of the time, this stripped version of the source is better suited for counting the words in the text than the raw text in the Source column. So you can either apply the WHERE clause above to the Source2Index column rather than the Source column, or you can take advantage of the fact that in Source2Index, DVX normalizes white space (replaces subsequent occurrences of white space with a single space). This means that the number of words in a segment is one higher than the number of spaces. This gives you the following SQL WHERE clause to filter segments based on word count:
To filter all segments with less than three words in the source:
Len(Pairs.Source2Index)-Len(Replace(Pairs.Source2Index," ", "")) <= 3

Update 7 Aug 2015:
Unfortunately, recent DVX versions restrict the use of "unsafe" functions, including Replace, so the above clause that uses Replace doesn't work any more. But you can still use the Source2Index column to filter for your own definition of spaces:
Source2Index Like '*[a-z]* *[a-z]* *[a-z]*'

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.