EQL - Version 2 of the Emu Query Language

2018-04-11

Introduction

This document introduces and defines version 2 of the Emu Query Language (EQL) and illustrates its capabilities with numerous examples. The EQL is a query language aimed at speech and language researchers; it is meant to be easy to learn and understand, yet expressive and powerful. It enables researchers to query the annotation structures of databases stored in the emuDB format. The emuR package provides a query() function to query emuDBs that are loaded into the current R session (for more information see the emuR_intro as well as the emuDB vignettes). The main argument of the query() function is the query argument (query(..., query = "XXX", ...) where XXX is the query string). This document focuses solely on these query strings and how to construct them.

To recap what was already mentioned in the emuR_intro and emuDB vignettes: the annotation structure of an emuDB can be thought of as a graph. Each annotation consists of annotational units (called ITEMs) that are grouped together in an ordered array. Each ITEM can be linked to ITEMs of other levels if a corresponding linkDefinition is present in the emuDB. An example excerpt of such an annotation can be seen below.

Example of simple hierarchy

Although they are not the focus of this vignette, two parameters of the query() function are worth noting: sessionPattern and bundlePattern. Both expect a regular expression string and restrict the sessions and bundles that the query is run against.
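As a minimal sketch of how these two parameters might be used, the following restricts a query to one session and a subset of bundles. It assumes the demo emuDB "ae" that ships with emuR, and that this database contains a session named "0000" and bundles whose names start with "msajc0"; both names are assumptions about the demo data.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# both patterns are regular expressions; the session and bundle names
# used here are assumptions about the demo data
sl = query(ae, "[Phonetic == n]",
           sessionPattern = "0000",
           bundlePattern = "msajc0.*")
```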

Examples

We will now jump right in with a number of example query strings that were adapted from (Harrington and Cassidy 2002; Cassidy and Harrington 2001). To have some data to work with, let us create a demo database and load it into the current R session:

# load the package
library(emuR)

# create demo data in folder provided by the tempdir() function
create_emuRdemoData(dir = tempdir())

# get the path to emuDB called 'ae' that is part of the demo data
path2folder = file.path(tempdir(), "emuR_demoData", "ae_emuDB")

# load emuDB into current R session
ae = load_emuDB(path2folder)

Simple equality / inequality / matching / non-matching queries (single argument)

The syntax of a simple equality / inequality / matching / non-matching query is "[L OPERATOR A]" where “L” specifies a level (or alternatively the name of a parallel attributeDefinition), “OPERATOR” is one of the following operators: “==” (equality), “!=” (inequality), “=~” (matching) or “!~” (non-matching), and “A” is an expression specifying the labels of the ITEMs of “L”.

Example Q & A’s:
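The four operators could be tried out as sketched below. The block assumes the demo emuDB "ae" and its "Phonetic" level; the chosen labels and regular expressions are illustrative assumptions, not taken from the original examples.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

sl_eq  = query(ae, "[Phonetic == m]")    # equality: ITEMs labeled "m"
sl_neq = query(ae, "[Phonetic != m]")    # inequality: ITEMs not labeled "m"
sl_re  = query(ae, "[Phonetic =~ a.*]")  # matching: regular expression "a.*"
sl_nre = query(ae, "[Phonetic !~ a.*]")  # non-matching: same expression negated
```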

INFO: The above examples use three operators that are new to the EQL as of version 2. The first is the “==” equality operator, which has the same meaning as the “=” operator of the EQL1 (which is still available) while providing a cleaner, more precise syntax. The other two are “=~” and “!~”, the new regular expression matching and non-matching operators.

If you wish to use parentheses, blanks or characters that represent operators used by the EQL (see EBNF) as part of a label matching string (the string on the right-hand side of one of the operators mentioned above), you must place this string in additional single quotation marks to escape these characters. For example, a search for ITEMs containing the label O_&quot on the Phonetic level cannot be written as "[Phonetic == O_&quot]" but has to be written as "[Phonetic == 'O_&quot']". Note that reversing the single vs. double quotation mark order is currently not supported, i.e. '[Phonetic == "O_&quot"]' won’t lead to the desired behavior. Only use double quotation marks for the outer wrapping of the query string to avoid this issue.
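The quoting rule can be sketched in plain R without touching a database. The helper quote_label() is hypothetical (it is not part of emuR) and simply wraps a label in single quotes before it is placed into a double-quoted query string.

```r
# hypothetical helper (not part of emuR): wrap a label in single quotes so
# that blanks, parentheses and operator characters are treated as label text
quote_label = function(label) paste0("'", label, "'")

# the outer wrapping uses R's double quotes, the inner quoting single quotes
query_str = paste0("[Phonetic == ", quote_label("O_&quot"), "]")
print(query_str)
```

The resulting string can then be passed as the query argument of query().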

Sequence queries using the “->” sequence operator

The syntax of a query string using the “->” sequence operator is "[L == A -> L == B]" where ITEM “A” on level “L” precedes ITEM “B” on level “L”. For a sequential query to work both arguments must be on the same level (alternatively parallel attributeDefinitions of the same level may also be chosen).

Example Q & A’s:
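A sequence query might look as follows. This is a sketch that assumes the demo emuDB "ae" and its "Phoneme" level; the labels "@" and "n" are illustrative choices.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# sequences of "@" followed by "n" on the "Phoneme" level
sl_seq = query(ae, "[Phoneme == @ -> Phoneme == n]")

# same sequence, but the result modifier "#" marks the "@" ITEMs as the result
sl_first = query(ae, "[#Phoneme == @ -> Phoneme == n]")
```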

Subsequent sequence queries using nesting of the “->” sequence operator

The general strategy for constructing a query string that retrieves subsequent sequences of labels is to nest multiple sequences while paying close attention to the correct placement of the brackets. An abstracted version of such a query string for the subsequent sequence of arguments A1, A2, A3, A4, A5 would be: "[[[[A1 -> A2] -> A3] -> A4] -> A5]" where each argument (e.g. “A1”) represents an equality / inequality / matching / non-matching expression on the same level (alternatively parallel attributeDefinitions of the same level may also be chosen).

Example Q & A’s:

  • Q: What is the query to retrieve all sequences of ITEMs containing the labels “@”, “n” and “s” on the “Phonetic” level?
  • A:
  • Q: What is the query to retrieve all sequences of ITEMs containing the labels “to”, “offer” and “any” on the “Text” level?
  • A:
  • Q: What is the query to retrieve all sequences of ITEMs containing labels “offer” followed by two arbitrary labels followed by “resistance”?
  • A:
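The nesting strategy amounts to a left fold over the individual expressions. The sketch below builds candidate query strings for the three questions above in plain R; nest_seq() is a hypothetical helper, not an emuR function, and the use of "=~ .*" for "arbitrary label" is an assumption based on the regular expression operator.

```r
# hypothetical helper (not part of emuR): nest simple expressions into
# "[[[A1 -> A2] -> A3] -> ...]" by folding from the left
nest_seq = function(terms) {
  Reduce(function(lhs, rhs) paste0("[", lhs, " -> ", rhs, "]"), terms)
}

q1 = nest_seq(c("Phonetic == @", "Phonetic == n", "Phonetic == s"))
q2 = nest_seq(c("Text == to", "Text == offer", "Text == any"))
# "=~ .*" is used here to match an arbitrary label
q3 = nest_seq(c("Text == offer", "Text =~ .*", "Text =~ .*", "Text == resistance"))
```

Each string can be passed to query() as the query argument, for example query(ae, q1) with a loaded emuDB handle ae.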

INFO: As the EQL1 didn’t have a regular expression operator, users often resorted to queries such as “[Phonetic != XXX]” (where XXX is a label that was not part of the label set of the “Phonetic” level) to match every label on the “Phonetic” level. Although this is still possible in the EQL2, we strongly recommend using regular expressions, as they provide a much clearer and more precise syntax and are less error-prone.

Conjunction operator &

The syntax of a query string using the conjunction operator can schematically be written as: "[L == A & L_a2 == B & L_a3 == C & L_a4 == D & ... & L_an == N]" where ITEMs on level “L” that have the label “A” (technically belonging to the first attribute of that level, i.e. L_a1, which by default has the same name as its level) also have the attributes “B”, “C”, “D”, …, “N”. As with the sequence operator, all expressions must be on the same level (i.e. parallel attributeDefinitions of the same level, indicated by a2 - an, may be chosen).

The conjunction operator is used to combine query conditions on the same level. This makes sense in two cases:

  1. to combine different attributes of the same level: "[phonetic == l & sonorant == T]" when ‘sonorant’ is an additional attribute of level ‘phonetic’.
  2. to combine a basic query with a function (see sections Position and Count below): "[phonetic == l & Start(word, phonetic) == 1]"

Example Q & A’s:
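Two conjunction queries are sketched below. They assume the demo emuDB "ae" and that its "Word" level carries the parallel attributeDefinitions "Text", "Word" and "Accent"; the labels are illustrative assumptions.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# words whose "Text" label is "always" and whose "Word" attribute is "C"
sl_always = query(ae, "[Text == always & Word == C]")

# all words that carry the label "S" on the parallel "Accent" attribute
sl_accented = query(ae, "[Text =~ .* & Accent == S]")
```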

Domination operator ^ (== hierarchical queries)

A schematic representation of a simple domination query string that retrieves all ITEMs “A” of level “L1” that are dominated by (i.e. are directly or indirectly linked to) ITEMs “B” of level “L2” would be "[L1 == A ^ L2 == B]". The domination operator is not directional, meaning that either ITEMs in “L1” dominate ITEMs in “L2” or ITEMs in “L2” dominate ITEMs in “L1”. Note that linkDefinitions that specify the validity of the domination have to be present in the emuDB for this to work (see emuDB vignette for details).

Simple Domination

Example Q & A’s:

  • Q: What is the query to retrieve all ITEMs containing the label “p” in the “Phoneme” level that occur in strong syllables (i.e. dominated by / linked to ITEMs of the level “Syllable” that contain the label “S”)?
  • A:
  • Q: What is the query to retrieve all syllable ITEMs which contain a phoneme ITEM labeled “p”?
  • A:
  • Q: What is the query to retrieve all syllable ITEMs which neither contain a phoneme ITEM labeled “k” nor “p” nor “t”?
  • A:
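The first two questions above could be approached as sketched here, assuming the demo emuDB "ae" with linked "Syllable" and "Phoneme" levels. The third question is left out on purpose.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# "p" phonemes that are dominated by strong ("S") syllables
sl_p = query(ae, "[Phoneme == p ^ Syllable == S]")

# syllables (any label) that dominate a phoneme labeled "p"
sl_syl = query(ae, "[Syllable =~ .* ^ Phoneme == p]")
```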

INFO: Even though the domination operator is not directional, what you place to the left and to the right of the operator does have an impact on the result. If no result modifier (the hash tag “#”) is used the query engine will automatically assume that the expression to the left of the operator specifies what is to be returned. This means that the schematic query string: "[L1 == A ^ L2 == B]" is semantically equal to the query string: "[#L1 == A ^ L2 == B]". As it is more explicit to mark the desired result we recommend you always use the result modifier where possible.

Multiple Domination

The general strategy for constructing a query string that specifies multiple domination relations of ITEMs is to nest multiple domination expressions while paying close attention to the correct placement of the brackets. A dominance relationship sequence of the arguments “A1”, “A2”, “A3”, “A4”, “A5” can therefore be noted as: "[[[[A1 ^ A2] ^ A3] ^ A4] ^ A5]" where “A1” is dominated by “A2” and “A3” and so on.

Example Q & A’s:

  • Q: What is the query to retrieve all ITEMs on the “Phonetic” level that are part of a strong syllable (labeled “S”) and belong to the words “amongst” or “beautiful”?
  • A: "[[Phonetic =~ .* ^ Syllable == S] ^ Text == amongst | beautiful]"
  • Q: The same as the question above but this time we want the Text ITEMs.
  • A: "[[Phonetic =~ .* ^ Syllable == S] ^ #Text == amongst | beautiful]"

Position

The EQL has three function terms to specify where in a dominance relationship a child level ITEM is allowed to occur. The three function terms are “Start()”, “End()” and “Medial()”.

Simple usage of Start(), End() and Medial()

A schematic representation of a query string representing a simple usage of the Start(), End() and Medial() functions would be: "POSFCT(L1, L2) == 1" or "POSFCT(L1, L2) == TRUE". In this representation “POSFCT” is a placeholder for one of the three functions, and level “L1” has to dominate level “L2”. The “== 1” / “== TRUE” part of the query string indicates that the ITEMs of level “L2” that match the position condition are returned. If it is set to “== 0” / “== FALSE” instead, all ITEMs of “L2” that do not match the condition are returned. For a visualization of what is returned by the various options of the three functions see the illustration below.

Illustration of what is returned by the Start(), Medial() and End() functions

INFO: As using 1 and 0 for TRUE and FALSE is not that intuitive to most R users, the EQL version 2 optionally allows for the values TRUE / T and FALSE / F to be used instead of 1 and 0. This syntax should be more familiar to most R users.

Example Q & A’s:

  • Q: What is the query to retrieve all word-initial syllables?
  • A:
  • Q: What is the query to retrieve all word-initial phonemes?
  • A:
  • Q: What is the query to retrieve all non-word-initial syllables?
  • A:
  • Q: What is the query to retrieve all word-final syllables?
  • A:
  • Q: What is the query to retrieve all word-medial syllables?
  • A:
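The five questions above could be answered along the following lines. This is a sketch that assumes the demo emuDB "ae" with the levels "Word", "Syllable" and "Phoneme"; the names of the vector elements are only descriptive.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

queries = c(
  word_initial_syllables     = "[Start(Word, Syllable) == 1]",
  word_initial_phonemes      = "[Start(Word, Phoneme) == 1]",
  non_word_initial_syllables = "[Start(Word, Syllable) == 0]",
  word_final_syllables       = "[End(Word, Syllable) == 1]",
  word_medial_syllables      = "[Medial(Word, Syllable) == 1]"
)

# run every query string against the emuDB; one result per question
results = lapply(queries, function(q) query(ae, q))
```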

Position and Boolean &

The syntax for combining a position function with the boolean operator is "[L == E & Start(L1, L2) == 1]" where ITEM “E” on level “L” occurs at the beginning of an ITEM of level “L1”. Once again “L1” has to dominate “L2”, and “L” and “L2” have to refer to the same level (optionally parallel attributeDefinitions of the same level may also be chosen).

Example Q & A’s:

  • Q: What is the query to retrieve all “n” phonemes at the beginning of a syllable?
  • A:
  • Q: What is the query to retrieve all word-final “m” phonemes?
  • A:
  • Q: What is the query to retrieve all non-word-final “S” syllables?
  • A:
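Possible answers to the three questions above are sketched below, assuming the demo emuDB "ae". In each query the label condition and the second argument of the position function refer to the same level.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# "n" phonemes at the beginning of a syllable
sl_n = query(ae, "[Phoneme == n & Start(Syllable, Phoneme) == 1]")

# word-final "m" phonemes
sl_m = query(ae, "[Phoneme == m & End(Word, Phoneme) == 1]")

# "S" syllables that are not word-final
sl_s = query(ae, "[Syllable == S & End(Word, Syllable) == 0]")
```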

Position and Boolean ^

The syntax for combining a position function with the boolean hierarchical operator is "[L == E ^ Start(L1, L2) == 1]" where level “L” and level “L2” refer to different levels where either “L” dominates “L2”, or “L2” dominates “L”.

Example Q & A’s:

  • Q: What is the query to retrieve all “p” phonemes, which occur in the first syllable of the word?
  • A:
  • Q: What is the query to retrieve all phonemes, which do not occur in the last syllable of the word?
  • A:
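For the two questions above, a sketch under the assumption of the demo emuDB "ae" might read as follows. The position function selects syllables, and the domination operator relates them to phonemes.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# "p" phonemes dominated by the first syllable of a word
sl_p = query(ae, "[Phoneme == p ^ Start(Word, Syllable) == 1]")

# phonemes (any label) dominated by a syllable that is not word-final
sl_nonfinal = query(ae, "[Phoneme =~ .* ^ End(Word, Syllable) == 0]")
```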

Count

A schematic representation of a query string utilizing the count mechanism would be: "[Num(L1, L2) == N]" where “L1” contains “N” number of ITEMs in “L2”. For this type of query to work “L1” has to dominate “L2”. As the query matches a number (“N”) it is also possible to use the operators > (more than), < (less than) and != (not equal). The resulting segment list contains ITEMs of “L1”.

Example Q & A’s:
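Two count queries are sketched below, assuming the demo emuDB "ae". The numbers 4 and 6 are arbitrary illustrative choices.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# words that contain exactly 4 syllables (result ITEMs are on "Word")
sl_words = query(ae, "[Num(Word, Syllable) == 4]")

# syllables that contain more than 6 phonemes (result ITEMs are on "Syllable")
sl_syls = query(ae, "[Num(Syllable, Phoneme) > 6]")
```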

Count and Boolean &

A schematic representation of a query string combining the count and the boolean operators would be: "[L == E & Num(L1, L2) == N]" where ITEMs labeled “E” on level “L” are returned if the corresponding ITEM of “L1” contains “N” ITEMs of “L2”. “L1” has to dominate “L2”, and “L” and “L1” (not “L2”) have to refer to the same level (parallel attributeDefinitions of the same level may also be chosen).

Example Q & A’s:

  • Q: What is the query to retrieve the “Text” of all words, which consist of more than 5 phonemes?
  • A:
  • Q: What is the query to retrieve all strong syllables that contain 5 phonemes?
  • A:
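The two questions above might be answered as sketched here, assuming the demo emuDB "ae". The first query uses the attributeDefinition "Text" as the first argument of Num() so that the "Text" labels are returned.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# "Text" of all words that consist of more than 5 phonemes
sl_long = query(ae, "[Text =~ .* & Num(Text, Phoneme) > 5]")

# strong syllables that contain 5 phonemes
sl_strong = query(ae, "[Syllable == S & Num(Syllable, Phoneme) == 5]")
```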

Count and ^

A schematic representation of a query string combining the count function and the domination operator would be: "[L == E ^ Num(L1, L2) == N]" where ITEMs “E” on level “L” are dominated by “L1” and “L1” contains “N” ITEMs of “L2”. “L1” has to dominate “L2”, and “L” and “L1” must not refer to the same level.

Example Q & A’s:

  • Q: What is the query to retrieve all “m” phonemes in 3-syllable words?
  • A:
  • Q: What is the query to retrieve all “W”-syllables in words of 3 or fewer syllables?
  • A:
  • Q: What is the query to retrieve all words that contain a syllable with 4 phonemes?
  • A:
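Candidate answers for the three questions above are sketched below, assuming the demo emuDB "ae". In each case the label condition is on a different level than the first argument of Num().

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# "m" phonemes in words with 3 syllables
sl_m = query(ae, "[Phoneme == m ^ Num(Word, Syllable) == 3]")

# "W" syllables in words with 3 or fewer syllables
sl_w = query(ae, "[Syllable == W ^ Num(Word, Syllable) <= 3]")

# words that contain a syllable with 4 phonemes
sl_words = query(ae, "[Text =~ .* ^ Num(Syllable, Phoneme) == 4]")
```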

Combinations

^ and -> (Domination and Sequence)

A schematic representation of a query string combining the domination and the sequence operators would be: "[[A1 ^ A2] -> A3]" where “A1” and “A3” refer to the same level (parallel attributeDefinitions of the same level may also be chosen).

Example Q & A’s:

  • Q: What is the query to retrieve all “m” preceding “p” and “m” is part of a “S”-syllable?
  • A:
  • Q: What is the query to retrieve all “s” preceding “t” and “t” is part of a “W”-syllable?
  • A:
  • Q: What is the query to retrieve all “S”-syllables, which contain phoneme “s” and precede a “S”-syllable?
  • A:
  • Q: Same question as the question above but this time we want all “s” ITEMs where “s” is part of a “S”-syllable and this “S”-syllable precedes a “S”-syllable.
  • A: The query "[[Phoneme == s ^ Syllable == S] -> Syllable == S]" causes an error, as Phoneme == s and Syllable == S are not on the same level. Therefore the correct answer is:
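One way to keep both sequence arguments on the "Syllable" level while still returning the "s" ITEMs is to place the syllable expression on the left of the domination operator and mark the phoneme with the result modifier. The sketch below assumes the demo emuDB "ae" and is not necessarily the only valid formulation.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# both arguments of "->" are on the "Syllable" level;
# "#" projects the result onto the "s" phonemes
sl_s = query(ae, "[[Syllable == S ^ #Phoneme == s] -> Syllable == S]")
```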

^ and -> and & (Domination and Sequence and Boolean &)

Example Q & A’s:

  • Q: What is the query to retrieve the “Text” of all words, beginning with a schwa?
  • A:
  • Q: What is the query to retrieve all word-initial “m” ITEMs in a strong syllable, which precedes “o:”?
  • A:
  • Q: Same question as the question above but this time we want the text.
  • A:
  • Q: What is the query to retrieve the text of all three-syllable words that contain a schwa in the first syllable and precede the word “the”?
  • A: As this is a large multi-part question, let’s break it down:
    • 1.) The text of all three-syllable words: "[Text =~ .* & Num(Text, Syllable) == 3]"
    • 2.) A schwa occurs in the first syllable: "[Phoneme == @ ^ Start(Word, Syllable) == 1]"
    • 3.) The text is “the”: "[Text == the]"
    • Let’s now combine all three by saying "[1. ^ 2.]" and requiring that these are followed by 3. ("[[1. ^ 2.] -> 3.]"):
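The combination of the three parts can be sketched by assembling the query string in R. The block assumes the demo emuDB "ae"; the third part is written without its optional enclosing brackets, which is a choice made for this sketch.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

part1 = "[Text =~ .* & Num(Text, Syllable) == 3]"      # three-syllable words
part2 = "[Phoneme == @ ^ Start(Word, Syllable) == 1]"  # schwa in first syllable
part3 = "Text == the"                                  # following word is "the"

# "[[1. ^ 2.] -> 3.]"
query_str = paste0("[[", part1, " ^ ", part2, "] -> ", part3, "]")
sl = query(ae, query_str)
```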

A few more Q & A’s (because practice makes perfect)

Differences and incompatibilities to legacy EMU query language (R package ‘emu’, version 4.2)

In this section we will try to give a quick overview of the major changes concerning the query mechanics of emuR compared to the legacy R package emu in the version 4.2. This section is mainly meant for people transitioning to emuR from the legacy system.

Function call syntax

In emuR an emuDB has to be loaded into your current R session before the query() function can be used. This is achieved using the load_emuDB() function (see emuR_intro vignette for details). This was not necessary with the legacy emu.query() function.

Example calls to the query() function (prerequisite: a loaded emuDB called “andosl”):

Result type

The new default result type of a query is an object of the S3 class “emuRsegs”. This class inherits from the legacy EMU class “emusegs” and the well-known “data.frame” class. This means it is fully compatible with the legacy “emusegs” class, while containing some additional data, for example the IDs of the start and end ITEMs of each segment list row. Each row of this “data.frame” is a sequence of one or more annotational units (i.e. ITEMs) on a single level. For more information about this object see help(emuRsegs).

The query function of emuR returns an empty segment list (row count is zero) if the query does not match any ITEM. If the legacy EMU function emu.query() didn’t find any matches it would throw an error with the message: "Can't find the query results in emu.query: there may have been a problem with the query command.".
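The empty-result behavior can be checked with a label that is assumed not to occur in the data. The sketch below uses the demo emuDB "ae"; the label "thisLabelDoesNotExist" is a made-up placeholder.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# made-up label that is assumed not to be present on the "Phonetic" level
sl = query(ae, "[Phonetic == thisLabelDoesNotExist]")

# no error is thrown; the segment list simply has no rows
n_matches = nrow(sl)
```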

Bundle (utterance) names

The emuDB format used by the emuR package introduces the concept of bundles that are grouped together in sessions (see emuDB vignette for further details). As legacy EMU databases did not have the concept of a session, all the utterances of a legacy database are placed in a single default session called “0000”. Therefore the “utts” column of a segment list is prefixed by the session name, for example “0000:msajc003” instead of just “msajc003” as in the legacy system.

The result modifier hash tag ‘#’

Compared to the legacy EMU system, which allowed multiple occurrences of the hash tag “#” in a query string, the query() function only allows a single result modifier. This ensures that only consistent result sets are returned. If you nevertheless want multiple result sets in one segment list, we recommend concatenating the result sets of separate queries using the rbind() function.
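Concatenating two result sets might look as sketched below. The block assumes the demo emuDB "ae" and its parallel attributeDefinitions "Text" and "Accent" on the "Word" level; each query uses exactly one result modifier.

```r
library(emuR)

# demo emuDB "ae" (created only if it is not already present in tempdir())
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# same ITEMs, once projected onto "Text" and once onto "Accent"
sl_text   = query(ae, "[#Text =~ .* & Accent == S]")
sl_accent = query(ae, "[Text =~ .* & #Accent == S]")

# one segment list containing both result sets
sl_both = rbind(sl_text, sl_accent)
```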

Interpretation of the hash tag “#” in conjunction operator queries

legacy EMU

moving data from Tcl to R
Read 1 records
segment  list from database:  andosl 
query was:  [Text=spring & #Accent=S] 
  labels    start      end     utts
1 spring 2288.959 2704.466 msajc094
moving data from Tcl to R
Read 1 records
segment  list from database:  andosl 
query was:  [#Text=spring & #Accent=S] 
  labels    start      end     utts
1 spring 2288.959 2704.466 msajc094

The hash tag “#” had no effect.

emuR

segment  list from database:  andosl 
query was:  [Text=spring & #Accent=S] 
  labels    start      end          utts
1      S 2288.975 2704.475 0000:msajc094

Returns the same ITEM, but with the label of the hashed attributeDefinition name. The second legacy example is not a valid emuR query (two hash tags).

Error in query.database.eql.KONJA(dbConfig, qTrim) : 
Only one hashtag allowed in linear query term: #Text=spring & #Accent=S 

The query() function throws an error as it would be necessary to return each item twice to get both the “Text” and “Accent” labels.

Bugs in legacy EMU function emu.query

Alternative labels in inequality queries

Example:

legacy EMU

moving data from Tcl to R
Read 4 records
segment  list from database:  ae 
query was:  [Text!=beautiful|futile ^ Phoneme=u:] 
     labels    start      end     utts
1       new  475.802  666.743 msajc057
2    futile  571.999 1091.000 msajc010
3        to 1091.000 1222.389 msajc010
4 beautiful 2033.739 2604.489 msajc003

We assume that the OR operator “|” was simply ignored when used in conjunction with the inequality operator “!=”.

emuR

segment  list from database:  ae 
query was:  [Text!=beautiful|futile ^ Phoneme=u:] 
  labels    start      end          utts
1     to 1091.025 1222.375 0000:msajc010
2    new  475.825  666.725 0000:msajc057

Errors caused by missing or superfluous blanks / parentheses

Certain queries in the legacy EMU system required blanks around certain operators, as well as certain parentheses, to be present or absent. If this was not the case, the legacy query engine sometimes threw cryptic errors, sometimes crashed and, in the worst cases, took the entire R session with it. The query engine of the emuR package is much more robust against missing or superfluous blanks / parentheses.

Order of result segment list

For the legacy EMU query it was never explicitly defined, at least to our knowledge, if and how the resulting segment list was ordered. If the result type of the query() function is set to "emuRsegs" the resulting list is ordered by UUID, session, bundle and sample start position. If it is set to "emusegs" the resulting list is ordered by the fields utts and start.

Additional features

  • The query mechanics of emuR accepts the double equal character string “==” (recommended) as well as the single “=” equal character string as an equal operator.
  • The EQL2 has the capability to query labels by matching regular expressions using the ‘=~’ (matching) and ‘!~’ (non-matching) operators.
    • For example: query("andosl", "Text =~ .*tz.*")

Extended Backus–Naur Form (EBNF)

EBNF adapted from (John 2012). As the original EBNF was formulated in German, a few of the abbreviation terms (e.g. “DOMA” is the abbreviation for the German term “Dominanzabfrage”) were translated into English abbreviations (e.g. “DOMQ” is the abbreviation for the English term “dominance query”).

Terminal symbols of EQL2 (operators) and their meaning.

The terminal symbols described below are ordered descending by their binding priority.

Symbol  Meaning
#       Result modifier (projection)
,       Parameter list separator
==      Equality (new in version 2 of the EQL; added for cleaner syntax)
=       Equality (optional; for backwards compatibility)
!=      Inequality
=~      Regular expression matching
!~      Regular expression non-matching
>       Greater than
>=      Equal or greater than
<       Less than
<=      Equal or less than
|       Alternatives separator
&       Conjunction of equal rank
^       Dominance conjunction
->      Sequence operator

Terminal symbols of EQL2 (brackets) and their meaning.

Symbol  Meaning
'       Quotes literal string
(       Function parameter list begin
)       Function parameter list end
[       Sequence or dominance enclosing begin bracket
]       Sequence or dominance enclosing end bracket

Terminal symbols of EQL2 (functions) and their meaning

Symbol  Meaning
Start   Start
Medial  Medial
End     Final
Num     Count

Formal description of EMU query language EQL2

EBNF term    Abbreviation    Conditions
EQL = CONJQ | SEQQ | DOMQ;    EMU Query Language
DOMQ = "[", ( CONJQ | DOMQ | SEQQ ), "^", ( CONJQ | DOMQ | SEQQ ), "]";    dominance query    levels must be hierarchically associated
SEQQ = "[", ( CONJQ | SEQQ | DOMQ ), "->", ( CONJQ | SEQQ | DOMQ ), "]";    sequential query    levels must be linearly associated
CONJQ = { "[" }, SQ, { "&", SQ }, { "]" };    conjunction query    levels must be linearly associated
SQ = LABELQ | FUNCQ;    simple query
LABELQ = [ "#" ], LEVEL, ( "=" | "==" | "!=" | "=~" | "!~" ), LABELALTERNATIVES;    label query
FUNCQ = POSQ | NUMQ;    function query
POSQ = POSFCT, "(", LEVEL, ",", LEVEL, ")", ( "=" | "==" ), ( "0" | "1" | "TRUE" | "FALSE" );    position query    levels must be hierarchically associated; second level determines semantics
NUMQ = "Num", "(", LEVEL, ",", LEVEL, ")", COP, INTPN;    number query    levels must be hierarchically associated; first level determines semantics
LABELALTERNATIVES = LABEL , { "|", LABEL };    label alternatives
LABEL = LABELING | ( "'", LABELING, "'" );    label    levels must be part of the database structure; LABELING is an arbitrary character string or a label group class configured in the emuDB; result modifier ‘#’ may only occur once
POSFCT = "Start" | "Medial" | "End";    position function
COP = "=" | "==" | "!=" | ">" | "<" | "<=" | ">=";    comparison operator
INTPN = "0" | INTP;    integer positive with null
INTP = DIGIT-"0", { DIGIT };    integer positive
DIGIT = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";    digit

INFO: The LABELING term used in the LABEL EBNF term can represent any character string that is present in the annotation. As this can be any combination of Unicode characters we chose not to explicitly list them as part of the EBNF.

Restrictions

A query may only contain a single result modifier “#” (hash tag)

References

Cassidy, Steve, and Jonathan Harrington. 2001. “Multi-Level Annotation in the Emu Speech Database Management System.” Speech Communication 33:61–78.

Harrington, Jonathan, and Steve Cassidy. 2002. “The Emu-Query Language (Anhang).” IPDS Kiel.

John, Tina. 2012. “Emu Speech Database System.” PhD thesis, LMU-Munich.