8 min. reading time

In the first part of this series we defined an Xtext grammar based on boilerplates in order to control the use of natural language and create acceptable requirements as they are written. Another approach to improving the quality of textual requirements is to use Natural Language Processing (NLP) techniques to check their grammar and vocabulary after they have been written. Read on to learn more!

The NLP technique used in this post is Part-Of-Speech (POS) tagging, which categorizes the tokens (words or phrases) of a sentence into different types such as verbs or nouns. Because the boilerplates defined by our Xtext grammar already have a fixed grammatical structure, a more expensive grammatical analysis using NLP techniques is not necessary.

We will use POS-tagging to define constraints for the use of domain-specific concepts in the free text parts of the boilerplates. In our approach these concepts are objects and functions, which appear as nouns and verbs in the free text parts of the boilerplates.

In terms of our grammar, this means that each RequirementEnd must contain exactly one verb, which represents a Function, and at least one noun, which represents a DomainObject in our Glossary.
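To make the constraint concrete, here is a minimal Java sketch that checks it on tokens that are already tagged. The word/TAG input notation, the class name RequirementEndConstraint and the hand-assigned tags are assumptions made for illustration; in the actual validator the tags come from the POS-tagger.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative check: exactly one verb and at least one noun. */
final class RequirementEndConstraint {

    /** Extracts the tag from a token written as "word/TAG". */
    static String tagOf(String token) {
        return token.substring(token.lastIndexOf('/') + 1);
    }

    /** True if the tokens contain exactly one verb and at least one noun. */
    static boolean satisfiesConstraint(List<String> taggedTokens) {
        int verbs = 0;
        int nouns = 0;
        for (String token : taggedTokens) {
            String tag = tagOf(token);
            if (tag.startsWith("VB")) {
                verbs++; // VB, VBD, VBG, VBN, VBP, VBZ
            } else if (tag.startsWith("NN")) {
                nouns++; // NN, NNS, NNP, NNPS
            }
        }
        return verbs == 1 && nouns >= 1;
    }

    public static void main(String[] args) {
        // Tags assigned by hand; a real tagger may decide differently.
        List<String> valid = Arrays.asList("display/VB", "the/DT", "current/JJ", "speed/NN");
        List<String> noVerb = Arrays.asList("the/DT", "current/JJ", "speed/NN");
        System.out.println(satisfiesConstraint(valid));  // prints true
        System.out.println(satisfiesConstraint(noVerb)); // prints false
    }
}
```

The prefix test on "VB" and "NN" relies on the Penn Treebank naming scheme, in which all verb tags start with VB and all noun tags start with NN.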

Model-to-text transformation

In order to analyse the boilerplates using NLP techniques, we have to implement a model-to-text transformation which converts the boilerplates from their representation in the model to plain text. To do so, we implement a new class which contains toString methods for the different elements of the boilerplates. Due to the structure of the rules we defined in part one of this series, the transformation is rather simple. The following example shows the toString method for the type TextWithReferences.

def toString(TextWithReferences text) {
    if (!text.onlyRefs.empty) {
        return text.onlyRefs.map[name].join(" ").trim
    }
    val elements = newLinkedList
    elements.addAll(text.refBefore.map[name])
    elements.addAll(text.text)
    text.after.forEach [ comb |
        elements.addAll(comb.refs.map[name])
        elements.addAll(comb.text)
    ]
    elements.addAll(text.finalRef.map[name])
    elements.join(" ").trim
} 

POS-tagging is done by a POS-tagger, which determines the Part-Of-Speech tag (POS-tag) of a token based on the sentence context. Therefore, we have to create toString methods for all elements related to the boilerplates in order to provide complete sentences as input for the tagger.

NLP framework integration

In order to validate natural language we use the TokensRegex framework from the Stanford Natural Language Processing Group. It is part of the Stanford CoreNLP suite and provides an API for cascaded regular expressions over tokens. Basically, this means the framework allows pattern matching over phrases or words of a sentence using regular expressions and POS-tags. If you want to use it directly, you can integrate it via Maven.

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
    <classifier>models</classifier>
</dependency> 

I provide a wrapper around the TokensRegex framework. It is called POSRegex and is available on GitHub. I recommend this wrapper because it is tailored to the use in this series. To integrate the POSRegex jar, we add it to the Bundle-ClassPath in the manifest.

Bundle-ClassPath: lib/POSRegex-0.0.2.jar,
 . 

We have to make sure that POSRegex is initialized only once, at Eclipse startup, because loading the reference files (treebanks) used by the POS-tagger is time-consuming. Therefore, we register it as an eager singleton in the runtime module.

override configure(Binder binder) {
    super.configure(binder);
    binder.bind(IPOSRegexPattern).to(POSRegexPattern).asEagerSingleton();
} 
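The effect of this binding can be pictured in plain Java without Guice. The following sketch is an analogy only: ExpensiveTagger and its load counter are hypothetical names, and the constructor merely stands in for the slow loading of the tagger's reference files.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a component that is costly to create. */
final class ExpensiveTagger {

    /** Counts how often the expensive initialization ran. Declared before INSTANCE on purpose. */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Created once when the class is initialized, then shared by all callers. */
    private static final ExpensiveTagger INSTANCE = new ExpensiveTagger();

    private ExpensiveTagger() {
        // A real implementation would load the tagger's reference files here.
        LOADS.incrementAndGet();
    }

    static ExpensiveTagger getInstance() {
        return INSTANCE;
    }

    public static void main(String[] args) {
        ExpensiveTagger first = ExpensiveTagger.getInstance();
        ExpensiveTagger second = ExpensiveTagger.getInstance();
        System.out.println(first == second); // prints true
        System.out.println(LOADS.get());     // prints 1
    }
}
```

With the Guice binding shown above, the container plays the role of the static field: every injection point receives the same instance, so the loading cost is paid a single time.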

Finally we add the following line to the build.properties of the runtime project to make sure the jar is included in our Eclipse plugin. 

jars.extra.classpath = lib/POSRegex-0.0.2.jar   

Natural language validation

With the model-to-text transformation and the natural language library in place, we are now able to validate the usage of domain-specific concepts. We separate the natural language validation from the model-based validation rules by defining a new class extending Abstract<YourLanguageName>Validator in the validation package of the runtime project. Xtext supports multiple validators via the @ComposedChecks annotation. In order to register our new validator we annotate the <YourLanguageName>Validator in the following way.

@ComposedChecks(validators = #[MyNaturalLanguageValidator])
public class <YourLanguageName>Validator extends Abstract<YourLanguageName>Validator {
}

The next step is to implement the validation rules in the natural language validator. To do so, we inject the IPOSRegexPattern and our model-to-text converter, which is called BoilerplateToStringConverter in the following example. The method checkObjectWithDetailsContainsFunction checks whether each RequirementEnd contains exactly one verb. Xtext calls it for every RequirementEnd instance.

@Inject
IPOSRegexPattern posRegex

@Inject
BoilerplateToStringConverter converter

@Check(CheckType.NORMAL)
def checkObjectWithDetailsContainsFunction(RequirementEnd end) {
    // Named group "verb" matches any of the listed verb POS-tags
    val pattern = '(?$verb[pos:VB|pos:VBD|pos:VBG|pos:VBN|pos:VBP|pos:VBZ])'
    // Tag the complete requirement so the tagger sees the full sentence context
    val requirement = end.eContainer as Requirement
    val reqString = converter.toString(requirement)
    val result = posRegex.match(reqString, pattern)
    // Keep only the verbs that occur in the free text part
    val objectWithDetails = end.objectWithDetails
    val owdString = converter.toString(objectWithDetails)
    val verbs = result.tokensByGroup.get("verb")
    val verbsInOwdString = countOccurrences(owdString, verbs)
    val literal = <YourLanguageName>Package.Literals.REQUIREMENT_END__OBJECT_WITH_DETAILS
    if (verbsInOwdString.size != 1) {
        error("This text must contain one verb which stands for a function of the system", literal)
    }
}

The natural language validation can be time-consuming. Therefore, we annotate the method with @Check(CheckType.NORMAL) to ensure that the check is only executed when the user saves the document. In the first step we define a pattern containing a named capturing group called verb, which matches multiple types of verb POS-tags (VB = verb, base form; VBD = verb, past tense; VBG = verb, gerund or present participle; ...). A detailed description of the syntax can be found in the TokensRegex documentation. The supported POS-tags are equivalent to the Penn Treebank POS-tags defined by the University of Pennsylvania. In the next step we get the Requirement instance, translate it to text and run the pattern matching. The returned result contains all verbs found in the verb group. Afterwards, we count the occurrences of these verbs in the free text part (objectWithDetails) of the boilerplate. If the number of verbs found is not exactly one, we annotate the objectWithDetails with an error.
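The helper countOccurrences is not shown in this post. A minimal Java sketch of what such a helper could look like is given below. The class name VerbCounter, the List<String> return type and the matching rules (whitespace-separated words, case-insensitive comparison, no punctuation handling) are assumptions, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: which of the given verbs occur in the free text? */
final class VerbCounter {

    /** Returns every word of freeText that equals one of the verbs, once per occurrence. */
    static List<String> countOccurrences(String freeText, List<String> verbs) {
        Set<String> verbSet = new HashSet<>();
        for (String verb : verbs) {
            verbSet.add(verb.toLowerCase(Locale.ROOT));
        }
        List<String> found = new ArrayList<>();
        for (String word : freeText.trim().split("\\s+")) {
            if (verbSet.contains(word.toLowerCase(Locale.ROOT))) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Verbs as they might be reported for the complete requirement sentence.
        List<String> verbs = Arrays.asList("display");
        System.out.println(countOccurrences("display the current speed", verbs).size()); // prints 1
        System.out.println(countOccurrences("the current speed", verbs).size());         // prints 0
    }
}
```

Comparing plain strings has a known weakness: a word that is tagged as a verb elsewhere in the sentence but used as a noun in the free text would still be counted. Matching on token positions instead of strings would avoid this.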

The same kind of validation can be applied to the domain objects. The following pattern could be used to check that each objectWithDetails contains a domain object whose name consists of one or two nouns.

(?$noun[pos:NN|pos:NNS|pos:NNP|pos:NNPS]{1,2})  
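The idea behind this pattern, a regular expression evaluated over tokens instead of characters, can be imitated with java.util.regex. The sketch below is not TokensRegex or POSRegex: it assumes tokens in a hand-tagged word/TAG notation, maps every token to one character (N for noun, O for anything else) and then searches for runs of one or two N. The class name NounRunFinder is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative token-level matching for "one or two consecutive nouns". */
final class NounRunFinder {

    private static final Pattern ONE_OR_TWO_NOUNS = Pattern.compile("N{1,2}");

    /** Expects tokens written as "word/TAG" and returns the matched noun phrases. */
    static List<String> findNounPhrases(List<String> taggedTokens) {
        List<String> words = new ArrayList<>();
        StringBuilder classes = new StringBuilder();
        for (String token : taggedTokens) {
            int slash = token.lastIndexOf('/');
            words.add(token.substring(0, slash));
            // One character per token, so match offsets equal token indices.
            classes.append(token.substring(slash + 1).startsWith("NN") ? 'N' : 'O');
        }
        List<String> phrases = new ArrayList<>();
        Matcher matcher = ONE_OR_TWO_NOUNS.matcher(classes);
        while (matcher.find()) {
            phrases.add(String.join(" ", words.subList(matcher.start(), matcher.end())));
        }
        return phrases;
    }

    public static void main(String[] args) {
        // Tags assigned by hand; a real tagger may decide differently.
        List<String> tokens = Arrays.asList("display/VB", "the/DT", "engine/NN", "speed/NN");
        System.out.println(findNounPhrases(tokens)); // prints [engine speed]
    }
}
```

Because each token is encoded as exactly one character, the quantifier {1,2} counts tokens, which is the same role it plays in the token pattern above.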

Summary & outlook

In this part we saw how to validate the use of domain-specific concepts inside the free text parts of the boilerplates. To do so, we integrated a natural language framework and used it together with the validation API of Xtext.

In Part 3 we will extend the use of NLP techniques. We will identify domain-specific concepts in the free text, annotate them and offer the user quick fixes either to link the concepts with existing concepts in the glossary or to create new glossary entries. For this, we will use the quick fix API of Xtext.
