In the first part of this series we defined an Xtext grammar based on boilerplates in order to control the use of natural language and create acceptable requirements as they are written. Another approach to improving the quality of textual requirements is the use of Natural Language Processing (NLP) techniques to check their grammar and vocabulary after they have been written. Read on to learn more!
The NLP technique related to this post is Part-Of-Speech (POS) tagging, which categorizes the tokens (words or phrases) of a sentence into different types such as verbs or nouns. Due to the fixed grammatical structure of the boilerplates defined by our Xtext grammar, a more expensive grammatical analysis using NLP techniques is not necessary.
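To make this concrete, here is a hand-tagged example (the sentence and its tags are our own illustration, using the Penn Treebank tag set; they are not output of the tooling discussed below):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PosTagExample {
    // Illustrative tagging of a sample requirement sentence.
    // Penn Treebank tags: DT = determiner, NN = noun (singular),
    // MD = modal, VB = verb base form, JJ = adjective.
    public static Map<String, String> exampleTags() {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("The", "DT");
        tags.put("system", "NN");
        tags.put("shall", "MD");
        tags.put("display", "VB");
        tags.put("the", "DT");
        tags.put("current", "JJ");
        tags.put("temperature", "NN");
        return tags;
    }

    public static void main(String[] args) {
        exampleTags().forEach((token, tag) -> System.out.println(token + "/" + tag));
    }
}
```

A POS-tagger produces exactly this kind of token-to-tag mapping, which is what the pattern matching below operates on.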
We will use POS-tagging to define constraints on the use of domain-specific concepts in the free-text parts of the boilerplates. In our approach these concepts are objects and functions, which appear as nouns and verbs in the free text.
According to our grammar, this means that each RequirementEnd must contain exactly one verb representing a Function and at least one noun representing a DomainObject in our Glossary.
Model-to-text transformation
In order to analyse the boilerplates using NLP techniques, we have to implement a model-to-text transformation which transforms the boilerplates from their model representation to plain text. To this end, we implement a new class which contains toString methods for the different elements of the boilerplates. Due to the structure of the rules we defined in part one of this series, the transformation is rather simple. The following example shows the toString method for the type TextWithReferences.
def toString(TextWithReferences text) {
    if (!text.onlyRefs.empty) {
        return text.onlyRefs.map[name].join(" ").trim
    }
    val elements = newLinkedList
    elements.addAll(text.refBefore.map[name])
    elements.addAll(text.text)
    text.after.forEach [ comb |
        elements.addAll(comb.refs.map[name])
        elements.addAll(comb.text)
    ]
    elements.addAll(text.finalRef.map[name])
    elements.join(" ").trim
}
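The core of this transformation is just a join-and-trim over reference names and text fragments. As a plain-Java sketch (with simplified list parameters standing in for the generated Xtext model classes, and omitting the intermediate "after" combinations), the same logic looks like this:

```java
import java.util.ArrayList;
import java.util.List;

public class TextWithReferencesToString {
    // Simplified stand-in for the Xtend method above: the model lists
    // are passed in directly instead of being read from TextWithReferences.
    public static String toPlainText(List<String> onlyRefs, List<String> refBefore,
                                     List<String> text, List<String> finalRef) {
        // If the element consists only of references, join those and stop.
        if (!onlyRefs.isEmpty()) {
            return String.join(" ", onlyRefs).trim();
        }
        // Otherwise concatenate leading references, free text, and trailing references.
        List<String> elements = new ArrayList<>(refBefore);
        elements.addAll(text);
        elements.addAll(finalRef);
        return String.join(" ", elements).trim();
    }
}
```

For example, toPlainText(List.of(), List.of("sensor"), List.of("measures", "the"), List.of("temperature")) yields "sensor measures the temperature".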
The POS-tagging is done by a POS-tagger, which determines the Part-Of-Speech tag (POS-tag) of a token based on its sentence context. For this reason, we have to provide toString methods for all elements related to the boilerplates, so that we can pass complete sentences as input to the tagger.
NLP framework integration
In order to validate natural language we use the framework TokensRegex from the Stanford Natural Language Processing Group. It's part of the Stanford CoreNLP suite and provides an API which allows the use of cascaded regular expressions over tokens. Basically, this means the framework allows pattern matching over sentence phrases or words using regular expressions and POS-tags. If you want to use it directly, you can integrate it via Maven.
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.6.0</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.6.0</version>
  <classifier>models</classifier>
</dependency>
I provide a wrapper around the TokensRegex framework. It's called POSRegex and is available on GitHub. I recommend this wrapper because it is tailored to the use cases of this series. To integrate the POSRegex.jar, we add it to the Bundle-ClassPath in the manifest.
Bundle-ClassPath: lib/POSRegex-0.0.2.jar, .
We have to make sure that POSRegex is initialized only once, at Eclipse startup, because loading the reference files (treebanks) used by the POS-tagger is time-consuming. Therefore, we register it as an eager singleton in the runtime module.
override configure(Binder binder) {
    super.configure(binder);
    binder.bind(IPOSRegexPattern).to(POSRegexPattern).asEagerSingleton();
}
Finally, we add the following line to the build.properties of the runtime project to make sure the jar is included in our Eclipse plugin.
jars.extra.classpath = lib/POSRegex-0.0.2.jar
Natural language validation
With the model-to-text transformation and the natural language library in place, we are now able to validate the usage of domain-specific concepts. We separate the natural language validation from the model-based validation rules by defining a new class extending Abstract<YourLanguageName>Validator in the validation package of the runtime project. Xtext allows the use of multiple validators via the @ComposedChecks annotation. In order to register our new validator, we annotate the <YourLanguageName>Validator in the following way.
@ComposedChecks(validators = #[MyNaturalLanguageValidator])
public class Validator extends AbstractValidator {
}
The next step is to implement the validation rules in the natural language validator. To do so, we inject the IPOSRegexPattern and our model-to-text converter, which is called BoilerplateToStringConverter in the following example. The method checkObjectWithDetailsContainsFunction checks whether each RequirementEnd contains exactly one verb. It is called by Xtext for every RequirementEnd instance.
@Inject IPOSRegexPattern posRegex
@Inject BoilerplateToStringConverter converter

@Check(CheckType.NORMAL)
def checkObjectWithDetailsContainsFunction(RequirementEnd end) {
    val pattern = '(?$verb[pos:VB|pos:VBD|pos:VBG|pos:VBN|pos:VBP|pos:VBZ])'
    val requirement = end.eContainer as Requirement
    val reqString = converter.toString(requirement)
    val result = posRegex.match(reqString, pattern)
    val objectWithDetails = end.objectWithDetails
    val owdString = converter.toString(objectWithDetails)
    val verbs = result.tokensByGroup.get("verb")
    val verbsInOwdString = countOccurrences(owdString, verbs)
    val literal = Package.Literals.REQUIREMENT_END__OBJECT_WITH_DETAILS
    if (verbsInOwdString.size != 1) {
        error("This text must contain one verb which stands for a function of the system", literal)
    }
}
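The helper countOccurrences is not shown in the listing; its behavior here is an assumption based on how it is used (it must return the matched verbs that actually occur in the free-text part, since the size of its result is compared to one). A minimal Java sketch under that assumption:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VerbCounter {
    // Hypothetical sketch of countOccurrences: returns the subset of 'verbs'
    // that occur as whitespace-separated tokens of 'text' (case-insensitive).
    public static List<String> countOccurrences(String text, List<String> verbs) {
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));
        List<String> found = new ArrayList<>();
        for (String verb : verbs) {
            if (tokens.contains(verb.toLowerCase())) {
                found.add(verb);
            }
        }
        return found;
    }
}
```

With this, countOccurrences("display the current temperature", List.of("display", "store")) returns a single-element list, so the check above would pass.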
The natural language validation can be time-consuming. Therefore, we annotate the method with @Check(CheckType.NORMAL) to ensure that the check is only executed when the user saves the document. In the first step we define a pattern containing a capture group named verb which matches multiple types of verb POS-tags (VB = verb base form, VBD = verb past tense, VBG = verb gerund or present participle, ...). A detailed description of the syntax can be found in the TokensRegex documentation. The supported POS-tags are equivalent to the POS-tags defined by the University of Pennsylvania which are described here. In the next step we get the Requirement instance, translate it to text, and run the pattern matching. The returned result contains all found verbs in the verb group. Afterwards, we count the occurrences of these verbs in the free-text part (objectWithDetails) of the boilerplate. If not exactly one verb is found, we annotate the objectWithDetails with an error.
We can also apply this kind of validation to domain objects. The following pattern could be used to check that each objectWithDetails contains a domain object whose name consists of one or two nouns.
(?$noun[pos:NN|pos:NNS|pos:NNP|pos:NNPS]{1,2})
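The {1,2} quantifier restricts a match to one or two consecutive noun tokens. Without the TokensRegex engine at hand, this effect can be illustrated with a small Java sketch that scans a POS-tag sequence for maximal noun runs (an approximation of the pattern's semantics for illustration, not the library itself):

```java
import java.util.ArrayList;
import java.util.List;

public class NounRunFinder {
    // Illustrative only: computes the lengths of maximal runs of consecutive
    // noun tags (NN, NNS, NNP, NNPS all start with "NN") in a POS-tag sequence.
    public static List<Integer> nounRunLengths(List<String> tags) {
        List<Integer> runs = new ArrayList<>();
        int run = 0;
        for (String tag : tags) {
            if (tag.startsWith("NN")) {
                run++;
            } else {
                if (run > 0) runs.add(run);
                run = 0;
            }
        }
        if (run > 0) runs.add(run);
        return runs;
    }

    // A candidate domain object is a run of one or two nouns,
    // mirroring the {1,2} quantifier of the pattern above.
    public static boolean containsDomainObjectCandidate(List<String> tags) {
        return nounRunLengths(tags).stream().anyMatch(len -> len >= 1 && len <= 2);
    }
}
```

For the tag sequence DT NN NN VBZ DT NN ("the temperature sensor measures the temperature"), this finds noun runs of length 2 and 1, so a domain object candidate exists.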
Summary & outlook
In this part we saw how to validate the use of domain-specific concepts inside the free-text parts of the boilerplates. To that end, we integrated a natural language framework and used it together with the validation API of Xtext.
In Part 3 we will extend the use of NLP techniques: we will identify domain-specific concepts in the free text, annotate them, and provide the user with quick fixes to either link the concepts to existing entries in the glossary or create new glossary entries. For this, we will use the quick fix API of Xtext.