What is going wrong with the Semantic Web?

  • Posted on: 10 April 2018
  • By: warren

The US Semantic Technologies Symposium was held at Wright State University a month ago, where there were great discussions with Craig Knoblock about SPARQL server reliability1, Eric Kansa about storing archeology data and Open Context, Eric Miller about the workings of the W3C, Midwest farmers and old bikes, Matthew Lange about tracking crops with LOD, and a 'fruitful' talk with Evan Wallace about farm data storage standards.

Thinking through these conversations, I decided to outline what I think are the troubling conclusions for our area, namely that a) Semantic Web adoption is lagging, b) we keep rehashing old problems without moving on, and c) we fail to support our own projects over the long term. I'll then suggest a few solutions.

Semantic Web adoption is not where we'd like it to be

Very, very few people care about data management2. Even fewer understand it. I'd go so far as to say that the majority of the IT community spends its time moving strings to the end-user's screen, focusing primarily on user communications and being told what to communicate. Other developers may worry about analysis, networking stacks or storage, but few of them care about the data itself.

That leaves the database developer, whose entire attention is on bread-and-butter issues. Why did we have the hubris to think that people would care about the Semantic Web when most of them have no data management problem to worry about? 31% of webpages are reported to contain schema.org markup3, primarily because web developers believe it will help them with SEO, not because it helps them with data management or interoperability.

Reinventing the wheel. Again.

I regret I didn't note the speaker who said "If your developers care about JSON, I don't care about your developers", because it goes to the heart of the matter of poor Semantic Web training and education. At this stage, arguments about serializations are about as relevant as debating whether submarines can swim4. There was a lot of talk at the meeting about creating new JSON standards to handle corner cases, without knowledge of or regard for previous standards, because "it's not JSON and people want JSON". The Semantic Web stack translates the model into whatever serialization is needed, in most cases negotiated without programmer involvement: JSON-LD is really nice for web developers, RDF/XML for XPath, Turtle for authoring, N3 for throughput, and so on. David Booth also noted the panoply of standards and vocabularies. A number of them have been beautifully engineered by domain experts (GeoSPARQL5, OWL-Time, SOSA and PROV come to mind); it is an outright waste of everyone's time not to reuse them.
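To make the point concrete, here is a toy sketch of the idea that the model, not the serialization, is what matters: the same set of triples can be emitted in different formats on request. This is plain Python with made-up helper names, not a real RDF library (a production stack would use something like rdflib and proper content negotiation); it is purely illustrative.

```python
import json

# One data model: a list of (subject, predicate, object) triples.
triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
]

def to_turtle(triples):
    """A Turtle-like, human-authorable rendering of the triples."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith("ex:") else json.dumps(o)  # quote literals
        lines.append(f"{s} {p} {obj} .")
    return "\n".join(lines)

def to_jsonld(triples):
    """A JSON-LD-like rendering of the same triples, for web developers."""
    docs = {}
    for s, p, o in triples:
        docs.setdefault(s, {"@id": s})[p] = o
    return json.dumps(list(docs.values()), indent=2)

# The model is identical; only the requested format differs, much as a
# server would negotiate it from an Accept header without the
# programmer's involvement.
serializers = {"text/turtle": to_turtle, "application/ld+json": to_jsonld}

def serialize(triples, accept):
    return serializers[accept](triples)
```

Arguing over which of the two outputs is "the" data misses that both are projections of the same graph.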

We expect research groups to act like service providers

The lack of reliable services and exemplars was also noted: the curated New York Times RDF dataset is no longer answering, the BBC has cut back on outward-looking Semantic Web services, and DBpedia, at the heart of the LOD cloud, is still running on a borrowed virtual machine while the DBpedia Association has a hard time raising funds. I would like to echo Juan Sequeda's post that we should set aside some grant monies for resources such as Linked Open Vocabularies, a great vocabulary/ontology location tool6. Getting operational funding is always a slog, but we cannot advocate for a technology when the exemplars are not maintained and disappear overnight.

In the past we've gotten away with a lot by stuffing machines under graduate students' desks and getting them to write applications between course work and thesis submission. This cannot continue, and we need to make a serious effort at long-term sustainability.

What we should be doing

The Semantic Web stack is annoyingly complex, not because of the technology but because of the problems it is trying to solve. Its critics abound (even Hitler, apparently), but there is no real alternative for dealing with data at scale. Organizationally, it sits uncomfortably between two communities:

The first is the small group of developers who deal with web APIs, mostly independently from each other. Integrations are done on an ad-hoc basis when a one-off business requirement presents itself. These are the people who came up with ideas like Swagger: simple documentation that focuses on programmatic operations with little semantics about the transaction itself. Want it in orange? Set colour_id to 2. Why 2? Because that's the value some developer arbitrarily decided on at the time. Why is your self-evident use case not handled? Because no one has needed it before. Development is incremental; if an error occurs, file a ticket on GitHub, no harm done.

The second is the Enterprise Resource Planning crowd that has been doing this for a very long time, albeit usually within a single organization and with massive amounts of corporate resources. Because they care deeply that orders for 5,000 sheets of 8.5x11 paper aren't interpreted as orders for 8,511 sheets of 5,000 in² paper, they tend to document everything (a single API document may run to hundreds of pages) and have a neurotic attention to change management. There have been spectacular failures when implementing these mammoth7 systems, but generally you can order something from across the world and it will show up on your doorstep next week.

The Semantic Web has a lot to offer both these communities: a ready-made semantic modelling language8 that is reusable by web APIs, URL-based global identifiers, and a unified multilingual documentation framework that fits corporate needs. Bridges need to be built with application domain experts and with existing data ecosystems. Logistics systems such as the Global Trade Item Number are pushing the limits of what we can do with barcodes and relational databases. We want the Internet of Things, the Internet of Food, a smart power and transportation grid, and a bibliographic system that isn't going to split its seams. The only way we can achieve all of this is to have the data being generated supported by context and the Semantic Web.

  

Talk: Ontologies, Semantic Web, and Linked Data for Business

  • Posted on: 28 November 2017
  • By: warren

13 February 2018, from 10am to 2pm.
 

This is a half-day workshop about current business uses of the Semantic Web. It is targeted at executives, project managers and subject matter experts who want to understand what problems the technology can solve. The workshop will concern itself with the basic building blocks of the Semantic Web and the solutions that each brings to an organization. The objective is not to provide in-depth technical training; rather, we wish to present an overview that will enable a varied audience to determine what this technology can provide for their organization. Specific aspects will include recent standards such as FIBO and schema.org, the recruitment and training of staff, as well as opportunities for localization to different markets and for lowering the cost of regulatory reporting. In order to anchor the discussions, the tribulations of a fictional company, "The Triples Coffee Company", will be used to present business cases within different areas of an enterprise. Example solutions using a Semantic Web approach will then be outlined for each business case.

Jointly presented by Robert Warren, Ph.D., Jennifer Schellinck, Ph.D., and Patrick Boily, Ph.D.

AI and the Law (Part II) - How AI Works

  • Posted on: 31 March 2017
  • By: warren

(Note: This is the second part of a series of posts that were based on several conversations with lawyers and executives about AI, the nature of technology and its application to business problems. The first part is here.)

At the heart of it, AI is about asking the following question: "Can I use the computer to make decisions that would normally require a human being?" Of course, the obvious answer is yes; human beings make all sorts of decisions all day long, ranging from the complex to the mundane. Accounting and operations systems have been making decisions for human beings for years, from calculating credit scores and interest rates to determining the best time to order feedstock.

Let's use a simple example to explain the difference. Take the plot on the right-hand side, where I have orange dots and green dots. With basic statistical methods (linear regression was invented in the early 1900s[1]), we can create a simple classifier that will separate the green from the orange simply by drawing a line through the graph. It's not perfect, some oranges are misclassified as green and vice versa, but we do very well with a really simple method. We can do better using more sophisticated mathematical techniques, see the non-linear method on the left, but fundamentally the problem remains simple: telling the Granny Smith apples from the oranges.
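For the curious, "drawing a line through the graph" can be sketched in a few lines of Python. The points and labels below are made-up stand-ins for the dots in the plot, and the learning rule is a plain perceptron rather than the regression mentioned above; it is an illustration, not a recipe.

```python
# Two made-up clusters of 2-D points standing in for the plot's dots.
green  = [(2.0, 2.5), (3.0, 3.0), (2.5, 3.5)]   # the "green dots"
orange = [(0.5, 1.0), (1.0, 0.5), (0.0, 1.5)]   # the "orange dots"

data = [(x, y, 1) for x, y in green] + [(x, y, -1) for x, y in orange]

# Learn a separating line w1*x + w2*y + b = 0 with the perceptron rule:
# whenever a point falls on the wrong side, nudge the line toward it.
w1 = w2 = b = 0.0
for _ in range(100):
    for x, y, label in data:
        if label * (w1 * x + w2 * y + b) <= 0:   # misclassified
            w1 += label * x
            w2 += label * y
            b  += label

def classify(x, y):
    """Which side of the learned line does the point fall on?"""
    return "green" if w1 * x + w2 * y + b > 0 else "orange"
```

Because the two clusters here are linearly separable, the loop settles on a line that classifies every training point correctly; real data, like the plot's misclassified dots, rarely cooperates so neatly.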

The problem is simple to solve because it is well defined. The objective is clear, the definition of success is clear (keep both sets of dots separated) and the way to tell them apart is by their colour. The only thing that remains is applying the recipe that matches the problem, a linear regression in this case, to solve the problem. Loosely speaking, there is no intelligence needed because the problem defines its own process to a solution.

Similarly, let's take another toy problem: tic-tac-toe. Every school child, even those who don't eventually end up working in AI, learns that the game can be tied or won by the first player. The second player can always force a tie, but can never win the game. All of us learn this by playing the game repeatedly while young: over time, children explore the set of possible game layouts in tic-tac-toe and eventually learn that there is a finite number of starting X's and O's that can lead to victory or loss.

Very roughly, there are over 360,000 possible move sequences in tic-tac-toe. One basic machine learning method for learning to play tic-tac-toe is brute force: try every single move and counter-move until every game is enumerated, and then only choose moves that lead to winning positions. Obviously, children can't keep track of 360,000 tic-tac-toe boards at the same time, and while effective, the method does not scale well even for computers (your desktop computer can't store the 10^47 possible combinations of chess). Therefore, children learn to take shortcuts and reduce those 360,000 sequences to a small set of opening moves that ensures they never lose the game. That is intelligence: no one taught them the process required to find the solution, they just did. Artificial Intelligence is the science of creating algorithms that can do the same for certain classes of problems.
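The brute-force enumeration described above fits in a short sketch: recursively play every move and counter-move, then back up who wins under best play. This is a minimal minimax, illustrative rather than production code, and it confirms the schoolyard result that perfect play from the empty board is a draw.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """+1 if X wins under best play, -1 if O wins, 0 for a draw."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if "." not in board:          # board full: draw
        return 0
    nxt = "O" if player == "X" else "X"
    scores = [value(board[:i] + player + board[i + 1:], nxt)
              for i, cell in enumerate(board) if cell == "."]
    return max(scores) if player == "X" else min(scores)

# With both sides playing perfectly, the empty board is a draw:
# value("." * 9, "X") == 0
```

The cache is what keeps the enumeration tractable; without it, the recursion revisits the same boards many times over.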

We say "classes of problems" because we only have the computing power and the programming know-how to handle limited problems, like recognizing a person, a text or a musical genre. This is different from what people sometimes refer to as "True AI", which is the human-like machine that walks and talks (and invariably tries to take over the world) on television shows. In my opinion, you are unlikely to have the Terminator do your filing for you anytime in the near future.

However, for specific types of problems, AI works very well: classification, clustering, searching, reduction, etc. In turn, this means that most of the work that goes into implementing an AI engine is actually trying to match very simple mathematical solutions to complex business problems. Going back to our first example of apples and oranges, the problem was delivered on a plate: colour and position. The solution needs more thinking when the problem isn't so well defined. In some cases, we know that some of the dots are different, but not why or which ones (e.g. outlier detection). In others, the objective may be to "group the similar dots together" without having any idea of what makes the dots similar (e.g. market segmentation).

Machine Learning vs Artificial Intelligence

In the vernacular, the terms Machine Learning and Artificial Intelligence are sometimes used interchangeably, though they refer to different things. Artificial Intelligence is the catch-all phrase for the various computational techniques that have an intelligence component to them, irrespective of the flexibility or adaptability of the method. Machine Learning refers to methods that are capable of learning from the data themselves, without having their decision model encoded by a human being.

Take a program that can play tic-tac-toe. It clearly has an intelligence component in order to function, but the software will not learn from its interaction with the user (the problem is simple enough that there is no point in doing so). But a program that recognizes cats in videos needs a machine-driven learning component in order for it to learn what a cat looks like.

Classifying And Finding Things With Artificial Intelligence

The figure below represents a very simplified block diagram of the AI process for classification, read from left to right. We have a data set that we want processed, which can be documents, images, songs, video, etc. In practice, not everything within that data set is relevant to the context of what we are trying to do, and so we transform each document into a set of features that we think are valuable for solving our problem. A feature might be a specific word in a document; another might be the word's part of speech (verb, noun, adverb, etc.) or a typographical aspect such as the word being underlined.
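As a small illustration of that transformation step, here is what turning a document into a feature set might look like. The feature names and the choice of features are arbitrary stand-ins for the example, not a standard scheme.

```python
def extract_features(document):
    """Turn a text document into a dictionary of illustrative features."""
    features = {}
    words = document.split()
    for word in words:
        token = word.strip(".,!?").lower()
        features[f"word={token}"] = True            # the word itself
        if word.strip(".,!?").isupper():
            features["has_uppercase_word"] = True   # a typographic feature
    features["doc_length"] = len(words)             # a simple numeric feature
    return features

# extract_features("The cat SAT.") yields, among others,
# "word=cat" and "has_uppercase_word".
```

The algorithm downstream never sees the original document, only this dictionary, which is why the choice of features matters so much.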

Since the feature set explicitly determines what part of the data set the algorithm will actually look at, feature generation is an extremely important part of the Artificial Intelligence process. It has spawned its own field of study, Feature Engineering, and at times some have insisted that Artificial Intelligence is just carefully crafted Feature Engineering. In practice, many engines have enough computational resources that they will simply generate every possible feature from the input data and let the algorithm choose the features that are most promising (the "throw things at the wall and see what sticks" approach). It's wasteful, but computing time has become much cheaper than the people time required to create an efficient design.

The algorithm is the brain of AI, which is ironic in that the algorithm itself is usually very simple and generic; the algorithm that flies a drone might be the same one that keeps your phone camera images from being blurry. However, the devil is in the details, and the implementation of the algorithm is usually not portable from the phone to the drone. Examples of algorithms are k-means, C4.5 or Okapi BM25, performing the tasks of clustering, rule generation and information retrieval respectively. As part of the process, the algorithm will take in the features and select the most promising. As part of that selection, some external information, such as a trained model or parameters, might be provided to the algorithm to guide its decision making.
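To show how simple and generic such an algorithm can be, here is a bare-bones sketch of k-means in plain Python. The data and the fixed starting centroids are made up for reproducibility; a real implementation would handle initialization, restarts and convergence checks more carefully.

```python
def kmeans(points, centroids, iterations=20):
    """Cluster 2-D points around the given number of centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                     for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two obvious groups of made-up points; k-means recovers them.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (5, 5)])
```

The whole method is two alternating steps; everything difficult about using it in practice (choosing k, choosing features, choosing a distance) lives outside these twenty lines.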

The results are then checked against a benchmark, sometimes called a gold standard, to ensure that the system is doing what it is supposed to do. If the results aren't exactly what is required, the model or the parameters of the algorithm might be changed.
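That benchmark check can be as simple as an agreement count. A minimal sketch, with made-up labels standing in for a real gold standard:

```python
def accuracy(predicted, gold):
    """Fraction of predictions that agree with the gold-standard labels."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Three of four labels agree with the benchmark.
score = accuracy(["cat", "dog", "cat", "dog"],
                 ["cat", "dog", "dog", "dog"])
# score == 0.75
```

If the score falls below target, one adjusts the model or parameters and re-runs, which is the feedback loop in the block diagram.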

Overall, the basics of AI aren't that complex, but their implementation and arrangement need to be focused on the objectives of the project; otherwise one gets into the loop of "garbage in, garbage out". Depending on the case, the tuning of parameters can be frustrating, and model generation becomes an art rather than a science. There are many different frameworks, libraries and code bases, available both freely and commercially, to experiment with, which I encourage you to do.

Next: Part III - Algorithmic Bias is good for you.

AI and the Law (Part 1)

  • Posted on: 30 January 2017
  • By: warren

(Note: This is the first part of a series of posts that were based on several conversations with lawyers and executives about AI, the nature of technology and its application to business problems.)

What is Artificial Intelligence?

Does it really represent an improvement over what we already have? An entirely new class of solutions to ongoing problems? Or the flavour of the week in a market that is overwhelmed with buzzwords?

Skepticism is endemic to technology culture, whether in industry, government or academia. It's a byproduct of working in an area whose foundation is innovation and ideas. When it costs significantly less to say that you have something than to actually get it to work, a "show me" attitude is necessary. Ironically, IT has so far been primarily about what we would call classical Management Information Systems. The software may be really slick, the hardware may be really fast, and we can store a lot of data, but most of what the industry has focused on so far is simply replacing physical forms and paperwork with their electronic equivalents: tabulating ledgers for accounting, generating reports, and sending checks and invoices over the Internet instead of in paper form. These are boring, unglamorous tasks, but they have been IT's big success: taking things that were mundane, repetitive cost centres and streamlining them with technology.

And now, we have Artificial Intelligence.

Image of a copper engraving from Karl Gottlieb von Windisch's 1783 book Briefe über den Schachspieler des Hrn. von Kempelen, nebst drei Kupferstichen die diese berühmte Maschine vorstellen.

The underlying idea that a machine could replace a human being in making decisions isn't all that new. One of the better-known historical exhibits (and frauds) is The Turk, a mechanical automaton that would play chess against a person. Of course, playing chess was asking a lot of simple clockwork mechanisms, and the builder had constructed a false compartment in which a human player would hide and move the mannequin using a system of pulleys and cams. This may have been the first vapourware product ever, but the idea that a machine could perform tasks at a human level had taken root.

What we'll call modern Artificial Intelligence appeared in the mid-1950s, when scientists began to look at ways that elements of human cognition could be modelled using mathematics. That in itself wasn't novel; humankind had moved on from the abacus. What they were aiming at were higher cognitive functions, like learning from examples and extrapolating solutions to problems that the machine had never seen before.

In this series of blog posts, I will review the basics of AI and outline its application to practical business problems. As with many technologies, it has had its false starts; the causes of the AI Winter periods will be reviewed, which in turn will give a sense of why the field is making such a resurgence.

Next: Part II - How AI Works.

Presentation at Derby University: Artificial Intelligence and the Law

  • Posted on: 21 November 2016
  • By: warren
Artificial Intelligence and the Law
Wednesday, 30 November 2016 at 5.30 PM

 

This talk is dedicated to the memory of David F. Evans.

 

After years of over-promising and under-delivering, and two so-called “winters”, Artificial Intelligence is aggressively making inroads in the legal and financial markets. Affordable computational power, bandwidth, and storage each played a part in this revival, but adoption is really the result of targeted products that do what machine learning is historically excellent at: domain-specific applications. Concerns over lost employment, reduced professional prestige, and the accountability of such approaches (Do algorithms behave ethically? Should they be regulated?) remain a hot topic for the layman and the practitioner alike as the technology finds new niches to occupy.
This talk is about some lessons learned about the application of AI to legal document review and some of the unexpected corner-cases found along the way, including creative system training methods, tackling entrenched cultural beliefs, and the thorny (and multifaceted) issues of information security.
Working on these problems has highlighted the value of human judgement, the limits of computation, and how scaling problems affect both people and machines. More than ever, sound theory, good mathematics, and a well-rounded approach are needed to tackle an increasingly complex and globalized world.

Presentation to the Security Group, University of Kent

  • Posted on: 15 November 2016
  • By: warren
Processing legal documents with AI: Notes from the field.
Wednesday, 23 November 2016, 5PM
 
In a world where the marginal cost of copying information is nil, the marginal cost of storing information is nil, and the marginal cost of transmitting information is nil, the opportunities for data disasters are numerous and their consequences are devastating to individuals and corporations. Even well-intentioned gestures such as AOL's release of anonymized query logs have resulted in privacy violations in ways not immediately obvious to the principals.
 
Interestingly, portability and interoperability are now working against us: in the world of single sign-on, once you have access to the user you have access to all of their systems, and the first leak tends to be the last leak. This talk will be about the security of AI engines in contract analysis and the unexpected and counterintuitive lessons learned along the way. Security auditors want audit logs; forensics-aware systems people want traceability; privacy advocates want minimal information kept; developers want ease of maintenance and debugging; and users want to be minimally impacted. Not all of these views are necessarily compatible, and some interesting conflicts arise in the creation and operation of applications.
 
In the end, the creation of online apps that handle sensitive information and preserve privacy and security, while enabling distributed teams to collaborate, requires much more than following best practices. It rests on a culture of security within the organization.

Presentation at the Canadian Linked Data Summit: Operationalizing Linked Open Data

  • Posted on: 12 October 2016
  • By: warren
Operationalizing Linked Open Data
Venue: University of Montreal, 3200 Jean-Brillant, Room B-2245
Monday, October 24th 2016, 14:10 - 14:30
 
This talk summarizes the combined experiences of the Muninn Project and the Canadian Writing Research Collaboratory in operating large linked open data projects. Topics will include best operating practices, known pitfalls, and realizing the promise of the Semantic Web for researchers.
 
Presentation slides in English and in French.

Presentation at Museums on the Web 2015

  • Posted on: 9 March 2015
  • By: warren
Palmer House Hilton, Chicago, IL, USA
April 8-11, 2015, 10:30am - 12:00pm
Grand Ballroom (4F) 
Joint work with David Evans, Minsi Chen, Mark Farrell and Daniel Mayles.
We review the possibilities, pitfalls, and promises of recreating lost heritage sites and historical events using augmented reality and "Big Data" archival databases. We define augmented reality as any means of adding context or content, via audio/visual means, to the current physical space of a visitor to a museum or outdoor site. Examples range from simple prerecorded audio to graphics rendered in real time and displayed using a smartphone.
Previous work has focused on complex multimedia museum guides, whose utility remains to be evaluated as enabling or distracting. We propose the use of a data-driven approach where the exhibits' augmentation is not static but dynamically generated from the totality of the data known about the location, artifacts, or event. For example, at Bletchley Park, reenacted audio conversations are played within rooms as visitors walk through them. These can be called "virtual contents," as the audio recordings are manufactured. Given that a number of documentary sources, such as meeting minutes, are available concerning the events that occurred within the site, a dynamic computer-generated script could add to the exhibits.
Visitors' experiences can therefore react to their movements, provide a different experience each time, and be factually correct without requiring any expensive redesign. Furthermore, the use of a data-driven approach allows for the updating of exhibits on the fly as researchers create or curate new data sources within the museum. If artifacts need to be removed from an exhibit, pictures, descriptions, or three-dimensional printed copies can be substituted, and the augmented reality of visitor experience can adapt accordingly.

Presentation at the Department of History and Classics, Acadia University

  • Posted on: 5 March 2015
  • By: warren


Mapping the Western Front: the British and German experiences
March 26th, 7pm, BAC241
The static nature and scale of the battles on the Western Front were unwelcome to both the Entente and Central powers during the Great War. Faced with logistical requirements on an unprecedented scale, standardized maps of the battlefield had to be produced quickly, at different scales, for both tactical and strategic purposes. This was a minor revolution in military thinking: previously, cavalry officers were expected to ride with a sketch-board to map out terrain and enemy positions for their commanders.
In this talk I will contrast the Entente and Central efforts at mapping battlefields, highlight the differences in the approaches they took, and present evidence about local military intelligence activities. Both the British and German coordinate systems will be explained, as well as how to geo-reference these maps in modern mapping software.

Code Event: How to open a coffee shop in Halifax in 3 minutes

  • Posted on: 28 January 2015
  • By: warren


How to open a coffee shop in Halifax in 3 minutes
Code Event Halifax
Theater B, Tupper Building, Dalhousie University
January 30th, 2015, 7PM
 
In this talk I will demonstrate the power and value of open data by showing how an entrepreneur can choose a location for a new coffee shop in Halifax using datasets available on the Halifax and Canada Open Data Portals. Specifically, we will locate an available rental space for the coffee shop while keeping in mind the locations of potential competitors and customers.
 
Slides are available here.
 
