Presenters:
Wlodek Zadrozny, Walid Shalaby and Sean Gallagher, Computer Science Department
We will give three short talks.
Walid Shalaby will present joint work with Wlodek Zadrozny and Sean Gallagher (all UNCC)
"Knowledge Based Dimensionality Reduction for Technical Text Mining" (to be presented at IEEE Complexity for Big Data in Washington DC on Oct 27) Abstract-In this paper we propose a novel technique for dimensionality reduction using freely available online knowledge bases. The complexity of our method is linearly proportional to the size of the full feature set, making it applicable efficiently to huge and complex datasets. We demonstrate this approach by investigating its effectiveness on patent data, the largest free technical text. We report empirical results on classification of the CLEF-IP 2010 dataset using bigram features supported by mentions in Wikipedia, Wiktionary, and GoogleBooks knowledge bases. We achieve a 13-fold reduction in number of bigrams features and a 1.7% increase in classification accuracy over the unigrams baseline. These results give concrete evidence that significant accuracy improvements and massive reduction in dimensionality could be achieved using our approach, hence help alleviating the tradeoff between task complexity and accuracy.
Sean Gallagher will talk about
"Simulating IBM Watson in the Classroom" (joint work with Wlodek Zadrozny, Walid Shalaby and others) [under review]
ABSTRACT: IBM Watson exemplifies multiple innovations in natural language processing and question answering. In addition, Watson uses most of the known techniques in these two domains, as well as many methods from related domains. Hence, there is pedagogical value in a rigorous understanding of how it works. The paper describes a text analytics course, conducted in Spring 2014, focused on building a simulator of IBM Watson. We believe this is the first time a simulation containing all the major Watson components was created in a university classroom. The system achieved a respectable accuracy of close to 20% on Jeopardy! questions, and many known and new avenues for improving performance remain to be explored. The code and documentation are available on GitHub. The paper is a joint effort of the teacher and some of the students who led the teams implementing the component technologies, and who were therefore deeply involved in making the class successful.
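To make "all the major Watson components" concrete: DeepQA-style systems chain question analysis, candidate answer generation, evidence scoring, and final ranking. The Python skeleton below is only an illustrative sketch of that chain, not the class's simulator (whose actual code is on GitHub); every function, candidate list, and score in it is a hypothetical toy.

    def analyze_question(question):
        """Question analysis: guess a lexical answer type (LAT)."""
        return "PERSON" if " who " in " " + question.lower() + " " else "THING"

    def generate_candidates(question):
        """Candidate generation: real systems search large corpora;
        here, a canned list for one Jeopardy!-style clue."""
        return ["Gottfried Leibniz", "Isaac Newton", "calculus"]

    def score_candidate(candidate, question, lat):
        """Evidence scoring: real systems merge dozens of scorers;
        this toy combines a type check with word overlap."""
        is_person_like = " " in candidate  # crude multi-word-name heuristic
        type_score = 1.0 if (lat == "PERSON") == is_person_like else 0.0
        overlap = len(set(candidate.lower().split()) & set(question.lower().split()))
        return type_score + 0.1 * overlap

    def answer(question):
        """Merging and ranking: return the highest-scoring candidate."""
        lat = analyze_question(question)
        return max(generate_candidates(question),
                   key=lambda c: score_candidate(c, question, lat))

    print(answer("He is the man who built the stepped reckoner"))
    # -> 'Gottfried Leibniz' (ties among person candidates go to list order)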
Wlodek Zadrozny will talk about
"Explaining Watson: Polymath Style" (joint work with Larry Moss, Indiana University, and Valeria de Paiva, Nuance) [to be submitted to AAAI 2015]
Abstract: Our paper makes two contributions in one. First, we argue that IBM's Jeopardy!-playing machine needs a formal semantics; we present several arguments as we discuss the system, and we situate the work in the broader context of contemporary AI. Our second point is that work in this area might well be done as a broad collaborative project: we propose to organize a polymath-style effort aimed at developing formal tools for the study of state-of-the-art question answering systems, and possibly of other large-scale NLP efforts whose architectures and algorithms lack a theoretical foundation.