Click on the titles to view the abstracts.
| Date | Speaker | Title |
| 07 Jun 2013 | Malte Nuhn (Aachen University, Germany) |
Is Decipherment Difficult?
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Is it possible to learn useful translations from large amounts of monolingual data to improve machine translation? The intuitive feeling is that learning a language without bilingual data is at least "more difficult than learning from example translations". In this talk, I will present recent results on decipherment: I will show that the decipherment problem is indeed difficult (NP-hard) and what approximations to the original problem can be made without hurting decipherment accuracy much.
|
| 17 May 2013 | Qing Dou |
Deciphering Gigaword
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: State-of-the-art machine translation systems learn translation rules from large amounts of parallel data (pairs of sentences that are translations of each other). Unfortunately, the amount of parallel data is very limited for many languages and domains. In general, it is easier to obtain monolingual data. Is it possible to learn useful translations from large amounts of monolingual data to improve machine translation when the amount of parallel data is limited? In this talk, I will present my ongoing work that applies decipherment techniques to decipher hundreds of millions of Spanish news texts into English and learns a translation lexicon from the decipherment to improve a translation model learned from limited parallel data. |
| 03 May 2013 | Dirk Hovy |
Learning Semantic Types and Relations from Text (Defense Practice Talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: NLP applications such as Question Answering (QA), Information Extraction (IE), or Machine Translation (MT) are incorporating increasing amounts of semantic information. A fundamental building block of semantic information is the relation between a predicate and its arguments, e.g. eat(John,burger). In order to reason at higher levels of abstraction, it is useful to group relation instances according to the types of their predicates and the types of their arguments. For example, while eat(Mary,burger) and devour(John,tofu) are two distinct relation instances, they share the underlying predicate and argument types INGEST(PERSON,FOOD). A central question is: where do the types and relations come from? The subfield of NLP concerned with this is relation extraction, which comprises two main tasks: 1. identifying and extracting relation instances from text, and 2. determining the types of their predicates and arguments. The first task is difficult for several reasons. Relations can express their predicate explicitly or implicitly. Furthermore, their elements can be far apart, with unrelated words intervening. In this thesis, we restrict ourselves to relations that are explicitly expressed between syntactically related words. We harvest the relation instances from dependency parses. The second task is the central focus of this thesis. Specifically, we will address these three problems: 1) determining argument types, 2) determining predicate types, and 3) jointly determining argument and predicate types. For each task, we model predicate and argument types as latent variables in hidden Markov models. Depending on the type system available for each of these tasks, our approaches range from unsupervised to semi-supervised to fully supervised training methods.
The central contributions of this thesis are as follows:
1. Learning argument types (unsupervised): We present a novel approach that learns the type system along with the relation candidates when neither is given. In contrast to previous work on unsupervised relation extraction, it produces human-interpretable types rather than clusters. We also investigate its applicability to downstream tasks such as knowledge base population and construction of ontological structures. An auxiliary contribution, born from the necessity to evaluate the quality of human subjects, is MACE (Multi-Annotator Competence Estimation), a tool that helps estimate both annotator competence and the most likely answer.
2. Learning predicate types (unsupervised and supervised): Relations are ubiquitous in language, and many problems can be modeled as relation problems. We demonstrate this on a common NLP task, word sense disambiguation (WSD) for prepositions (PSD). We use selectional constraints between the preposition and its argument in order to determine the sense of the preposition. In contrast, previous approaches to PSD used n-gram context windows that do not capture the relation structure. We improve on the supervised state of the art for two type systems.
3. Argument types and predicate types (semi-supervised): Previously, there was no work on jointly learning argument and predicate types because (as with many joint learning tasks) there is no jointly annotated data available. Instead, we have two partially annotated data sets, using two disjoint type systems: one with type annotations for the predicates, and one with type annotations for the arguments. We present a semi-supervised approach to jointly learn argument types and predicate types, and demonstrate it for jointly solving PSD and supersense-tagging of their arguments. To the best of our knowledge, we are the first to address this joint learning task.
Our work opens up interesting avenues for both the typing of existing large collections of triple stores, using all available information, and for WSD of various word classes.
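The grouping of relation instances by predicate and argument types described in the abstract can be illustrated with a minimal Python sketch. It only shows what a typed relation looks like: the two lookup tables and the function names are hypothetical and hand-written here, whereas the thesis learns the types as latent variables.

```python
from collections import defaultdict

# Hypothetical, hand-written type lexicons used only for illustration.
# In the thesis these types are learned, not listed by hand.
ARGUMENT_TYPES = {"Mary": "PERSON", "John": "PERSON", "burger": "FOOD", "tofu": "FOOD"}
PREDICATE_TYPES = {"eat": "INGEST", "devour": "INGEST"}


def type_signature(predicate, args):
    """Map a relation instance such as eat(Mary, burger) to its typed form."""
    return (PREDICATE_TYPES[predicate], tuple(ARGUMENT_TYPES[a] for a in args))


def group_by_type(instances):
    """Collect relation instances that share the same typed signature."""
    groups = defaultdict(list)
    for predicate, args in instances:
        groups[type_signature(predicate, args)].append((predicate, args))
    return dict(groups)


instances = [("eat", ("Mary", "burger")), ("devour", ("John", "tofu"))]
groups = group_by_type(instances)
```

Both instances fall under the single signature INGEST(PERSON,FOOD), which is the abstraction the abstract uses as its example.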
|
| 12 Apr 2013 | Hui Zhang |
Beyond Left-to-Right: Multiple Decomposition Structures for SMT
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Standard phrase-based translation models do not explicitly model context dependence between translation units. As a result, they rely on large phrase pairs and target language models to recover contextual effects in translation. In this work, we explore language models over Minimal Translation Units (MTUs) to explicitly capture contextual dependencies across phrase boundaries in the channel model. As there is no single best direction in which contextual information should flow, we explore multiple decomposition structures as well as dynamic bidirectional decomposition. The resulting models are evaluated in an intrinsic task of lexical selection for MT as well as in a full MT system, through n-best re-ranking. These experiments demonstrate that additional contextual modeling does indeed benefit a phrase-based system (by up to 2.8 BLEU points) and that the direction of conditioning is important. Integrating multiple conditioning orders provides consistent benefit, and the most important directions differ by language pair. |
| 05 Apr 2013 | Abe Kazemzadeh |
Sentiment and Sarcasm in the 2012 US Presidential Election
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Political discourse is challenging from a sentiment analysis point of view because political issues are subjective and highly dynamic. Political language may contain neologisms that do not occur frequently in general-purpose lexical sentiment models. Also, the presence of humor, sarcasm, and comparatives may introduce errors in sentiment analysis. In Twitter, these issues are amplified by the use of Twitter-specific features and constrained message lengths. In this presentation, we will present a collaborative project between the University of Southern California (USC) Signal Analysis and Interpretation Laboratory, USC Annenberg Innovation Laboratory, and IBM. Our system relies on manual curation of keywords and hashtags, crowd-sourced annotation, statistical machine-learned sentiment models, and a real-time visualization that is ideal for display during live events. We describe our corpus and several experiments using different settings of our sentiment models. Among our findings are that sentiment in politics is skewed towards negative, that annotation agreement tends to be low, and that sarcasm is a factor that explains some of the annotator disagreement. We have also studied bigger-picture questions such as how much weight tweets by Big Bird (or someone pretending to be Big Bird) should be given in reporting the results of sentiment analysis. Questions about the role of humor and sarcasm in social media lead to some skepticism of naive applications of sentiment analysis but present interesting examples of content that influences social media user behavior and spills over into traditional media.
This is joint work with Dogan Can, Nikos Malandrakis, Hao Wang, Alex Leavitt, Kevin Driscoll, Kristen Guth, Theo Mazumdar, Varun Lingaraju, Sagar Jhobalia, Mellisa Loudon, Shrikanth Narayanan, François Bar, Kjerstin Thorson, Mike Ananny, Sam Thomson, Ed Elze, Graham Mackintosh, Robert Uleman, Leon Katsnelson, and Chris Gruber.
|
| 18 Mar 2013 | Carlo Strapparava |
Computational explorations of creative language
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Dealing with creative language, and in particular with affective, persuasive and even humorous language, has often been considered outside the scope of computational linguistics. Nonetheless, it is possible to use current NLP techniques to begin exploring it. We briefly review some computational work on these typical creative genres. We will start by introducing techniques for dealing with emotional and witty language. Then we will talk about the exploitation of some extra-linguistic features: for example, music and lyrics in emotion detection, and an audience-reaction tagged corpus of political speeches for the analysis of persuasive language. As examples of practical applications, we will present a system that automates memory techniques for vocabulary acquisition in a second language, and an application that automates creative naming (branding).
Bio: Carlo Strapparava is a senior researcher at FBK-irst (Fondazione Bruno Kessler - Istituto per la ricerca scientifica e Tecnologica) in the Human Language Technologies Unit. His research activity covers artificial intelligence, natural language processing, intelligent interfaces, human-computer interaction, cognitive science, knowledge-based systems, user models, adaptive hypermedia, lexical knowledge bases, word-sense disambiguation, affective computing and computational humour. He is the author of over 150 papers, published in scientific journals, book chapters and in conference proceedings. He also played a key role in the definition and the development of many projects funded by European research programmes.
He regularly serves on the program committees of the major NLP conferences (ACL, EMNLP, etc.). He was an executive board member of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics (2007-2010), and a member of the Senseval (Evaluation Exercises for the Semantic Analysis of Text) organisation committee (2005-2010).
In June 2011, he received a Google Research Award on Natural Language Processing, specifically on the computational treatment of creative language.
|
| 08 Mar 2013 | Sujith Ravi |
Scalable Unsupervised Learning for Natural Language Processing
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Natural language processing (NLP) tools have become ubiquitous for data analysis in digital environments such as the Web and social media. Popular applications include tools for clustering, sequence labeling, and machine translation, to name a few. Unfortunately, the majority of existing toolkits rely on supervised learning to train models using labeled data. This poses several challenges: labeled data is not readily available in all languages or domains, and building an NLP system from scratch for a new domain (or language, user, etc.) requires significant human effort, which is both time-consuming and expensive. Moreover, scaling this strategy on the Web is infeasible. Recent advances in unsupervised algorithms have demonstrated promising results on several NLP tasks without using any labeled data. But despite their utility, scalable unsupervised algorithms rarely provide probabilistic representations of the data, which can be useful for predicting on unseen data or for integration as components within a larger model or pipeline. In addition, these methods often favor simple model descriptions (e.g., the k-means algorithm for clustering) over rich statistical models. This leads to the problem of rapidly diminishing returns when applying these methods to increasing amounts of data. Instead, we need to design algorithms that can scale elegantly to large data as well as complex models. In this talk, I will present our recent work on scalable probabilistic learning with Bayesian inference. We show a novel algorithm for fitting mixtures of exponential families, which generalizes several models that are typically used in NLP and other areas. A major contribution of our work is a novel sampling method that uses locality sensitive hashing to achieve high throughput in generating proposals during sampling.
Using "clustering" as an example application, I will describe our approach and show that it scales elegantly to large numbers of clusters achieving a speedup of several orders of magnitude over existing toolkits, while maintaining high clustering quality. In addition, we also prove probabilistic error guarantees for the new sampling algorithm. This is joint work with Amr Ahmed and Alex Smola. Lastly, I will briefly mention some ongoing work on large-scale unsupervised learning for other NLP applications such as machine translation. Bio: Sujith Ravi is a Research Scientist at Google. He completed his PhD at University of Southern California/Information Sciences Institute and joined Yahoo! Research, Santa Clara as a Research Scientist before joining Google, Mountain View in 2012. His main research interests span various problems and theory related to the fields of Natural Language Processing (NLP) and Machine Learning. He is specifically interested in large-scale unsupervised and semi-supervised methods and their applications to structured prediction problems in NLP, information extraction, user modeling in social media, graph optimization algorithms for summarizing noisy data, computational decipherment and computational advertising. His work has been reported in several magazines such as New Scientist, ACM TechNews, etc. For more information, you can visit his personal page (http://www.sravi.org).
|
| 22 Feb 2013 | Louis-Philippe Morency |
Modeling Human Communication Dynamics: From Depression Assessment to Multimodal Sentiment Analysis
Time: 3:00 pm - 4:00 pm Location: 6th Floor Conference Room [689] Abstract: Human face-to-face communication is a little like a dance, in that participants continuously adjust their behaviors based on verbal and nonverbal displays and signals. Human interpersonal behaviors have long been studied in linguistics, communication, sociology and psychology. Recent advances in machine learning, pattern recognition and signal processing have enabled a new generation of computational tools to analyze, recognize and predict human communication behaviors during social interactions. This new research direction has broad applicability, including the improvement of human behavior recognition, the synthesis of natural animations for robots and virtual humans, the development of intelligent tutoring systems, and the diagnosis of social disorders (e.g., autism spectrum disorder). In this talk, I will present some of our recent work modeling multiple aspects of human communication dynamics, including behavioral dynamics, multimodal dynamics and interpersonal dynamics. I will describe the different computational models specifically designed to model these dynamics, including the Latent-Dynamic Conditional Random Fields, Multi-view Hidden Conditional Random Fields and the Latent Mixture of Discriminative Experts. I will show how these technologies can be applied to real-world problems such as negotiation outcome prediction, YouTube opinion mining, group learning analytics and psychological distress indicators. Finally, I will summarize our recent progress in integrating these sensing technologies with a virtual human for healthcare applications. Bio:
Louis-Philippe Morency is a Research Assistant Professor in the Department of Computer Science at the University of Southern California (USC) and a Research Scientist at the USC Institute for Creative Technologies, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He received his Ph.D. and Master's degrees from the MIT Computer Science and Artificial Intelligence Laboratory. His research interests are in the computational study of nonverbal social communication, a multi-disciplinary research topic that overlays the fields of multimodal interaction, computer vision, machine learning, social psychology and artificial intelligence. Dr. Morency was selected in 2008 by IEEE Intelligent Systems as one of the Ten to Watch for the future of AI research. He received 6 best paper awards in multiple ACM- and IEEE-sponsored conferences for his work on context-based gesture recognition, multimodal probabilistic fusion and computational modeling of human communication dynamics. His work was reported in The Economist, New Scientist and Fast Company magazines.
|
| 08 Feb 2013 | Kartik Audhkhasi |
A Computational Framework for Ensembles of Diverse Experts
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Ensembles of machine experts, from simple linear classifiers to complex hidden Markov models, have out-performed single experts across many applications. Likewise, ensembles have been central to computing with human experts, e.g. for data annotation. This widespread use of ensembles, albeit largely heuristic, is motivated by their better generalization and robustness to ambiguity in the production, representation, and processing of information. This talk will focus on three important problems which contribute towards a unified computational framework for ensembles of diverse experts. The first problem deals with "modeling" a diverse ensemble. I will present our proposed Globally-Variant Locally-Constant (GVLC) model as a statistical framework for answering this question. The second question is about "analysis", where I will address the link between ensemble diversity and performance using statistical learning theory. The final segment of my talk will focus on "designing" an ensemble of diverse linear classifiers, specifically conditional maximum entropy models. Practical applications throughout the talk will include emotion classification from speech, text classification, and crowd-sourcing for automatic speech recognition.
Speaker bio: Kartik Audhkhasi received B.Tech. in Electrical Engineering and M.Tech. in Information and Communication Technology from Indian Institute of Technology, Delhi in 2008. He is currently pursuing the Ph.D. degree in Electrical Engineering from University of Southern California, Los Angeles. His thesis research focuses on modeling, analysis, and design of ensembles of multiple human or machine experts. He is also interested in crowd-sourcing for speech and language processing. His broad interests include machine learning and signal processing. Kartik is the recipient of the Annenberg, IBM, and Ming Hsieh Institute PhD fellowships, and best teaching assistant awards of the EE department at USC.
|
| 01 Feb 2013 | Abeer Alwan |
Dealing with Limited and Noisy Data in Speech Processing: A Hybrid Knowledge-Based and Statistical Approach
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: In this talk, I will focus on the importance of integrating knowledge of human speech production and speech perception mechanisms, and language-specific information with statistically-based, data-driven approaches to develop robust and scalable speech processing algorithms. The need for such hybrid systems is especially critical when dealing with data corrupted by background acoustic noise, when training data are limited, and when dealing with accents. |
| 25 Jan 2013 | Daniel Marcu |
The Things I Learned While Doing Research in the Commercial World
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: When asked, as a PhD student, what I wanted to do when I grew up, I had one and only one answer: academic-oriented natural language processing research. During the last decade, though, I have also learned to appreciate the research opportunities in the commercial world. In this talk, I will compare several academic and commercial research models and ground the comparison in examples derived from my own experience while working as a researcher for USC, Language Weaver, and SDL.
|
| 24 Jan 2013 | Shrikanth Narayanan |
Behavioral Signal Processing: Deriving Human Behavioral Informatics from Multimodal Signals
Time: 3:00 pm - 4:00 pm Location: 6th Floor Conference Room [689] Abstract: Human behavior is exceedingly complex. Its expression and experience are inherently multimodal, and are characterized by individual and contextual heterogeneity. The confluence of sensing, communication and computing is, however, allowing access to data, in diverse forms and modalities, that is enabling us to understand and model human behavior in ways that were unimaginable even a few years ago. No domain exemplifies these opportunities more than that related to human health and wellbeing. Consider for example the domain of Autism, where crucial diagnostic information comes from manually analyzed audiovisual data of verbal and nonverbal behavior. Behavioral signal processing advances can enable new possibilities not only for gathering data in a variety of settings--from laboratory and clinics to free living conditions--but also for offering computational models to advance evidence-driven theory and practice. This talk will describe our ongoing efforts on Behavioral Signal Processing (BSP)--technology and algorithms for quantitatively and objectively understanding typical, atypical and distressed human behavior--with a specific focus on communicative, affective and social behavior. Using examples drawn from different application domains, the talk will also illustrate Behavioral Informatics applications of these processing techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion. [Work supported by NIH, NSF, DARPA, and ONR].
Biography of the Speaker: Shrikanth (Shri) Narayanan is Andrew J. Viterbi Professor of Engineering at USC, where he is Professor of Electrical Engineering and, jointly, of Computer Science, Linguistics and Psychology. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the Acoustical Society of America, IEEE, and the American Association for the Advancement of Science (AAAS). Shri Narayanan is an Editor for the Computer, Speech and Language Journal and an Associate Editor for the IEEE Transactions on Multimedia, the IEEE Transactions on Affective Computing and the Journal of the Acoustical Society of America, having previously served as an Associate Editor for the IEEE Transactions on Speech and Audio Processing (2000-2004) and the IEEE Signal Processing Magazine (2005-2008). He is a recipient of several honors, including the 2005 and 2009 Best Paper awards from the IEEE Signal Processing Society and serving as its Distinguished Lecturer for 2010-11. With his students, he has received a number of best paper awards and has won the Interspeech Challenges in 2009 (Emotion classification), 2011 (Speaker state classification) and 2012 (Speaker trait classification). He has published over 500 papers and has 13 U.S. patents.
|
| 11 Jan 2013 | Abe Kazemzadeh |
Natural Language Description of Emotion (Ph.D. Thesis Defense Practice Talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This dissertation studies how people describe emotions with language and how computers can simulate this descriptive behavior. Although many non-human animals can express their current emotions as social signals, only humans can communicate about emotions symbolically. This symbolic communication of emotion allows us to talk about emotions that we may not currently be feeling, for example describing emotions that occurred in the past, gossiping about the emotions of others, and reasoning about emotions hypothetically. Another feature of this descriptive behavior is that we talk about emotions as if they were discrete entities, even though we may not always have necessary and sufficient observational cues to distinguish one emotion from another, or even to say what is and is not an emotion. This motivates us to focus on aspects of meaning that are learned primarily through language interaction rather than by observations through the senses. To capture these intuitions about how people describe emotions, we propose the following thesis: natural language descriptions of emotion are definite descriptions that refer to intersubjective theoretical entities. We support our thesis using theoretical, experimental, and computational results. The theoretical arguments use Russell's notion of definite descriptions, Carnap's notion of theoretical entities, and the question-asking period in child language acquisition. The experimental data we collected include dialogs between humans and computers and web-based surveys, both using crowd-sourcing on Amazon Mechanical Turk. The computational models include a dialog agent based on sequential Bayesian belief update within a generalized pushdown automaton, as well as a fuzzy logic model of similarity and subsethood between emotion terms.
For future work, we propose a research agenda that includes a continuation of work on the emotion domain as well as new work on other domains where subjective descriptions are established through natural language communication.
Short Bio: Abe Kazemzadeh is a PhD candidate at the USC Computer Science Dept and a research assistant at the Signal Analysis and Interpretation Laboratory (SAIL). His interests include natural language, logic, emotions, games, and algebra. He is currently the chief technology officer at the USC Annenberg Innovation Laboratory (AIL).
|
| 14 Dec 2012 | Ulf Hermjakob |
Launching Semantics-Based Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: I will present work defining an Abstract Meaning Representation (AMR) (joint work with Kevin Knight et al.) that serves as an intermediate semantic structure when translating between languages such as Chinese and English, as well as automatic and manual annotation efforts to build corpora of AMRs. I will give a demo of our web-based AMR Editor, which is used by dozens of annotators at LDC, SDL/LanguageWeaver (Cluj) and other places. Finally, I will give an overview of our initial end-to-end prototype, with rule extraction (own work), decoding from source language to AMR (work by Yinggong Zhao) and AMR to target language generation (Yang Gao).
|
| 07 Dec 2012 | Shu Cai |
Smatch: an Evaluation Metric for Semantic Feature Structures
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: Feature structures are useful for capturing logical semantic relationships. In this talk, we present smatch, a metric that determines semantic overlap between two semantic feature structures. We give an efficient algorithm to compute the metric, and we show the results of an inter-annotator agreement study. |
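One way to picture "semantic overlap between two feature structures" is the following Python sketch. It rests on assumptions not stated in the abstract: each structure is represented as a set of (relation, source, target) triples, overlap is scored as an F-score over matching triples, and the score is maximized over one-to-one variable mappings. The exhaustive search below is a stand-in for the efficient algorithm the talk mentions, which is not described here, and all function names are hypothetical.

```python
from itertools import permutations


def f_score(matched, n_a, n_b):
    """Harmonic mean of precision (matched / n_a) and recall (matched / n_b)."""
    if matched == 0:
        return 0.0
    precision, recall = matched / n_a, matched / n_b
    return 2 * precision * recall / (precision + recall)


def rename(triples, mapping):
    """Rewrite variable names in (relation, source, target) triples."""
    return {(rel, mapping.get(s, s), mapping.get(t, t)) for rel, s, t in triples}


def overlap_score(triples_a, vars_a, triples_b, vars_b):
    """Best F-score over one-to-one variable mappings, found by brute force."""
    if len(vars_a) > len(vars_b):
        # The F-score is symmetric, so map the smaller variable set into the larger.
        return overlap_score(triples_b, vars_b, triples_a, vars_a)
    best = 0.0
    for image in permutations(vars_b, len(vars_a)):
        mapping = dict(zip(vars_a, image))
        matched = len(rename(triples_a, mapping) & triples_b)
        best = max(best, f_score(matched, len(triples_a), len(triples_b)))
    return best


# Two toy structures; variable names (a, b, x, y, z) are arbitrary labels.
structure_a = {("instance", "a", "want"), ("instance", "b", "boy"), ("arg0", "a", "b")}
structure_b = {
    ("instance", "x", "want"), ("instance", "y", "boy"), ("instance", "z", "go"),
    ("arg0", "x", "y"), ("arg1", "x", "z"),
}
score = overlap_score(structure_a, ["a", "b"], structure_b, ["x", "y", "z"])
```

With the mapping a to x and b to y, all three triples of the first structure match three of the five triples in the second, giving precision 1.0, recall 0.6 and an F-score of about 0.75. The brute-force search grows factorially with the number of variables, so it is only practical for small structures.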
| 16 Nov 2012 | Jerry Hobbs |
Abduction and Metaphor
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: I will talk about recent progress in implementing an efficient method for doing a type of inferencing called abduction, or inference to the best explanation. I will illustrate its wide applicability to a variety of language interpretation problems. I'll describe our recent work on implementing ontologies, or logical theories of commonsense domains. Then I will show how we are applying all this to the interpretation of metaphors. |
| 09 Nov 2012 | Ashish Vaswani and David Chiang |
Neural Networks for NLP
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: Recent years have seen a resurgence of Neural Networks in Natural Language Processing. Much of this success can be attributed to learning compact representations (or embeddings) of words, which are used as input to train standard Neural Network architectures. In the first part of the talk I will describe two approaches for learning word embeddings for large vocabularies. In the second part, I will talk about successful applications of Neural Networks in NLP tasks like Part-Of-Speech tagging, Chunking, Parsing etc. without any feature engineering. I will also describe some preliminary work on Neural Networks for unsupervised Part-Of-Speech tagging. |
| 07 Nov 2012 | Ashish Vaswani and David Chiang |
Neural Networks for NLP
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: Recent years have seen a resurgence of Neural Networks in Natural Language Processing. Much of this success can be attributed to learning compact representations (or embeddings) of words, which are used as input to train standard Neural Network architectures. In the first part of the talk I will describe two approaches for learning word embeddings for large vocabularies. In the second part, I will talk about successful applications of Neural Networks in NLP tasks like Part-Of-Speech tagging, Chunking, Parsing etc. without any feature engineering. I will also describe some preliminary work on Neural Networks for unsupervised Part-Of-Speech tagging. |
| 02 Nov 2012 | Christian Chiarcos |
Linguistic Linked Open Data. Linking Corpora
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: In the last 15 years, the interoperability of language resources has been recognized as a major problem in the development of NLP infrastructures -- partly due to an increased focus on novel, underresourced languages and efforts to bootstrap language resources by annotation projection -- partly due to the increased interest in more abstract levels of linguistic analysis beyond morphosyntax and syntax, namely semantics, reference and discourse. This talk describes the application of Semantic Web formalisms, RDF, OWL/DL and SPARQL, to facilitate the interoperability of linguistic corpora and linguistic annotations. Interoperability of linguistic corpora involves two aspects: structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability (annotations of different origin are linked to a common vocabulary). I will describe ontology-based approaches for both aspects: the POWLA ontology, which defines a data model for annotated corpora, and the Ontologies of Linguistic Annotation (OLiA), which provide definitions for linguistic categories and properties (Chiarcos 2012). Compared to state-of-the-art approaches based on standoff XML, e.g., the recently published ISO standard for a Linguistic Annotation Framework, key advantages of this approach include the existence of a rich technological ecosystem developed around RDF and OWL, including standardized query languages for directed acyclic (multi-)graphs (SPARQL), APIs and database implementations, as well as the availability of OWL reasoners that can be applied to validate the consistency of linguistic corpora and their annotations and to infer additional information that is relevant, for example, for their appropriate visualization.
Naturally, representing corpora in OWL and RDF also allows resources to be interlinked freely, e.g., different annotation layers of a multi-layer corpus, translated texts in parallel corpora, or linguistic corpora and lexical-semantic resources. Modeled in this way, corpora can be fully integrated in a Linked Open Data (sub-)cloud of linguistic resources, along with lexical-semantic resources and knowledge bases of information about languages and linguistic terminology. The second part of my talk will introduce recent efforts to create a Linked Open Data sub-cloud of linguistic resources, the Linguistic Linked Open Data cloud (Chiarcos et al. 2012, cf. http://linguistics.okfn.org).
References
Christian Chiarcos, Sebastian Hellmann, Sebastian Nordhoff, et al. (2012), The Open Linguistics Working Group, Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012). Istanbul, Turkey, May 2012. [http://www.lrec-conf.org/proceedings/lrec2012/pdf/912_Paper.pdf]
Christian Chiarcos (2012), Interoperability of Corpora and Annotations, In: Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann (eds.) Linked Data in Linguistics. Representing and Connecting Language Data and Language Metadata. Springer, Heidelberg. [http://www.springer.com/computer/ai/book/978-3-642-28248-5]
Bio
Christian Chiarcos studied Computer Science and General Linguistics at
the Technical University Berlin, Germany, and received his PhD in
Computational Linguistics from the University of Potsdam, Germany in
2010. He is currently affiliated with the University of Frankfurt/M.,
Germany. Since April 2012, he has been a visiting scholar at the ISI. His
primary areas of expertise include the study and modeling of discourse
semantics, as well as the development of infrastructures for rich and
heterogeneous linguistic annotations.
|
| 31 Oct 2012 | Marcello Federico (FBK Trento, Italy), Marco Trombetti (Translated srl, Rome - Italy) |
Towards the integration of human and machine translation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We will give an overview of the challenges and early results of an EC-funded project, named MateCat, whose goal is to develop an enhanced web-based CAT tool integrating new MT functionalities. In particular, MateCat will investigate the integration of MT into the CAT working process along three main directions: self-tuning MT, user-adaptive MT, and informative MT. In this seminar, we will report on recent activities concerning domain and on-line MT adaptation and will introduce the first version of the MateCat tool, which will be officially released as open source by the end of the year. |
| 29 Oct 2012 | Douglas W. Oard, University of Maryland |
Evaluating E-Discovery Search: The TREC Legal Track
Time: 2:00 pm - 3:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Civil litigation in this country relies on each side making relevant evidence available to the other, a process known as "discovery." The explosive growth of information in digital form has led to an increasing focus on how search technology can best be applied to balance costs and responsiveness in what has come to be known as "e-discovery". This is now a multi-billion dollar business, one in which new vendors are entering the market frequently, usually with impressive claims about the efficacy of their products or services. Courts, attorneys, and companies are actively looking to understand what should constitute best practice, both in the design of search technology and in how that technology is employed. In this talk I will provide an overview of the e-discovery process, and then I will use that background to motivate a discussion of which aspects of that process the TREC Legal Track is seeking to model. I will then spend most of the talk describing two novel aspects of evaluation design: (1) recall-focused evaluation in large collections, and (2) modeling an interactive process for "responsive review" with fairly high fidelity. Although I will draw on the results of participating teams to illustrate what we have learned, my principal focus will be on discussing what we presently understand to be the strengths and weaknesses of our evaluation designs. About the Speaker:
Douglas Oard is a Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies, where he currently serves as director of the Computational Linguistics and Information Processing lab. Dr. Oard earned his Ph.D. in Electrical Engineering from the University of Maryland. His research interests center around the use of emerging technologies to support information seeking by end users. His recent work has focused on interactive techniques for cross-language information retrieval, searching conversational media such as speech and email, evaluation design for e-discovery in the TREC Legal Track, and support for sense-making in large digital archival collections. Additional information is available at http://terpconnect.umd.edu/~oard/.
|
| 26 Oct 2012 | Philipp Koehn |
Computer Aided Translation
Time: 3:00 pm - 4:00 pm Location: 10th Floor Conference Room [1026] Abstract: Despite all the recent successes of machine translation, when it comes to high quality publishable translation, human translators are still unchallenged. Since we can't beat them, can we help them to become more productive? I will talk about some recent work on developing assistance tools for human translators. You can also check out a prototype at http://www.caitra.org/ |
| 19 Oct 2012 | Marc Schulder |
Metaphor Detection through Term Frequency
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: Metaphors are used to replace complicated or unfamiliar ideas with familiar, yet unrelated concepts that share an important attribute with the intended idea. The result is a conceptual mapping between metaphoric source and literal target meaning. Computational metaphor processing is divided into detection and interpretation. To detect metaphors, most existing approaches attempt to identify these conceptual mappings. They require resources for the source (metaphor) as well as the target domain, and a set of defined mappings between the two. Creating these resources is expensive and limits the scope of these systems. They are also usually restricted to well-observed, conventionalized metaphors, and cannot deal with neologisms. Since metaphors are a productive area of language, this is a major shortfall. We propose a statistical approach to metaphor detection that utilizes the uncommonness of novel metaphors. Words that do not match a text's typical vocabulary are highlighted as metaphor candidates. No knowledge of semantic concepts or the metaphor's source domain is required for this. We analyze the performance of this approach as an unsupervised standalone classifier and as a feature in a supervised graphical model.
|
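The frequency-based idea in the abstract above (flagging words that do not match a text's typical vocabulary) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the actual scoring function, so the relative-frequency threshold, the crude tokenizer, and the names `tokenize`, `metaphor_candidates`, and `max_rel_freq` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def metaphor_candidates(sentence, domain_corpus, max_rel_freq=0.0):
    """Return tokens of `sentence` that are atypical for `domain_corpus`.

    A token counts as atypical when its relative frequency in the domain
    corpus is at most `max_rel_freq` (assumed threshold, not from the talk).
    """
    counts = Counter(tokenize(domain_corpus))
    total = sum(counts.values()) or 1  # avoid division by zero on an empty corpus
    return [
        tok for tok in tokenize(sentence)
        if counts[tok] / total <= max_rel_freq
    ]


if __name__ == "__main__":
    finance = "the market fell today and the stock price fell as investors sold shares"
    print(metaphor_candidates("the market bled today", finance))  # ['bled']
```

With the default threshold of 0.0, only words never seen in the domain corpus are flagged; a real system would presumably need a much larger corpus and a tuned threshold.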
| 12 Oct 2012 | Jagadeesh Jagarlamudi |
Discriminative Interlingual Representations for NLP
Time: 11:00 am - 12:00 pm Location: 11th Floor Conference Room [1135] Abstract: The language barrier in many of the multilingual natural language processing (NLP) tasks, such as name transliteration, mining bilingual word translations, etc., can be overcome by mapping objects (names and words in the respective tasks) from different languages (or views) into a common low-dimensional subspace. Multi-view models learn such a low-dimensional subspace using a training corpus of paired objects, e.g. name pairs written in different languages. The central idea of my dissertation is to learn low-dimensional subspaces (or interlingual representations) that are effective for various multilingual and monolingual NLP tasks. First, I demonstrate the effectiveness of interlingual representations in mining bilingual word translations for machine translation, and then proceed to developing models for diverse situations that often arise in NLP tasks. In particular, I design models for 1) bridge setting -- when there are more than two views but we only have training data from a single pivot view into each of the remaining views 2) reranking setting -- when an object from one view is associated with a ranked list of objects from another view, and finally 3) when the underlying objects have rich structure, such as a tree.
These problem settings arise frequently in real-world applications. I choose a canonical task for each of the settings and compare my model with existing state-of-the-art baseline systems. I provide empirical evidence for the first two models on the multilingual name transliteration and part-of-speech tagging tasks, respectively. For the third problem setting, I discuss my ongoing work on the vector-based compositionality learning task. This task aims to find the meaning, represented as a vector in d-dimensional space, of a sentence or a phrase based on the meaning of its constituent words.
|
| 10 Oct 2012 | Victoria Fossum |
Sequential vs. hierarchical syntactic models of human sentence processing
Time: 2:00 pm - 3:00 pm Location: 6th Floor Conference Room [689] Abstract: Human incremental sentence processing is the process by which we read a sentence, word-by-word, and ultimately comprehend its meaning. A central question in sentence processing research is to understand the precise nature of the linguistic representations that we construct while comprehending a sentence. Experimental evidence demonstrates that syntactic structure plays a role in these representations. But open questions remain about the type of syntactic structure that is most relevant to the human sentence processing mechanism: is this syntactic structure sequential or hierarchical? Does it include lexical information (in which case it is "lexicalized"), or is lexical information processed independently from the syntactic structure (in which case the syntactic structure is "unlexicalized")? A previous study (Frank and Bod, 2011) compared unlexicalized sequential and hierarchical models of human sentence processing, and found that sequential models explain observed human behavior (e.g. eye movements) during sentence processing better than hierarchical models. The authors concluded that the human sentence processing mechanism is insensitive to hierarchical syntactic structure. We investigate this claim, and find a picture that is more complicated than the one presented by the previous study. First, we show that lexicalized syntactic models explain observed human behavior during sentence processing better than unlexicalized syntactic models. Second, we consider a broader set of sequential and hierarchical models, and show that the findings of (Frank and Bod, 2011) do not generalize to this broader set. Finally, we show why, even within the set of models considered by (Frank and Bod, 2011), their findings are not entirely conclusive. Our results indicate that the claim that the human sentence processing mechanism is insensitive to hierarchical syntactic structure is premature.
|
| 05 Oct 2012 | Dirk Hovy |
Learning Whom to Trust with MACE
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: |
| 06 Jul 2012 | Stephan Gouws (Stellenbosch University) |
Projecting features across domains using deep learning
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: Over the last few years, neural network-based deep-learning models have achieved good results in various NLP tasks, such as language modelling, POS tagging, parsing, chunking, and NER. In contrast to discrete models like HMMs, neural models operate by jointly learning continuous input representations (embeddings) and the model to interpret them. These embeddings represent words and/or phrases in a lower-dimensional, latent, syntactic-semantic space and can often be learned in an unsupervised manner. We aim to exploit this property of deep learning to transfer knowledge from resource-rich to resource-poor domains. We facilitate the transfer of knowledge by constraining the learned embeddings of both domains to share as much structural similarity as possible. I will discuss preliminary results for noisy text normalization in Twitter, where the task is to transfer the correct clean words from English to the noisy Twitter domain, and review the main deep learning models for NLP (Bengio et al. (2003), Mnih and Hinton (2007), Collobert and Weston (2008), Mikolov et al. (2010), and Socher et al. (2011)). Bio:
Stephan Gouws is a PhD student at Stellenbosch University in South Africa. He is currently on a short-term visit at the ISI. His main research focus is on developing robust, semi-supervised techniques for processing language in and across noisy domains. In 2011 he was also on a 6-month visit to the ISI during which he worked on orthographic normalization of non-standard Twitter text.
|
| 03 Jul 2012 | Ashish Vaswani |
Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
Time: 3:00 pm - 4:00 pm Location: 4th Floor Conference Room Abstract: Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. We propose a simple extension to the IBM models: an l0 prior to encourage sparsity in the word-to-word translation model. This extension has been implemented in GIZA++ and scales to large data sets. We achieve significant improvements over IBM Model 4 in both word alignment and translation quality. This is a practice talk for ACL. Bio:
Ashish Vaswani is a PhD student at ISI.
|
| 29 Jun 2012 | Bevan Jones |
Semantic Parsing with Bayesian Tree Transducers
Time: 3:00 pm - 4:00 pm Location: 4th Floor Conference Room Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the literature on tree automata, which could both clarify the relationships between different approaches and increase the generality of new contributions. We attempt to clarify the relationship by presenting a tree transducer model that is closely related to previous work made without appealing to automata theory. We then describe a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results when coupled with our model while remaining applicable to any domain employing probabilistic tree transducers (not just semantic parsing). This is joint work with Mark Johnson and Sharon Goldwater, to be presented at this year’s ACL. Bio:
I research computational models of language acquisition, exploring questions of how linguistic structure and meaning might interact during learning. For instance, I have worked on Bayesian models of unsupervised word segmentation, exploring how simultaneous word meaning acquisition influences the identification of lexical boundaries. Currently, I work on semantic parsing, using a combination of Bayesian techniques and automata theory to model more complex structural relationships between compositional meaning and syntactic structure. My PhD began at the department of Cognitive, Linguistic and Psychological Sciences at Brown University but has since moved to the School of Informatics at the University of Edinburgh and the Computing Department of Macquarie University.
|
| 22 Jun 2012 | Vita Markman (Disney Interactive) |
Discovering Latent Similarities in Car Models Based On Customer Reviews: Towards a Consumer-Driven Product Recommendation System
Time: 3:00 pm - 4:00 pm Location: 11th Floor Conference Room [1135] Abstract: This pilot study explores the hypothesis that customer reviews of cars can be used to create and/or fine-tune a recommendation system that offers a ranked list of the top-N matches for a given vehicle. Our main premise is that positive or negative reviews invariably focus on the features relevant to the car being reviewed and hence can be used to uncover subtle similarities among various car models, as well as discover macro-types of cars (e.g. family cars, luxury, high-performance sports, etc.). To discover similar models based on reviews, we propose a Weighted Dice Coefficient which weighs each shared or non-shared word token by its tf-idf score. The five closest cars are then identified for each of the 226 reviewed car models. We also show that integrating tf-idf scores into the similarity metric improves the accuracy of the top five picks, as compared to the standard Dice Coefficient. Bio:
I graduated from Rutgers in 2005 with a PhD in Linguistics. Having taught linguistics at Pomona College and Simon Fraser University between 2006 and 2008, I moved into industry in 2008. I currently work as a Computational Linguist at Disney Interactive Media Group. My work primarily concerns developing natural language processing techniques to ensure that the content of Disney's online chat is safe for kids. My work involves developing various NLP methods that filter online chat for inappropriate content, while taking into account the vast informality, sparsity, and noise of the on-line child chat language. In addition, I conduct independent research on Twitter data, specifically clustering one-line micro-tweets by topic. My additional research includes mining online car reviews to identify common car-types based on the features people rate as positive or negative.
|
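One way to read the Weighted Dice Coefficient mentioned in the abstract above is sketched below. The abstract does not state the exact formula, so the definition used here is an assumption: the numerator sums the tf-idf weights of shared words in both reviews, and the denominator sums all weights in both reviews, which reduces to the standard Dice Coefficient when every weight is 1. The function names and the toy review data are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: dict name -> list of tokens. Returns dict name -> {word: tf-idf weight}."""
    n = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(w for toks in docs.values() for w in set(toks))
    weights = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        weights[name] = {
            w: (c / len(toks)) * math.log(n / df[w]) for w, c in tf.items()
        }
    return weights


def dice(a, b):
    """Standard Dice Coefficient on two token collections (treated as sets)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def weighted_dice(wa, wb):
    """Assumed weighted variant: shared weight mass divided by total weight mass."""
    shared = wa.keys() & wb.keys()
    num = sum(wa[w] + wb[w] for w in shared)
    den = sum(wa.values()) + sum(wb.values())
    return num / den if den else 0.0


if __name__ == "__main__":
    reviews = {  # toy data, not from the study
        "sedan": ["quiet", "roomy", "safe", "car"],
        "minivan": ["roomy", "safe", "kids", "car"],
        "roadster": ["fast", "loud", "fun", "car"],
    }
    w = tfidf(reviews)
    print(dice(reviews["sedan"], reviews["roadster"]))   # 0.25
    print(weighted_dice(w["sedan"], w["roadster"]))      # 0.0
    print(weighted_dice(w["sedan"], w["minivan"]))
```

In this toy data the word "car" appears in every review, so its tf-idf weight is zero: the plain Dice score still credits the sedan and roadster for sharing it, while the weighted score does not.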
| 25 May 2012 | Liang Huang |
Structured Perceptron with Inexact Search (NAACL HLT Practice Talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Most existing theory of structured prediction assumes exact inference, which is intractable in many practical problems. This leads to the routine use of approximate inference such as beam search, but there is not much theory behind it. Based on the structured perceptron, we propose a general framework of "violation-fixing" perceptrons for inexact search with a theoretical guarantee for convergence under new separability conditions. This framework subsumes and justifies the popular heuristic "early-update" for perceptron with beam search (Collins and Roark, 2004). We also propose several new update methods within this framework, among which the "max-violation" method dramatically reduces training time (3-fold compared to early-update) on state-of-the-art part-of-speech tagging and incremental parsing systems. |
| 18 May 2012 | Dirk Hovy |
Exploiting Partial Annotations with EM Training (NAACL HLT Practice Talk)
Time: 3:30 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: For many NLP tasks, EM-trained HMMs are the common models. However, in order to escape local maxima and find the best model, we need to start with a good initial model. Researchers suggested repeated random restarts or constraints that guide the model evolution. Neither approach is ideal. Restarts are time-intensive, and most constraint-based approaches require serious re-engineering or external solvers. In this paper we measure the effectiveness of very limited initial constraints: specifically, annotations of a small number of words in the training data. We vary the amount and distribution of initial partial annotations, and compare the results to unsupervised and supervised approaches. We find that partial annotations improve accuracy and reduce the need for random restarts, which speeds up training time considerably. |
| 18 May 2012 | Jason Riesa |
Automatic Parallel Fragment Extraction From Noisy Data (NAACL HLT Practice Talk)
Time: 3:00 pm - 3:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present a novel method to detect parallel fragments within noisy parallel corpora. Isolating these parallel fragments from the noisy data in which they are contained frees us from noisy alignments and stray links that can severely constrain translation-rule extraction. We do this with existing machinery, making use of an existing word alignment model for this task. We evaluate the quality and utility of the extracted data on large-scale Chinese-English and Arabic-English translation tasks and show significant improvements over a state-of-the-art baseline. |
| 03 May 2012 | Dirk Hovy |
Using Syntactic Information for Unsupervised Relation Extraction and Typing (Thesis Proposal Practice Talk)
Time: 4:00 pm - 5:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Question Answering (QA) is a longstanding goal in Natural Language Processing (NLP). In its simplest form, QA relies on keyword matching to find single-word answers (e.g., search engines). But single words taken out of context are ambiguous -- only context disambiguates them. This meaningful context comes in the form of syntactic and/or semantic relations between predicates and arguments. Relations are thus at the core of meaning and information. Systems like Siri or Watson have put QA in more widespread use, and users move away from single-word questions to more complex ones. Finding and classifying relations to answer those questions will thus become the central challenge for future QA systems. The large number of relations makes relation extraction challenging; given a sentence, many possible relations can be extracted. If we can specify the relations we are interested in beforehand, we can annotate data to train supervised systems. Often though, definition beforehand is impossible, and we have to find all possible relations that hold in a text. In those cases, we must rely on unsupervised approaches. A second problem is rapid adaptation to new domains and topics. Relations extracted from one domain may not be relevant to another. A third problem is variation in the ways relations are expressed in text. Often, intervening words and phrases between predicates and arguments cause fixed-window pattern matching approaches to fail. Most previous relation extraction approaches have either relied on annotated data or (semi-) structured sources of information. These approaches require pre-defined relations and manually annotated data. Furthermore, many of these approaches rely on pattern matching over surface strings, which is not robust to variations. 
If previous approaches used unsupervised training methods, they largely focused on clustering, effectively ignoring sequential structure in the data. The future of QA will require us to quickly adapt to new domains and topics with little annotated data. Only if we can discover and disambiguate relations automatically can we build systems capable of open-ended QA. I present several techniques for discovering relations from text. I show how to use unsupervised sequential models to discover relations from raw text. These methods do not require any existing resources, manual annotation, or pre-defined relations, and can be applied to any domain. I use dependency parse structures as inputs to these methods, making these approaches more robust to surface variations. I show improvements over state-of-the-art systems as well as novel approaches to fully exploit the structure contained in the data.
|
| 27 Apr 2012 | Christian Chiarcos (Uni Potsdam) |
Towards operationalizable models of discourse phenomena: Addressing discourse relations
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: The modeling of discourse has been a major topic of research in the linguistics and AI communities for decades. With respect to language, discourse phenomena refer to the use of linguistic indicators that reflect the functional organization of utterances, relationships between different utterances, with the interlocutors' state of mind and with the situational surrounding. The development of models of discourse that are operationalizable (as a part of NLP applications) is essential, for example, in machine translation:
* to interpret, to translate and to generate pronouns, definite and indefinite NPs correctly,
* to translate non-canonical constructions (e.g., passive),
* to generate the correct word order (e.g., when translating into a free-word order language),
* to insert or to drop discourse markers and conjunctions, or
* to choose the appropriate type of syntactic embedding in complex sentences.
In other branches of NLP, different aspects of discourse are important, e.g., relations between utterances (machine reading), the hierarchical organization of discourse (text summarization) and the sequential organization of utterances in a text (text structuring/natural language generation). Numerous models of different aspects of discourse have been proposed, including discourse structure (the hierarchical organization of utterances in discourse), discourse relations (relations between independent utterances in discourse), information structure (the functional structure of utterances in context), and information status (accessibility of antecedents of pronouns, definite descriptions and elliptic constructions). These approaches range from relatively abstract models from cognitive and functional linguistics (e.g., Givon 1983), over elaborate formal models developed in formal semantics (e.g., Asher 1993), to "parameterized", rule-based models in AI (e.g., Grosz et al. 1995).
Since the mid-1990s, this traditional, "theory-centered" line of research has been complemented with an "annotation-centered" methodology, i.e., the development and the use of annotated corpora to test predictions and to develop statistical classifiers. In the first part of the talk, I describe selected activities of the applied computational linguistics group at the University of Potsdam/Germany in this direction, which include
* the annotation of discourse structure, coreference, information structure and information status (Stede 2004, Krasavina and Chiarcos 2007, Ritz et al. 2008),
* the development of generic multi-layer architectures capable of representing and accessing these annotations along with other types of annotation applied to the same stretch of data (Chiarcos et al. 2008), e.g., annotations for constituent syntax, dependency syntax, or frame semantics, and
* the application of machine learning techniques to predict discourse features from less abstract annotation layers (Ritz 2007, Chiarcos 2011).
The primary drawback of annotation-centered models is the immense cognitive (and thus, financial) effort necessary to produce reliable discourse annotations. One way to address this problem is to make use of corpora without discourse annotations to test predictions of candidate models, and to develop unsupervised or weakly supervised approaches to support or to replace manual annotation. In the second part of my talk, this "data-centered" approach to discourse will be illustrated using the example of discourse relations, one of the main topics of my work at ISI. I describe a pilot study that shows that significant, reproducible and interpretable insights about the discourse relation (that is likely to be) connecting a pair of events can be achieved from a sufficiently large corpus with syntax annotations only. Further, possible lines for subsequent research will be sketched.
Nicholas Asher (1993). Reference to Abstract Objects in Discourse. Kluwer, Dordrecht.
Christian Chiarcos (2011). Evaluating salience metrics for the context-adequate realization of discourse referents. In: Proceedings of the 13th European Workshop on Natural Language Generation (ENLG 2011). Association for Computational Linguistics, Nancy, France, Sep 2011, 32-43.
Christian Chiarcos, Stefanie Dipper, Michael Götze, Ulf Leser, Anke Lüdeling, Julia Ritz, and Manfred Stede (2008). A Flexible Framework for Integrating Annotations from Different Tools and Tagsets. TAL (Traitement automatique des langues) 49 (2): 218-248.
Talmy Givon (ed., 1983). Topic Continuity in Discourse: A Quantitative Cross-Language Study. John Benjamins, Amsterdam and Philadelphia.
Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein (1995). Centering: A framework for modelling the local coherence of discourse. Computational Linguistics, 21(2):203–225.
Olga Krasavina and Christian Chiarcos (2007). PoCoS - Potsdam Coreference Scheme. In Proceedings of the Linguistic Annotation Workshop. Held in Conjunction with the ACL-2007, Prague, Czech Republic, pages 156–163.
Julia Ritz, Svetlana Petrova, Michael Götze, and Stefanie Dipper (2007). Automatic Identification of Information Structure in Small Corpora of Modern and Old High German. GLDV-Frühjahrstagung 2007, Tübingen, Germany.
Julia Ritz, Stefanie Dipper, and Michael Götze (2008). Annotation of Information Structure: An Evaluation Across Different Types of Texts. In Proceedings of the 6th LREC conference. Marrakech, Morocco.
Manfred Stede (2004). The Potsdam Commentary Corpus. In Bonnie Webber and Donna K. Byron, editors, Proceedings of the ACL-2004 Workshop on Discourse Annotation, Barcelona, pages 96–102.
Biography: Christian Chiarcos, born 1977, studied Computer Science (MSc, 2002) and General Linguistics (MA, 2004) at the Technical University Berlin, Germany. From 2002 to 2003, he received a scholarship in the context of the project "Collocations in Dictionary" at the Berlin-Brandenburg Academy of Science under the auspices of Christiane Fellbaum (Princeton). From 2003 to 2005, he participated in the graduate school "Economy and Complexity in Language" at the Humboldt-University at Berlin and the University of Potsdam, Germany, where he developed a corpus-based approach to predict syntactic alternations for Natural Language Generation. This research formed the basis for his PhD thesis "Mental Salience and Grammatical Form" (University of Potsdam, 2010).
Since 2006, he has worked in the Applied Computational Linguistics group at the University of Potsdam, Germany, where he has participated in different research projects dedicated to the development of interoperable infrastructures for NLP and multi-layer corpora. Since 2007, this research has been carried out in the context of the Collaborative Research Center "Information Structure", a multidisciplinary network of projects at the University of Potsdam and the Humboldt-University Berlin, dedicated to the study of discourse phenomena.
|
| 16 Mar 2012 | Jason Riesa |
Syntactic Alignment Models for Large-Scale Translation (PhD Defense Practice Talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Word alignment, the process of inferring the implicit links between words across two languages, serves as an integral piece of the puzzle of learning linguistic translation knowledge. It enables us to acquire automatically from data the rules that govern the transformation of words, phrases, and syntactic structures from one language to another. Word alignment is used in many tasks in Natural Language Processing, such as bilingual dictionary induction, cross-lingual information retrieval, and distilling parallel text from within noisy data. In this talk, we focus on word alignment for statistical machine translation.
We advance the state-of-the-art in search, modeling, and learning of alignments and show empirically that, when taken together, these contributions significantly improve the output quality of large-scale statistical machine translation, outperforming existing methods. The work we describe may be used for any language-pair, supporting arbitrary and overlapping features from varied sources.
|
| 17 Feb 2012 | Adam Pauls (UC Berkeley) |
Large Scale Syntactic Language Modeling with Treelets
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We propose a simple generative syntactic language model that conditions on overlapping tree contexts in the same way that n-gram language models condition on overlapping sentence context. We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. We evaluate on a range of grammaticality tasks, and find that we consistently outperform n-gram models and other generative baselines, and even compete with state-of-the-art discriminative models hand-designed for each task, despite training on positive data alone. We also show some improvements in preliminary machine translation experiments. |
| 10 Feb 2012 | Liang Huang |
Efficient Search and Learning for Language Understanding and Translation
Time: 3:00 pm - 4:00 pm Location: 6th Floor Large Conference Room [689] Abstract: What is in common between translating from English into Chinese and compiling C++ into machine code? And yet what are the differences that make the former so much harder for computers? How can computers learn from human translators? This talk sketches an efficient (linear-time) "understanding + rewriting" paradigm for machine translation inspired by both human translators as well as compilers. In this paradigm, a source language sentence is first parsed into a syntactic tree, which is then recursively converted into a target language sentence via tree-to-string rewriting rules. In both "understanding" and "rewriting" stages, this paradigm closely resembles the efficiency and incrementality of both human processing and compiling. We will discuss these two stages in turn. First, for the "understanding" part, we present a linear-time approximate dynamic programming algorithm for incremental parsing that is as accurate as those much slower (cubic-time) chart parsers, while being as fast as those fast but lossy greedy parsers, thus getting the advantages of both worlds for the first time, achieving state-of-the-art speed and accuracy. But how do we efficiently learn such a parsing model with approximate inference from huge amounts of data? We propose a general framework for structured prediction based on the structured perceptron that is guaranteed to succeed with inexact search and works well in practice.
Next, the "rewriting" stage translates these source-language parse
trees into the target language. But parsing errors from the previous
stage adversely affect translation quality. An obvious solution is to
use the top-k parses, rather than the 1-best tree, but this only helps
a little bit due to the limited scope of the k-best list. We instead
propose a "forest-based approach", which translates a packed forest
encoding *exponentially* many parses in a polynomial space by sharing
common subtrees. Large-scale experiments showed very significant
improvements in terms of translation quality, which outperforms the
leading systems in literature. Like the "understanding" part, the
translation algorithm here is also linear-time and incremental, thus
resembles human translation.
|
| 13 Jan 2012 | Hercules Dalianis (Stockholm University) |
Reusing clinical documentation for better health
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Today a large number of Electronic Patient Records (EPRs) are produced for legal reasons, but they are very seldom reused, either for clinical research or for business (hospital) intelligence. Moreover, the clinician's daily work in documenting the patient status is not always supported in a proper way. Hospital management needs key, real-time information about the health care processes. Simultaneously, patients have become more demanding customers who want to be involved in their own health care process. We are aiming to support these demands. Clinical documentation forms an abundant source from which to extract valuable information that can be used for this purpose; however, clinical corpora contain protected health information and must be stored securely. In Sweden alone (with a population of 10 million), 4-10 million pages of patient records are produced each year. We have studied the Stockholm EPR Corpus, a huge clinical document collection written in Swedish, containing over one million patient records. The document collection is distributed over 900 clinics from the Stockholm area, encompassing the three years 2006-2008. We have used this clinical corpus as a knowledge base to develop a set of tools that can work as basic building blocks for future tools for health engineering. We have been assisted by physicians who have interpreted the content of the clinical text for us, annotated the clinical text, and, together with their colleagues, set requirements for these tools. We have identified four groups of users in the health domain: physicians, clinical researchers, hospital management and patients. We will show examples of these tools and the benefits they will bring to health care.
1) For physicians: Automatic ICD-10 assignment 2) For clinical researchers: Comorbidity networks 3) For hospital management: ICD-10 validation and adverse event detection, and finally 4) For patients: automatic text summarization. Brief Bio: Dr. Hercules Dalianis, Professor, born 20 July 1959. Dalianis is a professor in Computer and Systems Sciences at Stockholm University. Dalianis received his Ph.D. in 1996. Dalianis was a postdoc researcher at University of Southern California/ISI in Los Angeles in 1997. Dalianis was also a postdoc researcher (forskarassistent) at KTH-Royal Institute of Technology in Stockholm, 1999-2003. Dalianis held a three-year guest professorship at CST, University of Copenhagen during 2002-2005, funded by Norfa, the Nordic council. Dalianis works in the interface between industry and university, with the aim of making research results useful for society. Dalianis has specialized in the area of human language technology, to make computers understand and process human language text, but also to make computers produce text automatically. Currently Dalianis is working in the area of clinical text mining with the aim of improving health care in the form of better electronic patient record systems, presentation of the patient records and extraction of valuable information for clinical researchers as well as for the patients.
|
| 16 Dec 2011 | Chris Dyer (Carnegie Mellon) |
Generate-and-Test Models for Alignment and Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: I discuss translation as an optimization problem subject to three kinds of constraints: lexical, configurational, and constraints enforcing target-language wellformedness. Lexical constraints ensure that the lexical choices in the output are meaning-preserving; configurational constraints ensure that the relationships between source words and phrases (e.g., semantic roles and modifier-head relationships) are properly transformed in translation; and target-language wellformedness constraints ensure the grammaticality of the output. In terms of the traditional source-channel model of Brown et al. (1993), the "translation model" encodes lexical and configurational constraints and the "language model" encodes target language wellformedness constraints. On the other hand, the constraint-based framework suggests a generate-and-test (discriminative) model of translation in which features are sensitive to input and output structures and the feature weights are trained to maximize the (conditional) likelihood of a corpus of example translations. The specified features represent empirical hypotheses about what variables correlate (but not why) and thus encode domain-specific knowledge that is useful for the problem at hand; the learned weights indicate to what extent these hypotheses are confirmed or refuted. To verify the usefulness of the feature-based approach, I discuss the performance of two models: first, a lexical translation model evaluated by the word alignments it learns. Unlike previous unsupervised alignment models, the new model utilizes features that capture diverse lexical and alignment relationships, including morphological relatedness, orthographic similarity, and conventional co-occurrence statistics.
Results from typologically diverse language pairs demonstrate that the generate-and-test model provides substantial performance benefits compared to state-of-the-art generative baselines. Second, I discuss the results of an end-to-end translation model in which lexical, configurational, and wellformedness constraints are modeled independently. Because of the independence assumptions, the model is substantially more compact than state-of-the-art translation models, but still performs significantly better on languages where source-target word order differences are substantial.
Bio: Chris Dyer is a postdoctoral researcher in Noah Smith's lab in
the Language Technologies Institute at Carnegie Mellon University. He
completed his PhD on statistical machine translation with Philip
Resnik at the University of Maryland in 2010. Together with Jimmy Lin,
he is author of "Data-Intensive Text Processing with MapReduce",
published by Morgan & Claypool in 2010. Current research interests
include machine translation, unsupervised learning, Bayesian
techniques, and "big data" problems in NLP.
|
| 12 Dec 2011 | Gael Dias (University of Caen Basse-Normandie, France) |
Cross Domain Subjectivity Classification using Multi-View Learning
Time: 4:00 pm - 5:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: In this talk, we will present our research on learning models with high cross domain accuracy for subjectivity classification. After a brief introduction to related work and the challenges of sentiment analysis, we will start by presenting new features for subjectivity analysis. Then, we will present two different paradigms of multi-view learning strategies to learn transfer models: multi-view learning with agreement and guided multi-view learning. Then, we will present an exhaustive evaluation based on both paradigms including two state-of-the-art algorithms and show that accuracy over 91% can be obtained using three views. In our concluding remarks, we will talk about future extensions of the presented methodology. Then, we will briefly present the Human Language Technology team of the GREYC Laboratory of the University of Caen Basse-Normandie (France) and present projects that are being studied and further prospects. Biography: Gael Dias is a full professor at the University of Caen Basse-Normandie (France). His research interests include unsupervised methodologies for text mining, information retrieval and text summarization. His recent research focuses on Sentiment Analysis, Ontology Learning, Lexical Semantics, Web Personalization and Collaboration, Temporal Information Retrieval, and Paraphrase Extraction and Identification. He has served on program committees of international conferences and workshops such as ACL/HLT 2011, COLING 2010, IJCNLP/ACL 2009, ACL 2007, HLT-NAACL 2007, COLING/ACL 2006, and is or has been a reviewer for Information Processing and Management, IEEE Transactions on Audio, Speech and Language Processing, Natural Language Engineering Journal, Journal of Language Resources and Evaluation, Journal of Computer Speech and Language and ACM Transactions on Speech and Language Processing.
|
| 04 Nov 2011 | Ariya Rastrow (Johns Hopkins) |
Going beyond n-grams: Incorporating non-local dependencies for Speech Recognition
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Due to the availability of large amounts of training data and computational resources, building more complex models with sentence level knowledge and longer dependencies has been an active area of research in automatic speech recognition (ASR). Yet, due to the complexity of the speech recognition task, integration of many of these complex and sophisticated knowledge sources into the first decoding pass is not feasible. Many of these long-span models cannot be represented as weighted finite-state automata (WFSA), making it difficult even to incorporate them in a lattice rescoring pass. First, we motivate our work by providing compelling empirical evidence that n-gram LMs are not sufficient for the ASR task and showing why we need to incorporate non-local features such as syntax. The development of language models with such long-span (non-local) features is underway, but is not addressed in this talk. We instead address how such models should be trained discriminatively and applied effectively. Specifically, we describe a new approach for rescoring speech lattices with such models (acoustic or language) that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively improve it using the new model. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order LM, our technique evaluates fewer hypotheses than conventional N-best rescoring by up to two orders of magnitude. We also propose to integrate the idea of hill climbing into the training of discriminative language models with non-local sentence level features.
Discriminative models provide the flexibility to include both local n-gram features and arbitrary sentence level features. However, unlike generative LMs with long-span dependencies where one has to resort to N-best lists only during decoding (rescoring), discriminative models force the use of N-best lists even for LM training. We demonstrate significant computational saving during training as well as error-rate reduction over N-best training methods. Bio: Ariya Rastrow is a Ph.D. candidate at Johns Hopkins University, working with Sanjeev Khudanpur and Mark Dredze. He was initially advised by Fred Jelinek. The focus of his PhD research is to advance speech recognition systems to efficiently incorporate linguistically motivated non-local features into language models. In his recent work, he has developed an efficient hill-climbing algorithm to apply non-local complex models for the speech recognition task. He has also worked on out-of-vocabulary (OOV) detection, spoken term detection and semi-supervised adaptation techniques for speech recognition.
|
| 07 Oct 2011 | Ekaterina Ovchinnikova |
Integration of World Knowledge for Natural Language Understanding
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Traditional inference-based natural language understanding (NLU) in a computational framework suffered mainly from a lack of a sufficiently large knowledge base of commonsense knowledge. Recent advances have changed this situation: A large amount of machine-readable knowledge is now freely available to the community. This talk focuses on exploiting these developments to model large-scale NLU in an inference-based framework. The three main types of the existing knowledge sources are lexical-semantic dictionaries, distributional resources, and ontologies. After comparing these types of resources and outlining their differences, I will present an integrative knowledge base combining lexical-semantic, ontological, and distributional knowledge in a modular way. I will then talk about reasoning procedures able to make use of the large scale knowledge base. In particular, I will compare two main forms of logical inferences applied to NLU: deduction and abduction.
In the last part of the talk, I will present experiments on the
following knowledge-intensive NLU tasks: recognizing textual
entailment, semantic role labeling, and paraphrasing of noun-noun
dependencies.
|
| 04 Oct 2011 | Steve DeNeefe |
Tree-adjoining Machine Translation (Ph.D. Defense Practice Talk)
Time: 4:00 pm - 5:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Machine Translation (MT) is the task of translating a document from a source language (e.g., Chinese) into a target language (e.g., English) via computer. State-of-the-art statistical approaches to MT use large collections of human-translated documents as training material, gathering statistics on the patterns of correspondence between languages according to the features specified by the translation model. Using this bilingual translation model in conjunction with a target language model, created by gathering statistics from a large monolingual corpus, a new document in the source language can be automatically translated into its target-language equivalent with surprising accuracy. Much MT research focuses on types of the patterns and features to include in a translation model. Recent statistical MT models have used syntax trees to enforce grammaticality, but the currently popular tree substitution models only memorize sequences of words or constituents, specifying exactly what phrases to use and exactly what trees are grammatical, which does not generalize well. Adding the operation of tree-adjoining provides the freedom to splice additional information into an existing grammatical tree. An adjoining translation model allows general, linguistically-motivated translation patterns to be learned without the clutter of endless variations of optional material. The appropriate modifiers, such as adjectives, adverbs, and prepositional phrases, can be grafted into these core patterns as needed to translate details. We show that the increased generalization power provided by adjoining, when used carefully, improves MT quality without becoming computationally intractable.
In this thesis, we describe challenges encountered by both word-sequence-based
and syntax-tree-based MT systems today, and present an
in-depth, quantitative comparison of both models. Then we describe a
novel model for statistical MT which addresses these challenges using
a synchronous tree-adjoining grammar. We introduce a method of
converting these grammars to a weakly equivalent tree transducer for
decoding. Then we present a method for learning the rules and
associated probabilities of this grammar from aligned tree/string
training data, and empirically analyze important characteristics of
the resulting model, considering and evaluating many variations.
Finally, our results show that adjoining delivers a consistent
improvement over a baseline statistical syntax-based MT model on both
medium and large-scale MT tasks using several language pairs.
|
| 30 Sep 2011 | Dirk Hovy |
Aligning Events and Time Stamps
Time: 4:00 pm - 5:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Machine Reading relies to a large extent on information about entities and events. While the definition of events is controversial, most people agree that they have certain properties like a time and a place. We exploit this by trying to establish relations between events (such as "bombing" or "election") and temporal expressions that can be resolved to a timestamp, i.e., an expression like "last Tuesday" to an absolute value like 20110802. This enables a number of interesting applications, such as generation of absolute timelines, cross-document event coreference, and resolution of logical discrepancies.
We define a baseline approach and improve upon it by identifying important subproblems (within-sentence vs. across-sentence), casting them as a relation extraction problem and showing that classification with kernel methods works well in capturing the information. Our results are competitive with previous approaches and reach an F-score of 76.6.
We also show that resolution across sentences is a lot harder and cannot be approached with the same techniques used for the within-sentence case. We outline some promising findings and suggest further research.
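As a small illustration of the timestamp resolution mentioned in the abstract, the sketch below resolves an expression of the form "last <weekday>" against a reference date. The function name `resolve_last_weekday` and the reference date (a hypothetical document date of 5 August 2011) are assumptions made for this example, not details from the talk; a real temporal tagger would handle many more expression types.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_last_weekday(expression: str, reference: date) -> str:
    """Resolve 'last <weekday>' relative to `reference`, returning YYYYMMDD."""
    words = expression.lower().split()
    if len(words) != 2 or words[0] != "last" or words[1] not in WEEKDAYS:
        raise ValueError(f"unsupported expression: {expression!r}")
    target = WEEKDAYS.index(words[1])
    # Number of days to step back; if the reference date falls on the
    # target weekday itself, go back a full week.
    delta = (reference.weekday() - target) % 7 or 7
    return (reference - timedelta(days=delta)).strftime("%Y%m%d")


if __name__ == "__main__":
    # Hypothetical document date: Friday, 5 August 2011.
    print(resolve_last_weekday("last Tuesday", date(2011, 8, 5)))  # 20110802
```

Once both the event mention and the temporal expression carry absolute values like this, aligning them reduces to deciding which timestamp belongs to which event, which is the relation extraction problem the abstract describes.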
|
| 16 Sep 2011 | Cerstin Mahlow (University of Zurich) |
Linguistically supported editing and revising: concept and prototypical implementation based on interactive NLP resources
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Composing, revising, and editing are highly demanding tasks. Even in polished and published texts from professional writers we can observe errors and mistakes. For many errors, we can infer how they came to be: Word processors offer character-based functions only. These functions do not take into account elements and structures of the language the author is using. Authors are thus forced to translate their high-level goals into long and complex sequences of low-level character-based functions. Both the translation process and the execution of such sequences of functions are error-prone. However, in text editors for programmers we find so-called language-aware editing functions. These functions operate on the elements and structures of a programming or mark-up language and help to avoid errors, as language-aware functions make revising and editing less tedious and error-prone. We argue that the concept of language awareness can be transferred to writing natural language texts using word processors. We propose functions that take the structures of natural languages into consideration. We distinguish information functions, movement functions, and operations to support revising and editing. The design is based on current findings from writing research. Language-aware editing functions rely on the recognition and categorization of relevant elements and structures with respect to a certain language. We use methods and resources from computational linguistics for morphological analysis and generation, and for part-of-speech tagging. When evaluating respective resources we face a rather disappointing situation: NLP resources for German are less suitable than assumed and less applicable for real-world applications than usually claimed in the literature.
Our prototypical implementation of language-aware functions for revising and
editing of German texts serves as a proof of concept. The implementation
illustrates opportunities and limits of current NLP resources for German.
|
| 09 Sep 2011 | Richard Socher (Stanford University) |
Recursive Deep Learning in Natural Language Processing and Computer Vision
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Hierarchical and recursive structure is commonly found in different modalities, including natural language sentences and scene images. I will present some of our recent work on three recursive neural network architectures that learn meaning representations for such hierarchical structure. These models obtain state-of-the-art performance on several language and vision tasks. The meaning of phrases and sentences is determined by the meanings of their words and the rules of compositionality. We introduce a recursive neural network (RNN) for syntactic parsing which can learn vector representations that capture both syntactic and semantic information of phrases and sentences. For instance, the phrases "declined to comment" and "would not disclose" have similar representations. Since our RNN does not depend on specific assumptions for language, it can also be used to find hierarchical structure in complex scene images. This algorithm obtains state-of-the-art performance for semantic scene segmentation on the Stanford Background and the MSRC datasets and outperforms Gist descriptors for scene classification by 4%. The ability to identify sentiments about personal experiences, products, movies etc. is crucial to understand user generated content in social networks, blogs or product reviews. The second architecture I will talk about is based on semi-supervised recursive autoencoders (RAE). RAEs learn vector representations for phrases well enough to outperform other traditional supervised sentiment classification methods on several standard datasets. Lastly, I describe an alternative unsupervised RAE model that can learn features which outperform previous approaches for paraphrase detection on the Microsoft Research Paraphrase corpus. This talk presents joint work with Andrew Ng and Chris Manning.
Bio: Richard Socher is a Computer Science PhD student at Stanford, co-advised by Chris Manning and Andrew Ng. Most recently, he won the Yahoo! Key Scientific Challenges Program Award and the Distinguished Application Paper Award at ICML, 2011 for his work on recursive deep learning.
|
| 24 Aug 2011 | Sravana Reddy |
Cracking Running-Key Ciphers and Deciphering Speech (Interns Final Talk)
Time: 2:30 pm - 3:00 pm Location: 4th Floor Large Conference Room [460] Abstract: In the first part of this talk, I will discuss our work on deciphering running-key ciphers, which are produced by encrypting the plaintext with a natural language string of the same length as the plaintext (the 'running key'). These ciphers are harder to crack than simple substitution ciphers, and no previous work has succeeded in decoding them.
The second part of the talk will address the problem of speech recognition without access to word pronunciations or annotated training data. The problem's motivations arise from languages and domains where pronunciation lexicons and transcribed speech are not available. Given a representation of the speech as a sequence of phonemes, and a language model from non-parallel text, we present methods to find the sequence of words corresponding to the speech input.
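To make the running-key cipher from the first part of the abstract concrete, here is a minimal sketch. The abstract does not say how plaintext and key letters are combined, so this assumes the classic variant in which letters are added position by position modulo 26 (the tabula recta). The function names and the toy strings are illustrative only.

```python
import string

ALPHABET = string.ascii_uppercase


def _combine(text: str, key: str, sign: int) -> str:
    """Add (sign=+1) or subtract (sign=-1) key letters from text letters mod 26."""
    text, key = text.upper(), key.upper()
    if len(text) != len(key):
        raise ValueError("the running key must be exactly as long as the text")
    if not all(ch in ALPHABET for ch in text + key):
        raise ValueError("only the letters A-Z are supported in this sketch")
    return "".join(
        ALPHABET[(ALPHABET.index(t) + sign * ALPHABET.index(k)) % 26]
        for t, k in zip(text, key)
    )


def running_key_encrypt(plaintext: str, key: str) -> str:
    """Encrypt by adding each key letter to the matching plaintext letter."""
    return _combine(plaintext, key, +1)


def running_key_decrypt(ciphertext: str, key: str) -> str:
    """Decrypt by subtracting each key letter from the matching ciphertext letter."""
    return _combine(ciphertext, key, -1)


if __name__ == "__main__":
    cipher = running_key_encrypt("HELLO", "WORLD")
    print(cipher, running_key_decrypt(cipher, "WORLD"))
```

Because both the plaintext and the key are natural language, a decipherment attack has to recover two fluent strings from one ciphertext, which is part of what makes these ciphers harder than simple substitution.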
|
| 24 Aug 2011 | Xuchen Yao |
Introducing context-dependent features into machine translation (Interns Final Talk)
Time: 2:00 pm - 2:30 pm Location: 4th Floor Large Conference Room [460] Abstract: One fundamental assumption in machine translation is that sentences are translated independently of each other. We attack this assumption by trying to achieve lexical translation consistency among sentences within the same document. An additional lexicon reuse feature is introduced to help the decoder select a more consistent translation. In this talk we will discuss the design of the reuse feature and show experimental results. |
| 19 Aug 2011 | Stephen Tratz (PhD defense practice talk) |
Semantically-Enriched Parsing for Natural Language Understanding
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This thesis details three contributions to the advancement of semantically-enriched parsing for English sentences: inventories of semantic relations covering three semantically ambiguous linguistic phenomena, large datasets annotated according to the inventories, and, finally, a suite of tools for semantically-enriched parsing built using the data. For the purposes of this thesis, semantically-enriched parsing is defined as the reconstruction of the underlying grammatical structure of text along with shallow semantic annotation of semantically-ambiguous structures. Ultimately, semantically-enriched parsing is one of the most critical steps in natural language understanding---the initial step in which the text is read by the machine into a knowledge representation for further processing and reasoning. The first contribution of this thesis is to advance the theoretical foundations for the interpretation of three ambiguous linguistic phenomena in English that have significant overlap in terms of the relations expressed: noun compounds, possessive constructions, and prepositions. For these, I define inventories of relations based upon extensive annotation by myself, previous work by others, and inter-annotator agreement studies. In the case of prepositions, the relations are created by refining an existing resource whereas the other two are created from scratch. In addition to mappings to prior work, mappings are provided across the different inventories in order to create a unified set of relations. Second, I produce large datasets annotated according to the aforementioned sense inventories. Such data is vital for training most automatic tools and also provides exemplars for the theory embodied in the inventories. Some of these datasets are created from scratch, including a collection of over 17,500 noun compounds and a collection of over 21,900 possessive construction examples.
In the case of prepositions, an existing resource including over 24,000 annotated examples is refined. The final contribution is a suite of tools that can construct semantically-enriched parse trees. The suite is designed to work in a sequential, pipeline-like fashion and can be thought of as consisting of two subsections. The first part reconstructs the grammatical structure of the text using a dependency parser that extends the non-directional easy-first algorithm developed by Goldberg and Elhadad (2010) in order to support non-projective trees and is trained using my improved dependency tree conversion of the Penn Treebank. Second is a semantic annotation module that adds shallow semantic annotation for noun compounds, preposition senses, and possessives. Combined, these tools produce semantically-enriched parse trees that include both grammatical structure and shallow semantics. The core parser itself achieves state-of-the-art accuracy and can process over 75 sentences per second, which is substantially faster than most of the accurate parsers available today. In conclusion, this thesis work provides significant contributions to computational linguistics, both in terms of theory and resources. It advances our understanding of the relations expressed by three semantically-ambiguous linguistic phenomena, creates large annotated datasets useful for machine learning, and produces a fast, accurate, and informative system for semantically-enriched parsing.
|
| 17 Aug 2011 | Licheng Fang |
Structured Language Modelling for Machine Translation
Time: 2:00 pm - 2:30 pm Location: 4th Floor Large Conference Room [460] Abstract: Machine translation can potentially benefit from the guidance of a language model that evaluates translation candidates based on syntactic structures. In this talk we are going to describe the summer project to build such an incremental structured language model that can be used in machine translation systems that generate the target language in a left-to-right manner. We will describe in detail our work in modelling, search, and parameter smoothing.
|
| 05 Aug 2011 | Dave Uthus |
Overcoming Information Overload in Navy Chat
Time: 3:00 pm - 4:00 pm Location: 4th Floor Large Conference Room [460] Abstract: In this talk, I will describe the research we are undertaking at the Naval Research Laboratory which revolves around chat (such as Internet Relay Chat) and the problems it causes in the military domain. Chat has become a primary means for command and control communications in the US Navy. Unfortunately, its popularity has contributed to the classic problem of information overload. For example, Navy watchstanders monitor multiple chat rooms while simultaneously performing their other monitoring duties (e.g., tactical situation screens and radio communications). Some researchers have proposed how automated techniques can help to alleviate these problems, but very little research has addressed this problem. I will give an overview of the three primary tasks that are the current focus of our research. The first is urgency detection, which involves detecting important chat messages within a dynamic chat stream. The second is summarization, which involves summarizing chat conversations and temporally summarizing sets of chat messages. The third is human-subject studies, which involves simulating a watchstander environment and testing whether our urgency detection and summarization ideas, along with 3D-audio cueing, can aid a watchstander in conducting their duties.
Short Bio: David Uthus is a National Research Council Postdoctoral Fellow hosted at the Naval Research Laboratory, where he is currently undertaking research focusing on analyzing multiparticipant chat. He received his PhD (2010) and MSc (2006) from the University of Auckland in New Zealand and his BSc (2004) from the University of California, Davis. His research interests include microtext analysis, machine learning, metaheuristics, heuristic search, and sport scheduling.
|
| 15 Jul 2011 | Markus Dreyer (SDL Language Weaver) |
Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model (EMNLP 2011 practice talk)
Time: 3:30 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.
This is joint work with Jason Eisner, JHU.
|
| 15 Jul 2011 | Jonathan May (SDL Language Weaver) |
Tuning as Ranking (EMNLP 2011 practice talk)
Time: 3:00 pm - 3:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking. Unlike the popular MERT algorithm, our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer, PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO's scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios. |
| 07 Jul 2011 | Deniz Yuret (Koc University) |
The Noisy Channel Model for Unsupervised Word Sense Disambiguation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We introduce a generative probabilistic model, the noisy channel model, for unsupervised word sense disambiguation. In our model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular meaning S using a possibly ambiguous word W. To reconstruct the intended meaning, the hearer uses the distribution of possible meanings in the given context P(S|C) and possible words that can express each meaning P(W|S). We assume P(W|S) is independent of the context and estimate it using WordNet sense frequencies. The main problem of unsupervised WSD is estimating the context-dependent P(S|C) without access to any sense-tagged text. We show one way to solve this problem using a statistical language model based on large amounts of untagged text. Our model uses coarse-grained semantic classes for S internally and we explore the effect of using different levels of granularity on WSD performance. The system outputs fine-grained senses for evaluation, and its performance on noun disambiguation is better than most previously reported unsupervised systems and close to the best supervised systems.
Short Bio: Deniz Yuret is an assistant professor in Computer Engineering at Koc University in Istanbul. Previously he was at the MIT AI Lab and later co-founded Inquira, Inc. His research is on lexical semantics and unsupervised approaches to parsing and disambiguation. Currently he is one of the organizers of the SemEval3 semantic evaluation exercise, co-chair for the ACL 2011 semantics area, and an editor for the Computational Linguistics Journal.
|
| 28 Jun 2011 | Suzy Howlett (Macquarie University) |
Confidence in Syntax for Statistical Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Phrase-based statistical machine translation typically uses no syntactic information during translation, but while this information intuitively seems useful, including it has not necessarily helped translation performance. My PhD project is looking at this problem in the context of a syntactically-informed reordering preprocessing step prior to phrase-based translation. My work so far has shown that this preprocessing step does not necessarily improve performance when applied to every sentence; in my project I aim to develop a lattice-based system, armed with a number of syntax-based confidence features, that can choose on a sentence-by-sentence basis whether to use the reordering. In this presentation I will outline my progress so far, and welcome feedback and suggestions, particularly with respect to features to consider.
Short Bio:
Suzy Howlett is a PhD student at the Centre for Language Technology at Macquarie University, Australia, under the supervision of Mark Dras. She studied computer science and linguistics as an undergraduate at the University of Sydney, finishing in 2008 with an Honours year with James Curran, looking at automatically annotating additional training data for the C&C statistical CCG parser.
|
| 17 Jun 2011 | Sravana Reddy |
Unsupervised Discovery of Rhyme Schemes (ACL practice talk)
Time: 3:40 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We describe an unsupervised, language-independent model for finding rhyme schemes in poetry, using no prior knowledge about rhyme or pronunciation.
|
| 17 Jun 2011 | Xuchen Yao |
Nonparametric Bayesian Word Sense Induction (ACL practice talk)
Time: 3:00 pm - 3:40 pm Location: 11th Floor Large Conference Room [1135] Abstract: We propose the use of a nonparametric Bayesian model, the Hierarchical Dirichlet Process (HDP), for the task of Word Sense Induction. Results are shown through comparison against Latent Dirichlet Allocation (LDA), a parametric Bayesian model employed by Brody and Lapata (2009) for this task. We find that the two models achieve similar levels of induction quality, while the HDP confers the advantage of automatically inducing a variable number of senses per word, as compared to manually fixing the number of senses a priori, as in LDA. This flexibility allows for the model to adapt to terms with greater or lesser polysemy, when evidenced by corpus distributional statistics.
|
| 10 Jun 2011 | Cartic Ramakrishnan |
The Role of Information Extraction in the Design of a Document Triage Application for Biocuration
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Traditionally, automated triage of papers is performed using lexical (unigram, bigram, and sometimes trigram) features. This talk explores the use of information extraction (IE) techniques to create richer linguistic features than traditional bag-of-words models. Our classifier includes lexico-syntactic patterns and more-complex features that represent a pattern coupled with its extracted noun, represented both as a lexical term and as a semantic category. Our experimental results show that the IE-based features can improve performance over unigram and bigram features alone. We present intrinsic evaluation results of full-text document classification experiments to determine automatically whether a paper should be considered of interest to biologists at the Mouse Genome Informatics (MGI) system at the Jackson Laboratories. We also further discuss issues relating to design and deployment of our classifiers as an application to support scientific knowledge curation at MGI. |
| 27 May 2011 | Shu Cai |
Language-Independent Parsing with Empty Elements
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing. This method outperforms the best published method we are aware of on English and a recently published method on Chinese.
This is joint work with David Chiang and Yoav Goldberg.
|
| 06 May 2011 | Abe Kazemzadeh (USC) |
Natural Language Descriptions of Emotions
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This proposal seeks to explain how humans describe emotions using natural language. The focus of the proposal is on words and phrases that refer to emotions, rather than the more general phenomena of emotional language. The main problem I address is that if natural language descriptions of emotions refer to abstract concepts that are local to a particular human (or agent), then how do these concepts vary from person to person and how can shared meaning be established between people. The thesis of the proposal is that natural language emotion descriptions are definite descriptions that refer to theoretical objects, which provide a logical framework for dealing with this phenomenon in scientific experiments and engineering solutions. An experiment, Emotion Twenty Questions (EMO20Q), was devised to study the social natural language behavior of humans, who must use descriptions of emotions to play the familiar game of twenty questions when the unknown word is an emotion. The idea of a theory based on natural language propositions is developed and used to formalize the knowledge of a sign-using agent. Based on this pilot data, it was seen that approximately 25% of the emotion descriptions referred to emotions as objects with dimensional attributes, similarity, or subsethood. This motivated the author to use interval type-2 fuzzy sets as a computational model for the conceptual meaning of emotion descriptions. This model introduces a definition of a variable that ranges over emotions and allows for both inter- and intra-subject variability. A second experiment used interval surveys and translation tasks to assess this model. Finally, the author proposes the use of spectral graph theory to represent emotional knowledge as a network of proposition nodes that are connected to emotion nodes based on data from EMO20Q.
Short Bio: Abe Kazemzadeh is a PhD student at the USC Computer Science Dept and a research assistant at the Signal Analysis and Interpretation Laboratory (SAIL). His interests include natural language, logic, emotions, games, and algebra.
|
| 29 Apr 2011 | Marie-Catherine de Marneffe (Stanford University) |
Computational models of utterance meaning
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Much of the meaning conveyed in language use goes beyond the literal meaning of the words. Suppose someone asks whether I want to go for lunch, and I reply: "I had a very large breakfast". The utterance does not convey only what it literally means: my interlocutor is probably going to infer that I am not hungry and do not want to go for lunch now. Computational systems today understand at most the literal meaning of human language utterances. I aim at capturing aspects of utterance meaning, the kind of information that a reader will reliably extract from an utterance within text. The first part of the talk concentrates on interpreting answers to yes/no questions which do not straightforwardly convey a 'yes' or 'no' answer. I focus on questions involving scalar modifiers (Was it acceptable? It was unprecedented.) and numerical answers (Are you kids little? I have a 10 year-old and a 7 year-old.). I exploit the availability of large amounts of text to learn meanings from words and sentences in real context. I show that we can ground scalar modifier meaning based on large unstructured databases, and that such meanings can drive pragmatic inference. The second part of the talk targets veridicality -- whether a speaker intends to convey that the events described are actual, non-actual or uncertain -- which is central to language understanding, but little used in relation and event extraction systems. What do people infer from a sentence such as FBI agents alleged in court documents today that Zazi had admitted receiving weapons and explosives training from al Qaeda operatives? Did Zazi receive weapons and explosives training? I show that not only lexical semantic properties but context and world knowledge shape veridicality judgments. Since such judgments are not always categorical, I suggest they should be modeled as distributions, and propose a classifier to do so. The classifier features provide a nuanced picture of the diverse factors that affect veridicality.
Short Bio:
Marie-Catherine de Marneffe is a fifth-year PhD student in Linguistics at Stanford University. Prior to her doctoral studies, she visited the Stanford NLP research group for 2 years, working with Christopher D. Manning. She received her master's degree in Classical Languages in 2000 and a master's in Computer Science in 2002, both from the Université catholique de Louvain (Belgium). Her work in computational semantics focuses on detecting entailment and contradiction in texts, grounding meaning from large unstructured databases, and assessing the information status of events from a reader's perspective. She is also interested in language acquisition, studying how children acquire verb forms in French.
|
| 22 Apr 2011 | Dirk Hovy |
Models and Training for Unsupervised Preposition Sense Disambiguation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present a preliminary study on unsupervised preposition sense disambiguation (PSD), comparing different models and training techniques (EM, MAP-EM with L0 norm, Bayesian inference using Gibbs sampling). To our knowledge, this is the first attempt at unsupervised preposition sense disambiguation. Ultimately, we want to disambiguate prepositions not by and for themselves, but in the context of sequential semantic labeling. This should also improve disambiguation of the words linked by the prepositions (here, morning, shopped, and Rome). We propose using unsupervised methods in order to leverage unlabeled data, since, to our knowledge, there are no annotated data sets. Our best accuracy for PSD reaches 56%, a significant improvement (at p < .001) of 16% over the most-frequent-sense baseline.
This is joint work with Ashish Vaswani, Stephen Tratz, David Chiang, and Eduard Hovy.
|
| 15 Apr 2011 | Thomas Schoenemann |
Computing Viterbi Alignments via Integer Linear Programming
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This talk is about an optimization problem that was shown to be NP-hard: computing optimal alignments for the IBM-3 translation model. I will show that in practice it can be solved quite efficiently via Integer Linear Programming. In addition to using a standard solver, I will also show problem-specific preprocessing techniques: by deriving upper and lower bounds, a large number of variables can be removed from the start.
Short Bio: Thomas Schoenemann was born and grew up in Germany. He studied Computer Science at RWTH Aachen, Germany, where he got a diploma in 2005, having written his diploma thesis on the topic of confidence measures in machine translation in the group of Hermann Ney. Afterwards he went to the University of Bonn, Germany, to do his Ph.D. thesis in computer vision in the years 2006-2008. Up to a month ago he was a postdoc in the vision group at Lund University, Sweden, where he also resumed his work on translation. Currently he is taking time off to explore other fields and broaden his scope.
|
| 18 Mar 2011 | Sujith Ravi (PhD defense practice talk) |
Deciphering Natural Language
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Most state-of-the-art techniques used in natural language processing (NLP) are supervised and require labeled training data. For example, statistical language translation requires huge amounts of bilingual data for training translation systems. But such data does not exist for all language pairs and domains. Using human annotation to create new bilingual resources is not a scalable solution. This raises a key research challenge: How can we circumvent the problem of limited labeled resources for NLP applications? Interestingly, cryptanalysts and archaeologists have tackled similar challenges in solving "decipherment problems".
This thesis work aims to bring together techniques from classical cryptography, NLP and machine learning. We introduce a novel approach called "natural language decipherment" that can solve natural language problems without labeled (parallel) data. In this talk, we show how a wide variety of NLP problems can be formulated as decipherment tasks---for example, in statistical language translation one can view the foreign-language text as a cipher for English. Instead of relying on parallel training data, decipherment uses knowledge of the target language (e.g., English) and large quantities of readily available monolingual source (cipher) data to induce bilingual connections between the source and target languages. Using decipherment techniques, we make headway in attacking a hierarchy of problems ranging from letter substitution decipherment to sequence labeling problems (such as part-of-speech tagging) to language translation. Along the way, we make several key contributions---novel unsupervised algorithms that search for minimized models during decipherment and achieve state-of-the-art results on a number of important natural language tasks. Unlike conventional approaches, these decipherment methods can be easily extended to multiple domains and languages (especially resource-poor languages), thereby helping to spread the impact and benefits of NLP research.
|
| 11 Mar 2011 | Cosmin Adrian Bejan (ICT) |
Nonparametric Bayesian Models for Clustering Feature-Rich Linguistic Objects
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: In this talk, I will present how a new class of unsupervised, nonparametric Bayesian models can be effectively applied to solve real data applications that involve clustering feature-rich linguistic objects. First, I will describe a generalization of the hierarchical Dirichlet process model to account for additional properties associated with observable objects. Then, to overcome some of the limitations of this new model, I will describe a new hybrid model which combines an infinite latent class model with a discrete time series model. The main advantages of this hybrid model are the abilities to represent a potentially infinite number of features associated with observable objects and to perform an automatic selection of the most salient features. Furthermore, all the models described in this talk are designed to account for a potential number of categorical outcomes. The evaluation performed for solving both within- and cross-document event coreference shows significant improvements of the models when compared against three baselines for this task.
Short Bio: Cosmin Adrian Bejan is a postdoctoral researcher at the USC Institute for Creative Technologies, where he is currently working on applications that involve extraction and analysis of commonsense knowledge from large collections of text documents. His research interests include event semantics, semantic parsing, commonsense causal reasoning, unsupervised learning, and nonparametric Bayesian methods.
|
| 04 Mar 2011 | Steve DeNeefe (practice job talk) |
Tree Adjoining Machine Translation
Time: 4:30 pm - 5:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Tree adjoining grammars (TAGs) have greater linguistic expressiveness than the tree substitution grammars used in many natural language tasks, but are typically considered too complex or computationally expensive for practical systems. Many current statistical machine translation (MT) models use tree substitution to memorize sequences of words or constituents, specifying exactly what phrases to use or exactly what trees are grammatical. Adding the operation of tree adjoining provides the freedom to splice additional information into an existing grammatical tree. An adjoining translation model allows general, linguistically-motivated translation patterns to be learned without the clutter of endless variations of optional material. The appropriate modifiers, such as adjectives, adverbs, and prepositional phrases, can later be grafted in as needed to translate details. We show that the increased generalization power provided by adjoining, when used carefully, improves MT quality without becoming computationally intractable. In this talk, we describe challenges encountered by phrase-based and syntax-based MT systems today, and present an in-depth, quantitative comparison of both models. Then, we describe a novel model for statistical MT which addresses these challenges using a synchronous tree adjoining grammar. We introduce a method of converting these grammars to a weakly equivalent tree transducer for decoding. Then we present a method for learning the rules and associated probabilities of this grammar from aligned tree/string training data. Finally, our results show that adjoining delivers a consistent improvement over a baseline statistical syntax-based MT model on both medium and large-scale MT tasks using several language pairs.
|
| 03 Mar 2011 | Christopher Thomas (Wright State University) |
What Goes Around Comes Around -- Improving the State of Knowledge on the Web through On-Demand Model Creation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Information extraction is concerned with the retrieval of structured information from unstructured sources. Knowledge extraction/acquisition will need to go a step further by testing whether the extracted information is actually true. Since none of the extraction systems in current use can guarantee perfect precision, it is necessary to incorporate manual verification steps into the information extraction pipeline in order to use extracted facts in further reasoning. My talk will present a framework that adopts a cyclic approach to advancing the state of factual knowledge within a system, taking advantage of available formal/structured knowledge sources, information extraction and human/social computing to verify the extracted information. For the fact extraction part, the system uses LoD as training data, a domain hierarchy extractor to delineate domain boundaries and non-NLP surface-pattern-based open IE techniques to connect concepts within the hierarchy. To combat the low recall that most IE approaches face, the system deploys generalization techniques and pertinence computation to increase the number of patterns. Verification is done by means of information use, under the assumption that correct information will be utilized more often than incorrect information.
Bio:
Christopher Thomas is a PhD candidate in the Kno.e.sis Center at Wright State University. His research is in epistemological aspects of Computer Science and Artificial intelligence, namely knowledge extraction, representation, verification and dissemination. To build a coherent framework for this kind of systems epistemology, his publications span technical work on ontology design, ontology learning, information quality and information extraction as well as conceptual work on knowledge representation and social computing methods for knowledge verification.
|
| 17 Feb 2011 | Alan Ritter (University of Washington) |
Status Messages: A Unique Textual Source of Realtime and Social Information
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Recently there has been an explosion in the number of users posting short status messages on Social Media websites such as Facebook and Twitter. Although noisy and informal, this new style of text represents a valuable source of information not available elsewhere: it provides the most up-to-date information on current events, in addition to a massive publicly available corpus of naturally occurring human conversations. In this talk I will present ongoing work which explores both of these aspects. First, I will describe efforts towards Information Extraction from status messages. Because statuses can be posted quickly and are widely disseminated, they often provide the most up-to-date source of information on current events around the world and locally. This dynamically changing source of realtime information is already being processed using keyword extraction techniques; for example, the "trends" displayed on Twitter's website provide a list of phrases which are frequent in the current stream of messages. In order to move beyond a flat list of phrases, we have been investigating the feasibility of applying Information Extraction techniques to produce more structured representations of events. A key challenge is the noisy nature of this data; unlike newswire or biomedical text, status messages contain frequent misspellings and abbreviations, inconsistent capitalization, unique grammar, etc. To deal with these issues, we have been annotating a corpus of Twitter posts with POS tags and Named Entities, then using these annotations to train Twitter-specific NLP tools. As a demonstration of their utility, the resulting tools are combined to produce a calendar of popular events occurring in the future. In addition, I will discuss work which exploits a corpus of roughly 1.3 million naturally occurring conversations collected from Twitter for building models of human conversation.
Three data-driven approaches to generating responses to Twitter status posts are considered, based on either information retrieval or phrase-based statistical machine translation. Although there are many challenges to overcome in adapting phrase-based SMT to dialogue, we show that it is a promising approach to this problem. We compare these approaches in a human evaluation, using annotators from Amazon's Mechanical Turk service. Furthermore, we measure agreement between human evaluators and the BLEU automatic MT evaluation metric. As far as we are aware, this is the first work to investigate the application of phrase-based SMT to dialogue generation.
Short Bio: Alan Ritter is a graduate student at the University of Washington advised by Oren Etzioni. His research interests are in Information Extraction, Computational Lexical Semantics, and Language Processing in Social Media.
|
| 14 Feb 2011 | Hagen Fürstenau (Saarland University) |
Learning Structured Semantics under Weak Supervision
Time: 11:00 am - 12:00 pm Location: 4th Floor Large Conference Room [460] Abstract: In this talk I will present recent work on two topics: syntactically structured representations of word meaning in context and semi-supervised semantic role labeling. These will be presented as two instances of a general theme: acquiring structured meaning representations with little or no manual annotation. Vector space models have become a standard way of representing word meaning that can be learned in an unsupervised way. The problem of polysemy, however, has only recently been addressed within this framework. Several approaches to derive vector representations of words in specific sentential contexts have been proposed. I will present recent work on extending such contextualization operations to vector models incorporating rich syntactic structure, achieving significant improvements in context-dependent lexical substitution tasks. Going beyond the meaning of single words, I will then turn to work on semantic role labeling. Here, a key obstacle is the annotation effort required for the training of high-quality role labeling systems. I will present a semi-supervised approach to semantic role labeling, based on generalizing semantic annotations from manually labeled seed sentences to unlabeled sentences via structural alignments, yielding significant improvements in role labeling performance. I will conclude my talk with an outlook on how the search for adequate models of semantics may profit from formulation in task-specific ways. In particular, I will sketch some ideas on structured semantic models for statistical machine translation.
Bio: Hagen Fürstenau is a researcher at Saarland University, Germany. He received an M.Sc. in Mathematics from Bonn University and is about to finish his Ph.D. in Computational Linguistics. His research interests include data-driven methods in computational semantics and weakly supervised machine learning.
|
| 11 Feb 2011 | Hui Zhang |
Joint Word Alignment and Synchronous Grammar Induction
Time: 3:00 pm - 4:00 pm Location: 4th Floor Large Conference Room [460] Abstract: Synchronous grammars have been shown to be effective as models of translation, and the performance of such systems depends heavily on the quality of the grammar induced from the training data. The standard method for induction of synchronous grammars uses automatic word alignments to constrain possible derivations, which makes them prey to alignment errors. In this work, we propose a method for joint word alignment and grammar induction. Our experiments show that our method significantly outperforms the standard method, while reducing the size of the grammar by more than half. |
| 04 Feb 2011 | Stephan Gouws (Stellenbosch University) |
Measuring Conceptual Similarity by Spreading Activation over Wikipedia's Hyperlink Graph
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: The World Wide Web brought with it an unprecedented level of information overload. Computers are very effective at processing and clustering numerical and binary data; however, the conceptual clustering of natural-language data is considerably harder to automate. Many techniques rely on relatively simple keyword-matching or probabilistic methods to measure semantic relatedness between words and documents. However, these approaches do not always accurately capture conceptual relatedness as measured by humans. In this talk I'll briefly discuss a novel use of spreading activation (SA) techniques (primarily from cognitive science) for computing semantic relatedness between words and/or documents. This is done by modelling the article hyperlink structure of Wikipedia as an associative network structure for knowledge representation. The SA technique is adapted and several problems are addressed for it to function over the derived Wikipedia hyperlink graph. We evaluate these approaches over standard document similarity datasets and by user evaluation experiments, and achieve results which compare favourably with state-of-the-art methods.
By making use of the collaboratively-created resource Wikipedia, we hereby also overcome a significant problem in making use of spreading activation based techniques for information retrieval up to now, as noted by Crestani (1997): "The problem of building a network which effectively represents the useful relations [between concepts] has always been the critical point of many of the attempts to use SA in IR. These networks are very difficult to build, to maintain and keep up to date."
|
| 28 Jan 2011 | Markus Dreyer (SDL Language Weaver, formerly @ Johns Hopkins) |
A Non-Parametric Model for the Discovery of Inflectional Paradigms from Plain Text using Graphical Models over Strings
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Statistical natural language processing can be difficult for morphologically rich languages. The observed vocabularies of such languages are very large, since each word may have been inflected for morphological properties like person, number, gender, tense, or others. This unfortunately masks important generalizations, leads to problems with data sparseness and makes it hard to generate correctly inflected text. The presented dissertation work tackles the problem of inflectional morphology with a novel, unified statistical approach. We present a generative probability model that can be used to learn from plain text how the words of a language are inflected, given some minimal supervision. In other words, we discover the inflectional paradigms that are implicit, or hidden, in a large unannotated text corpus. This model consists of several components: a hierarchical Dirichlet process clusters word tokens of the corpus into lexemes and their inflections, and graphical models over strings -- a novel graphical-model variant -- model the interactions of multiple morphologically related type spellings, using weighted finite-state transducers as potential functions. We present the components of this model, from (1) weighted finite-state transducers parameterized as log-linear models, to (2) graphical models over multiple strings, to (3) the final Bayesian non-parametric model over a corpus, its lexemes, inflections, and paradigms. These three components of the model correspond to the combined use of (1) dynamic programming, (2) belief propagation and (3) MCMC for inference. We show experimental results for several tasks along the way, including a lemmatization task in multiple languages and, to demonstrate that parts of our model are applicable outside of morphology as well, a transliteration task. 
Finally, we show that learning from large unannotated text corpora under our non-parametric model significantly improves the quality of predicted word inflections.
|
| 14 Jan 2011 | Donald Metzler |
Relevance and Ranking in Online Dating Systems
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Match-making systems refer to systems where users want to meet other individuals to satisfy some underlying need. Examples of match-making systems include dating services, resume/job bulletin boards, community based question answering, and consumer-to-consumer marketplaces. One fundamental component of a match-making system is the retrieval and ranking of candidate matches for a given user. We present the first in-depth study of information retrieval approaches applied to match-making systems. Specifically, we focus on retrieval for a dating service. This domain offers several unique problems not found in traditional information retrieval tasks. These include two-sided relevance, very subjective relevance, extremely few relevant matches, and structured queries. We propose a machine learned ranking function that makes use of features extracted from the uniquely rich user profiles that consist of both structured and unstructured attributes. An extensive evaluation carried out using data gathered from a real online dating service shows the benefits of our proposed methodology with respect to traditional match-making baseline systems. Our analysis also provides deep insights into the aspects of match-making that are particularly important for producing highly relevant matches. |
| 15 Nov 2010 | Jason Riesa |
Structured Models for Bilingual Alignment (Ph.D. Proposal practice talk)
Time: 4:00 pm - 5:00 pm Location: 4th Floor Conference Room [460] Abstract: Bilingual alignment serves as an integral step and the foundation in the building of any state-of-the-art statistical machine translation system. It enables us to automatically learn and extract translation rules from hundreds of millions of words of bilingual text. Twenty years ago, the research area of machine translation was beginning to make use of the increasing availability and speed of computing resources demanded by the ideas of a previous generation, notably Weaver (1949). The IBM translation models -- statistical models for automatic word-to-word translation (Brown et al., 1990; Brown et al., 1993) -- spurred a flurry of new statistical and empirical research in this area. They have become ubiquitous in the field and are easy to train in an unsupervised fashion; Al-Onaizan et al. (1999) and Och and Ney (2003) have given us open-source toolkits for this purpose.
However, there are many problems that still exist. The work presented in this thesis proposal will eliminate many of the problems with alignment systems that have persisted for two decades, significantly improving machine translation quality and decidedly advancing the state-of-the-art. In achieving this goal, we develop new models of bilingual alignment and efficient search algorithms for working with such models.
|
| 12 Nov 2010 | Stephen Tratz |
Semantically-enriched Parsing for Natural Language Understanding (Ph.D. Proposal practice talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Natural language is riddled with many ambiguities, greatly complicating natural language processing tasks. Current parsers reconstruct the syntax of sentences without addressing the numerous ambiguities of language. This talk discusses a proposed solution for semantically-enriched parsing that consists of ontological resources, datasets, and tools that can be used to produce more informative parses of English sentences. The resulting parses consist not only of syntactic structure, but also semantic interpretations for noun compounds, preposition senses, and possessive constructions. |
| 07 Oct 2010 | Anselmo Penas |
Toward a Reading Machine
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Machine Reading (MR) aims at bridging the gap between texts and a formal representation that a reasoning system can use to make inferences about the text. In the MR Program (MRP), the target ontology is given and the inferences are oriented to answer queries about a set of textual documents. Traditionally, this setting is approached by Information Extraction engines that use annotated texts to learn the mapping between the text and the entity classes and relations of the target ontology. However, in the current MRP setting, almost no annotated data is given, and the systems are expected to adapt to a new domain in a very short time. This setting introduces the need to develop new architectures able to learn from previous readings (of unannotated texts) and to leverage as much as possible the small amount of annotated data. The talk will report the current development of a system with these features. |
| 05 Oct 2010 | Eduard Hovy |
Toward a Computational Theory of Semantic Content
Time: 4:00 pm - 5:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Semantics has been the object of deep study for many years. Yet representation of content -- the actual meaning of the symbols used in semantic propositions -- is curiously absent from most of this work. This talk argues that this is so because the most useful way of conceptualizing content is not in the form of symbols but as statistical word(sense) distributions, suitably organized. Over the past few years, NLP research has increasingly treated topic signature word distributions (also called 'context vectors', 'topic models', 'language models', etc.) as a de facto replacement for semantics at various levels of granularity. Whether the task is word sense disambiguation, certain forms of textual entailment, information extraction, paraphrase learning, and so on, it turns out to be very useful to consider a semantic unit as being defined by the distribution of word(senses) that regularly accompany it (in the classic words of Firth, "you shall know a word by the company it keeps"). This is true for semantic units of all sizes, from individual word(sense)s to sentences to text collections; the information learned and used by WSD engines closely resembles that learned by LDA and similar topic characterization engines. In this talk I argue for a new kind of semantics, which combines traditional symbolic logic-based proposition-style semantics of the kind used in older NLP with (computation-based) statistical word distribution information (what is being called Distributional Semantics in modern NLP). The core resource is a single lexico-semantic 'lexicon' that can be used for a variety of tasks provided it is reformulated appropriately. I show how to define such a lexicon, how to build and format it, and how to use it for various tasks.
The talk pulls together a wide range of related topics, including Pantel-style resources like DIRT, inferences / expectations such as those used in Schank-style expectation-based parsing and expectation-driven NLU, PropBank-style word valence lexical items, and the treatment of negation and modalities.
Combining the two views of semantics seems promising but opens many questions that need study, including the operation of logical operators such as negation and modalities over word(sense) distributions, the nature of ontological facets required to define concepts, and the action of compositionality over statistical concepts.
|
| 01 Oct 2010 | Liang Huang and Haitao Mi |
Efficient Incremental Decoding for Tree-to-String Translation (EMNLP 2010 Practice Talk)
Time: 3:30 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Syntax-based translation models should in principle be efficient with polynomially-sized search space, but in practice they are often embarrassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in average-case polynomial time in theory, and linear time with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++). |
| 01 Oct 2010 | Erica Greene |
Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation (EMNLP 2010 Practice Talk)
Time: 3:00 pm - 3:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: We employ statistical methods to analyze, generate, and translate rhythmic poetry. We first apply unsupervised learning to reveal word-stress patterns in a corpus of raw poetry. We then use these word-stress patterns, in addition to rhyme and discourse models, to generate English love poetry. Finally, we translate Italian poetry into English, choosing target realizations that conform to desired rhythmic patterns. |
| 27 Aug 2010 | Yoav Goldberg |
Intern Final Talk: Small is beautiful. Is it any good?
Time: 3:30 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This talk summarizes our experience with searching for small models for syntax-based machine translation. I will first present cases suggesting that smaller models are desirable, and present some evidence that minimizing model size is a reasonable objective function. I will then show cases where this objective may be too aggressive. |
| 27 Aug 2010 | Sasha Rush |
Intern Final Talk: Large-scale, High-dimensional, Discriminative Machine Translation
Time: 3:00 pm - 3:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: This talk summarizes my summer work on scaling a machine translation system to train on a large data set. Whereas similar systems are tuned with MERT on 1k sentences, we train a CRF on 100k sentences. I will discuss techniques for training, features, distributed scaling, regularization, and tuning, and give preliminary results. |
| 25 Aug 2010 | Sravana Reddy |
Intern Final Talk: Towards deciphering the Voynich manuscript
Time: 2:30 pm - 3:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: The Voynich manuscript is a medieval illustrated book written in an undeciphered script. I will present some questions and answers about the linguistic and statistical properties of the text. |
| 25 Aug 2010 | Anni Irvine |
Intern Final Talk: Making Discriminative Alignment Smarter
Time: 2:00 pm - 2:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Error analysis on grammars extracted for Machine Translation shows that bad and useless translation rules are usually caused by bad alignments. In this work, we improve previous work on hierarchical discriminative alignment by incorporating knowledge of foreign side parse trees, output from other aligners, and a look-ahead to grammar extraction. We give examples and results on Chinese to English translation. |
| 06 Aug 2010 | Sasha Rush (MIT) |
Dual Decomposition for Natural Language Inference
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This talk presents dual decomposition as a general technique for NLP. The first part introduces dual decomposition as a framework for deriving inference algorithms for NLP problems. The approach relies on standard dynamic-programming algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. The approach provably solves a linear programming (LP) relaxation of the global inference problem. It leads to algorithms that are simple, in that they use existing decoding algorithms; efficient, in that they avoid exact algorithms for the full model; and often exact, in that empirically they often recover the correct solution in spite of using an LP relaxation. The second part presents an application of dual decomposition to non-projective parsing. We focus on parsing algorithms for non-projective head automata, a generalization of head-automata models to non-projective structures. The dual decomposition algorithms are simple and efficient, relying on standard dynamic programming and minimum spanning tree algorithms. They provably solve an LP relaxation of the non-projective parsing problem. Empirically the LP relaxation is very often tight: for many languages, exact solutions are achieved on over 98% of test sentences. The accuracy of our models is higher than previous work on a broad range of datasets.
|
| 30 Jul 2010 | William Yang Wang (Columbia) |
Automatic Vandalism Detection in Wikipedia (COLING 2010 Practice Talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, as ill-intentioned edits can include a variety of content and be expressed in many different forms and styles. Previous studies are limited to rule-based methods and learning based on lexical features, lacking in deep linguistic analysis. In this talk, I will discuss a novel Web-based syntactic-semantic modeling method, which utilizes Web search results as a resource and trains topic-specific n-tag and syntactic n-gram language models to detect vandalism. By combining basic task-specific and lexical features, we have achieved high F-measures using logistic boosting and logistic model trees classifiers, surpassing the results reported by major Wikipedia vandalism detection systems. This is joint work with Prof. Kathleen McKeown at Columbia University and will appear in the oral session at COLING 2010. Bio: William Yang Wang is a graduate student at Columbia University, and he is currently visiting the NL Dialog Group at USC/ICT, working on phonetically aware natural language understanding and speech synthesis. In 2008-2009, he was with the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.
|
| 26 Jul 2010 | Hoifung Poon (University of Washington) |
Statistical Relational Learning for Knowledge Extraction from the Web
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Extracting knowledge from unstructured text has been a long-standing goal of NLP. The advent of the Web further increases its urgency by making available billions of online documents. To represent the acquired knowledge that is complex and heterogeneous, we need first-order logic. To handle the inherent uncertainty and ambiguity in extracting and reasoning with knowledge, we need probability. Combining the two has led to rapid progress in the emerging field of statistical relational learning. In this talk, I will show that statistical relational learning offers promising solutions for conquering the knowledge-extraction quest. I will present Markov logic, which is the leading unifying framework for representing and reasoning with complex and uncertain knowledge, and has spawned a number of successful applications for knowledge extraction from the Web. In particular, I will present OntoUSP, an end-to-end knowledge extraction system that can read text and answer questions. OntoUSP is completely unsupervised and benefits from jointly conducting ontology induction, population, and knowledge extraction. Experiments show that OntoUSP extracted five times as many correct answers as state-of-the-art systems, with a precision of 91%.
|
| 23 Jul 2010 | Yoav Goldberg (Ben Gurion), Sravana Reddy (Chicago), and Kevin Knight |
Three Mini-Talks on Creative Language
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Analyzing and generating creative language (stories, poems, jokes, etc) is a growing field within computational linguistics. We will give three short talks on the topic -- Yoav on Haiku generation, Sravana on understanding eggcorns, and Kevin on poetry translation. |
| 07 Jul 2010 | Kenji Sagae |
Dynamic Programming for Linear-time Incremental Parsing (ACL 2010 Practice Talk)
Time: 3:30 pm - 4:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Incremental parsing techniques such as shift-reduce have gained popularity thanks to their efficiency, but there remains a major problem: the search is greedy and only explores a tiny fraction of the whole space (even with beam search) as opposed to dynamic programming. We show that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging "equivalent" stacks based on feature values. Empirically, our algorithm yields up to a five-fold speedup over a state-of-the-art shift-reduce dependency parser with no loss in accuracy. Better search also leads to better learning, and our final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster. |
| 02 Jul 2010 | Zornitsa Kozareva |
Learning Arguments and Supertypes of Semantic Relations using Recursive Patterns (ACL 2010 Practice Talk)
Time: 3:30 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: A challenging problem in open information extraction and text mining is the learning of the selectional restrictions of semantic relations. We propose a minimally supervised bootstrapping algorithm that uses a single seed and a recursive lexico-syntactic pattern to learn the arguments and the supertypes of a diverse set of semantic relations from the Web. We evaluate the performance of our algorithm on multiple semantic relations expressed using "verb", "noun" and "verb prep" lexico-syntactic patterns. We conduct a human-based evaluation to assess the quality of the harvested information and find that the overall accuracy of our algorithm is 90%. We also compare our results with an existing knowledge base, outlining the similarities and differences in the granularity and diversity of the harvested knowledge.
|
| 02 Jul 2010 | Ashish Vaswani |
An MDL-Inspired Objective Function for Unsupervised Training of Generative Models (ACL 2010 Practice Talk)
Time: 3:00 pm - 3:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: The Minimum Description Length (MDL) principle is a method for model selection that trades off between the explanation of the data by the model and the complexity of the model itself. Inspired by the MDL principle, we develop an objective function for generative models that captures the description of the data by the model (log-likelihood) and the description of the model (model size). We also develop an efficient general search algorithm based on the MAP-EM framework to optimize this function. Since recent work has shown that minimizing the model size in a Hidden Markov Model for part-of-speech (POS) tagging leads to higher accuracies, we test our approach by applying it to this problem. The search algorithm involves a simple change to EM and achieves high POS tagging accuracies on both English and Italian data sets.
|
| 30 Jun 2010 | Jonathan May |
Efficient Inference Through Cascades of Weighted Tree Transducers (ACL 2010 Practice Talk)
Time: 4:00 pm - 4:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Weighted tree transducers have been proposed as useful formal models for representing syntactic natural language processing applications, but there has been little description of inference algorithms for these automata beyond formal foundations. We give a detailed description of algorithms for application of cascades of weighted tree transducers to weighted tree acceptors, connecting formal theory with actual practice. Additionally, we present novel on-the-fly variants of these algorithms, and compare their performance on a syntax machine translation cascade based on (Yamada and Knight, 2001).
|
| 30 Jun 2010 | Jason Riesa |
Hierarchical Search for Word Alignment (ACL 2010 Practice Talk)
Time: 3:30 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Essentially, we treat word alignment as a parsing problem, and induce a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of local and nonlocal features, trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system. |
| 11 Jun 2010 | Yoav Goldberg (Ben Gurion University of the Negev) |
Easy First Dependency Parsing and How Different Parsers Behave Differently
Time: 3:30 pm - 4:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: I will present a new kind of dependency parsing algorithm: easy-first, non-directional dependency parsing. This is a greedy, bottom-up parser, admitting an efficient O(n log n) implementation. Unlike shift-reduce based greedy parsers, it does not analyze the sentence in a fixed sequential order, but instead tries to make easier attachment decisions before harder ones. The parser performs well on both Hebrew and English. I also present evidence that the parser produces qualitatively different parses than either the Malt or the MST parsers. This observation gives rise to intriguing questions: Why do different parsers produce different parses? Can we quantify this kind of difference? In the second part of the talk I will present my attempts to answer these kinds of questions.
|
| 10 Jun 2010 | Mark Johnson (Macquarie University) |
"Bayesian models of language acquisition" or "Where do the rules come from?" (continued from 7 Jun 2010)
Time: 4:00 pm - 5:00 pm Location: 10th Floor Conference Room Abstract: This talk will be a continuation of topics from Monday's talk. |
| 09 Jun 2010 | Steven Bird (University of Melbourne) |
The Human Language Project: Building a Universal Corpus of the World's Languages
Time: 3:30 pm - 4:30 pm Location: 10th Floor Conference Room Abstract: We present a grand challenge to build a corpus that will include all of the world's languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world's linguistic heritage before more languages fall silent.
(This talk will present joint work with Steve Abney.)
Brief Bio: Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and also Senior Research Associate at the Linguistic Data Consortium. In 2009 he served as president of the Association for Computational Linguistics, and he completed a textbook on Natural Language Processing, published by O'Reilly. Steven studies scalable, semi-automatic methods for analyzing spoken and written language, and for preserving endangered languages. This involves a mixture of computational modelling and linguistic fieldwork. For further details and online publications, please visit http://stevenbird.me/
|
| 08 Jun 2010 | Reut Tsarfaty (Uppsala University) |
Morphology in Parsing: A Taxonomy-Based Approach
Time: 10:30 am - 11:30 am Location: 10th Floor Conference Room [1026] Abstract: It has been a prominent empirical fact in the last decade that languages which have properties that are different from those of English, for instance, languages with free word-order and rich morphological structure, do not lend themselves naturally to the application of statistical models developed for processing English. In this talk I focus on the parsing task and, based on the kind of correspondence patterns between form and function that characterize richly inflected languages, I aim to identify the properties of models that can successfully cope with parsing such structures. I start by demonstrating complex many-to-many correspondence patterns in Natural Language using data from the Semitic language Modern Hebrew. I review properties of prominent models for morphological analysis (Stump 2001), and isolate the ones that are appropriate for modeling such complex patterns. I then propose to apply the same strategy to the syntactic domain, arguing that this provides not only a streamlined interface to morphology, but also a better framework for capturing morphosyntactic interactions on the whole. I illustrate this approach via a particular instantiation, the relational-realizational model of (Tsarfaty 2010), applied to parsing Modern Hebrew. I report significant improvements on various measures over competing alternatives and previously reported results. I finally suggest that other modeling frameworks may often be enhanced to cope better with rich morphosyntactic phenomena, by similarly analyzing their underlying properties and enhancing their relational, or realizational, component, accordingly.
Speaker website:
http://stp.lingfil.uu.se/~tsarfaty/
|
| 07 Jun 2010 | Mark Johnson (Macquarie University) |
"Bayesian models of language acquisition" or "Where do the rules come from?"
Time: 2:00 pm - 3:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Each human language contains an astronomically large (if not unbounded) number of different sentences. How can something so large and complex possibly be learnt? Over the past decade and a half we've figured out how to define probability distributions over grammars and the linguistic structures they generate, opening up the possibility of Bayesian models of language acquisition. Bayesian approaches are particularly attractive because they can exploit "prior" (e.g., innate) knowledge as well as statistical generalizations from the input. This opens the possibility of an empirical evaluation of the utility of various kinds of innate knowledge. Structured statistical learners have two major advantages over other approaches. First, because the generalizations they learn and the prior knowledge they utilize are both expressed in terms of explicit linguistic representations, it is clear what is learnt and what information is exploited during learning. Second, because of the "curse of dimensionality", learners that identify and exploit structural properties of their input seem to be the only ones that have a chance of "scaling up" to learn real languages. This talk describes Bayesian methods for learning Context-Free Grammars and a generalization of them that we call Adaptor Grammars, and applies them to problems of morphological acquisition and word segmentation. Joint work with Tom Griffiths (Berkeley) and Sharon Goldwater (Edinburgh)
Speaker Bio: Mark Johnson is a Professor of Language Science (CORE) in the Department of Computing at Macquarie University. He was awarded a BSc (Hons) in 1979 from the University of Sydney, an MA in 1984 from the University of California, San Diego and a PhD in 1987 from Stanford University. He held a postdoctoral fellowship at MIT from 1987 until 1988, and has been a visiting researcher at the University of Stuttgart, the Xerox Research Centre in Grenoble, CSAIL at MIT and the Natural Language group at Microsoft Research. He has worked on a wide range of topics in computational linguistics, but his main research area is parsing and its applications to text and speech processing. He was President of the Association for Computational Linguistics in 2003, and was a professor from 1989 until 2009 in the Departments of Cognitive and Linguistic Sciences and Computer Science at Brown University.
Professor Johnson's research area is computational linguistics, i.e., explicit computational models of language acquisition, comprehension and production. His recent work has focused on probabilistic models for syntactic parsing (identifying the way words combine to form phrases and sentences) and semantic interpretation, and on Bayesian models of the acquisition of phonology, morphology and the lexicon.
|
| 21 May 2010 | Zornitsa Kozareva |
Not All Seeds Are Equal: Measuring the Quality of Text Mining Seeds
Time: 3:00 pm - 3:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Open-class semantic lexicon induction is of great interest for current knowledge harvesting algorithms. We propose a general framework that uses patterns in a bootstrapping fashion to learn open-class semantic lexicons for different kinds of relations. These patterns require seeds. To estimate the /goodness/ (the potential yield) of new seeds, we introduce a regression model that considers the connectivity behavior of the seed during bootstrapping. The generalized regression model is evaluated on six different kinds of relations with over 10000 different seeds for English and Spanish patterns. Our approach reaches robust performance of 90% correlation coefficient with 15% error rate for any of the patterns when predicting the /goodness/ of seeds.
|
| 19 May 2010 | Jinho D. Choi (University of Colorado) |
K-best, Transition-based Dependency Parsing using Robust Risk Minimization and Automatic Feature Reduction
Time: 3:30 pm - 4:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: In this paper, we introduce a way of improving the parsing accuracy of a transition-based dependency parsing model by using k-best ranking. Our approach uses a broader search space than beam search, yet keeps the parsing complexity near a quadratic average running time. In addition, we take a simple post-processing step to ensure the parsing output is a connected dependency tree. As an oracle, we use a high-performing but relatively under-explored machine learning algorithm, Robust Risk Minimization, which gives a higher parsing accuracy than the Perceptron algorithm in the experiments. We also use an automatic feature reduction technique that reduces the feature space by about 49% without compromising the parsing accuracy. We evaluate our approach on the CoNLL '09 shared task English data and improve the transition-based dependency parsing accuracy, showing a 0.64% higher accuracy than the best transition-based CoNLL '09 system. |
| 30 Apr 2010 | Walter Daelemans (University of Antwerp) |
Robust features for Computational Stylometry
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Computational stylometry is the automatic assignment of author properties (e.g., identity, gender, personality, region, age, period, ideology, ...) to a text. Applications range from forensic use to literary scholarship. The methodology currently most successful is based on the well-known approach to text categorization using training data in the form of texts with known classes. The approach works by extracting text features, selecting the best ones using statistical methods, representing the text as a vector of these features, and applying machine learning methods to the resulting vectors with associated classes. The main difference from the original text categorization approach is that the extracted text features may be complex and linguistically motivated (e.g. syntactic features). I will describe some recent applications at the University of Antwerp using this methodology: personality detection, author assignment with many authors and short texts, scribe detection in medieval texts, provenance and ideology detection in Kenyan news articles, etc. I will then focus on an empirical comparison of the robustness of different feature types in different situations. Bio: Walter Daelemans (PhD in Computational Linguistics, University of Leuven, 1987). Trained as a linguist and psycholinguist at the Universities of Antwerpen and Leuven, he specialised in computational linguistics and held research posts at the University of Nijmegen and the AI Lab of the University of Brussels before becoming a lecturer in Computational Linguistics and Artificial Intelligence at Tilburg University where he founded an early research group on machine learning of language (ILK). Since 1999 he has been a full-time professor at the University of Antwerp where he also heads the computational linguistics group within the CLiPS research centre.
His main research interests are in machine learning of language (especially memory-based learning), text analytics, and computational psycholinguistics. He co-founded ACL's Special Interest Group on Natural Language Learning (SIGNLL) and its associated conference and shared task series (CoNLL).
|
| 16 Apr 2010 | Rutu Mulkar-Mehta |
Understanding Granularity in Natural Language Discourse (Ph.D. Proposal practice talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Granularity is the task of breaking down a complex description into simpler concepts of finer detail, such that the simpler concepts collectively describe the main description. It can be thought of as a hierarchy of varying levels of information, with fine-grained, specific information (i.e., information with more detail) at lower levels, and coarse-grained, generic information (i.e., information with less detail) at higher levels. Shifting in granularity from lower to higher levels leads to information loss, or abstraction of certain fine details which become irrelevant at that level. Similarly, shifting granularity from a coarse level to a fine level introduces more specific details than are present at the level above. Humans can seamlessly shift between various granularity levels when interpreting discourse. Textual descriptions are usually written such that the reader gets to know the key features of fine-grained events, and then the overall picture from the coarse-grained description of a process. This thesis proposal works towards the identification and extraction of such structures from Natural Language Discourse. |
| 14 Apr 2010 | Jonathan May |
Weighted Tree Automata and Transducers for Syntactic Natural Language Processing (Ph.D. Defense practice talk)
Time: 4:00 pm - 5:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Weighted finite-state string transducer cascades are a powerful formalism for models of solutions to many natural language processing problems such as speech recognition, transliteration, and translation. Researchers often directly employ these formalisms to build their systems by using toolkits that provide fundamental algorithms for transducer cascade manipulation, combination, and inference. However, extant transducer toolkits are poorly suited to current research in NLP that makes use of syntax-rich models. More advanced toolkits, particularly those that allow the manipulation, combination, and inference of weighted extended top-down tree transducers, do not exist. In large part, this is because the analogous algorithms needed to perform these operations have not been defined. This thesis solves both these problems, by describing and developing algorithms, by producing an implementation of a functional weighted tree transducer toolkit that uses these algorithms, and by demonstrating the performance and utility of these algorithms in multiple empirical experiments on machine translation data. |
| 05 Apr 2010 | Satoshi Sekine (NYU) |
On-Demand Information Extraction and Knowledge Discovery
Time: 10:30 am - 11:30 am Location: 11th Floor Large Conference Room [1135] Abstract: At present, adapting an Information Extraction system to new topics is an expensive and slow process, requiring some knowledge engineering for each new topic. We propose a new paradigm of Information Extraction which operates 'on demand' in response to a user's query. On-demand Information Extraction (ODIE) aims to completely eliminate the customization effort. Given a user's query, the system will automatically create patterns to extract salient relations in the text of the topic, and build tables from the extracted information using paraphrase discovery technology. It relies on recent advances in pattern discovery, paraphrase discovery, and extended named entity tagging. I will show you a demo system, which produces a table in less than a minute for any given query. I will also explain the need for linguistic knowledge and introduce some weakly supervised learning methods. I will show a demo of the ngram search engine, which extracts ngrams and sentences that match a query with arbitrary wildcards. Also, I will give a brief introduction to the Web People Search. It covers two tasks: disambiguating search results for people's names, and extracting people's attributes. We organized WePS1 and 2, and have now started the third evaluation, which includes 2 tasks: 1) the combined task of people disambiguation and attribute extraction and 2) organization disambiguation from Twitter messages. Brief Bio: Satoshi Sekine is a Research Associate Professor at New York University. He received his MSc at UMIST, UK in 1992 and his PhD in 1998 at NYU. He has been working on various topics, including parsing, NE, Information Extraction and minimally supervised knowledge discovery. 
He edited a book about NE from John Benjamins, organized a JHU summer workshop 2009, WePS task, NSF symposium on Semantic Knowledge Discovery, Organization and Use in 2008, workshop on Textual Entailment and Parsing 2007 and so on.
|
| 02 Apr 2010 | Eduard Hovy |
Annotation
Time: 3:00 pm - 4:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: Despite a lot of recent attention, corpus annotation remains somewhat of an art. This talk is the main part of a tutorial intended to provide the attendee with an in-depth look at the procedures, issues, and problems in corpus annotation. After describing some currently available resources, services, and frameworks (including the QDAP annotation center, Amazon's Mechanical Turk, annotation facilities in GATE, and UIMA), it addresses the open questions, pitfalls, and problems that the annotation manager should avoid, highlighting the seven major issues at the heart of annotation for which there are as yet no standard and fully satisfactory answers or methods. For each of these it provides suggestions and a possibly helpful list of references. Your participation in order to critique the tutorial is appreciated!
|
| 31 Mar 2010 | Haitao Mi (ICT China) |
Lattice and Forest for SMT
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Statistical machine translation (SMT) has witnessed promising progress in recent years. Typically, an SMT system is characterized as a single-best pipeline, whose modules are independent of each other and only take as input single-best results from the previous module. With this assumption, each module will inevitably introduce errors in single-best outputs, which will propagate and accumulate along the pipeline, and eventually hurt the translation quality.
In order to alleviate this problem, we use compact structures such as lattices and forests instead of single-best results in each module, and then integrate both lattice and forest into a single tree-to-string system. We explore the algorithms of lattice parsing, lattice-forest-based rule extraction and decoding. Experiments show a statistically significant improvement over a state-of-the-art forest-based baseline. More interestingly, we observe a significant reduction in rule-set size when extracting with a lattice, which implies better generalizability (with a smaller model).
About the speaker: Haitao Mi is an Assistant Researcher in the Institute of Computing Technology, Chinese Academy of Sciences (CAS/ICT). He received his Ph.D. from CAS/ICT in 2009. His main research interests include syntax-based machine translation and statistical parsing. Additional information about him and his group can be found at http://nlp.ict.ac.cn/~mihaitao/
|
| 30 Mar 2010 | Victoria Fossum |
Integrating Parsing and Word Alignment in Syntax-Based Machine Translation (Ph.D. Defense practice talk)
Time: 4:00 pm - 5:00 pm Location: 11th Floor Conference Room [1135] Abstract: Training a string-to-tree syntax-based statistical machine translation system to translate from a source language (e.g. Chinese or Arabic) into a target language (e.g. English) requires the following resources: a parallel corpus (a large set of example sentences in the source language that have been translated into the target language by a human); a word alignment (a word-to-word correspondence between each source-target sentence pair); and a parse tree (a syntactic representation) of each sentence in the target language. From these training examples, the system learns to translate source-language sequences of words into target-language trees. In order to ensure broad coverage, the parallel corpus of training examples must be sufficiently large (on the order of millions of sentence pairs). Manually annotating such large corpora would be prohibitively time-consuming. Instead, these corpora must be word-aligned and parsed automatically. There are two problems with existing approaches to automatic word alignment and parsing for syntax-based machine translation. First, these processes are noisy and introduce errors which impact translation quality. Second, these processes are typically performed independently of one another. Since each process produces constraints that can be used to guide the other, by more closely integrating them, we can expect to improve the accuracy of each process. In this thesis, we address these two problems as follows: first, we improve upon the accuracy of a state-of-the-art parser; second, we use word alignments to improve parse accuracy; third, we use parses to improve word alignment accuracy; and fourth, we optimize parses and word alignments simultaneously. We examine the impact of each of these methods upon parse quality, alignment quality, and translation quality in a downstream syntax-based machine translation system. 
Our results demonstrate that more closely integrating word alignment and syntactic parsing can indeed improve the accuracy of each process, and in some cases leads to an improvement in translation quality relative to a state-of-the-art syntax-based statistical machine translation system.
|
| 26 Mar 2010 | Elsi Kaiser (USC) |
Discourse coherence effects in language processing: A psycholinguistic approach
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: In this talk I will discuss some recent results from my lab on the relationship between reference resolution and coherence relations. Previous work found that pronoun interpretation is guided by the coherence relations between clauses (e.g., 'as a result', 'and then', 'and similarly'), e.g. Hobbs (1979), Kehler et al. (2008). For example, consider "Phil tickled Stan, and similarly Liz poked him" (preference to interpret 'him' as Stan) and "Phil tickled Stan, and as a result Liz poked him" (more consideration of Phil as the antecedent of 'him'). However, the linguistic and cognitive properties of these coherence representations are not yet fully understood, and it is also not yet clear whether this kind of coherence sensitivity extends straightforwardly to other kinds of reduced referring expressions in addition to pronouns (e.g. anaphoric demonstratives, which can in many languages be used to refer to humans as well). I will discuss experiments -- conducted using a visual-world eye-tracking paradigm as well as other methods -- that investigate the nature and generality of these coherence representations. In addition to investigating whether coherence effects extend to other reduced referring expressions, I have also explored the domain-generality of coherence representations, for example whether non-linguistic, visuo-spatial input (video clips of moving shapes) can prime (bias) subsequent reference resolution in a seemingly unrelated task. Time permitting, I will also discuss issues related to data analysis and the annotation of data collected through psycholinguistic experiments. Brief bio:
Elsi Kaiser is an Assistant Professor of Linguistics at the University of
Southern California, with a specialization in Psycholinguistics. She
received her Ph.D. from the University of Pennsylvania in 2003, and was a
post-doc at the University of Rochester for two years before moving to
USC. Her current research focuses on the comprehension of various
referential forms (including pronouns, reflexives and demonstratives) in
different languages, which she investigates using a range of tools,
including eye-tracking.
|
| 05 Mar 2010 | Liang Huang |
Incremental Parsing
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: (a 20-minute version of this talk was given at the ISD retreat, with no technical details.) Parsing is the task of finding the most probable interpretation for a given sentence, and is a central problem in NLP because it serves as the basis of many downstream applications such as machine translation, summarization, paraphrasing, and question answering. Improving parsing efficiency and accuracy will greatly improve the applicability of those applications. However, unlike human parsing which is amazingly efficient by scanning the sentence incrementally, current state-of-the-art parsers are either extremely slow (standard algorithms like CKY scale cubically with sentence length), or purely greedy in the search algorithm that only touches a tiny fraction of the (exponentially) large search space. We instead propose a dynamic programming algorithm that does incremental parsing and ambiguity packing along the way, such that the running time is (almost) linear, and yet searches over exponentially many trees. Empirical results are very good, but further details withheld -- come to the talk!
This is joint work with Kenji Sagae, USC/ICT.
|
| 05 Feb 2010 | David Farwell (Universitat Politecnica de Catalunya) |
Knowledge Acquisition and Textual Entailment: a proposed research program
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: The aim of this presentation is to describe a program of research in the area of automatic knowledge acquisition which has been submitted in response to the European Information and Communication Technologies FP7 Call 5, Objective 4.3: Intelligent Information Management. The objective of this research program is to develop data-driven techniques and tools for extracting common sense knowledge from unstructured text and applying it for making the approximate inferences needed in order to interpret the ambiguities of human language communication. The central activities include developing techniques and tools for: - converting texts into representations of the particular events and entities they refer to, - identifying relations between these entity and event instances such as shared participants, temporal and spatial juxtapositions, causal connections, entailments, and so on, thereby constructing representations of complex scenarios, - inducing from sets of like entity, event and scenario instances, representations of entity, event and scenario types, - using these entity, event and scenario types as background knowledge to support approximate inferencing (e.g., statistical inference rules such as poisoning probably entails death) within important interactive tasks such as Information Retrieval and web search.
The technologies developed will be validated by applying them to two broad NLU tasks: faceted search for Information Retrieval in the domain of health information and open-domain web search for web browsing and UI improvements.
|
| 22 Jan 2010 | David Chiang |
Towards Tree-to-Tree Translation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Statistical translation models that try to capture the recursive structure of language have been widely adopted over the last few years. These models make use of varying amounts of information from linguistic theory: some use none at all, some use information about the grammar of the target language, some use information about the grammar of the source language. But progress has been slower on tree-to-tree translation models: models that are able to learn the relationship between the grammars of both the source and target language. I will discuss the reasons why tree-to-tree translation has been a challenge, review existing attempts at tree-to-tree models, and present some of our own work-in-progress on robustly modeling source and target language syntax for significant improvements in translation quality. |
| 15 Jan 2010 | Min-Yen Kan (National University of Singapore) |
ForeCite: towards a more integrated scholarly digital library
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Scholarly digital libraries (DLs) have managed to scale up to handle millions of documents and now feature tools to track citations and references between articles. However, users of digital libraries typically access the DL merely to check references or to download the PDF of the document. What features will the next-generation DL need to inspire scholars to use the digital library for more than accessing the document? In ForeCite, our digital library project at NUS, we believe part of the answer lies in integrating common end-user concerns: annotation, sharing, off-and-online usage and focusing on the intra-document processing. I will describe and demonstrate some of the preliminary components of the ForeCite system, including its web-based front end, ParsCit (a backend open-source citation segmentation system), ForeCiteNote (a TiddlyWiki-based research notetaking system), ForeCiteReader (a Google Books-like interface for annotation and collaboration on notetaking), and FireCite (a browser extension for recognizing citations on webpages). Speaker Bio:
Min-Yen Kan (BS;MS;PhD Columbia Univ.) is an associate professor at
the National University of Singapore. His research interests include
digital libraries and applied natural language processing. Specific
projects include work in the areas of scientific discourse analysis,
multiword expression extraction and understanding, machine translation
and applied text summarization. Currently, he is an associate editor
for "Information Retrieval" and is the Editor for the ACL Anthology,
the computational linguistics community's largest archive of published
research. More information about him and his group can be found at the
WING homepage: http://wing.comp.nus.edu.sg/
|
| 11 Dec 2009 | Anselmo Penas (UNED, Spain) |
Evaluating Question Answering Validation
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: During the last decade, Question Answering (QA) was redefined inside TREC as a kind of high-precision-oriented Information Retrieval task where the introduction of NLP was necessary, especially for Answer Extraction purposes. The same general approach was adopted at the Cross-Language Evaluation Forum (CLEF) in 2003, but for European languages other than English, and with some different settings and subtasks. The talk will report on the last 4-year cycle of the QA evaluation at CLEF, starting with the general methodology for long-term QA evaluation at CLEF and the motivation for the Answer Validation task, continuing with the development of AVE in the three-year campaign, and concluding with the goals, evaluation measure and results of the current QA evaluation setting after the AVE experience. |
| 09 Dec 2009 | Tomohide Shibata (Kyoto University) |
Introduction of Our Research (text analysis and IR)
Time: 3:30 pm - 4:30 pm Location: 11th Floor Large Conference Room [1135] Abstract: I am Tomohide Shibata, an assistant professor at Kyoto University, Japan. I am working with Prof. Kurohashi. I have been visiting Prof. Hovy for three weeks. In this talk, I will introduce our research. Our research roughly consists of three fields: basic text analysis, information retrieval and machine translation. Of these, I will cover basic text analysis and information retrieval, the two in which I am engaged. In basic text analysis, we have developed a Japanese morphological analyzer and parser, which are widely used in the research community. Case frames, which describe the relation between a verb and its case components, are automatically constructed from a large Web corpus. Synonym and is-a relations are automatically extracted from a dictionary and a Web corpus. In Information Retrieval, we are running a search engine infrastructure called TSUBAKI. The features of TSUBAKI are (i) the sentence structure (dependency relation) is considered in the document ranking, and (ii) the expression divergence between a query and a document is assimilated. We are also running a search result clustering system based on TSUBAKI.
|
| 04 Dec 2009 | Donald Metzler (Yahoo! Research) |
Learning Query Concept Importance Using a Weighted Dependence Model
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Modeling query concepts through term dependencies has been shown to have a significant positive effect on retrieval performance, especially for tasks such as Web search, where relevance at high ranks is particularly critical. Most previous work, however, treats all concepts as equally important, an assumption that often does not hold, especially for longer, more complex queries. In this talk, I will describe the state-of-the-art practices for modeling query term dependencies for information retrieval using Markov random fields. Within this context I will discuss why many NLP-inspired approaches to the problem, such as query segmentation, have failed to show consistent improvements when applied to information retrieval tasks. Experimental results carried out on a number of TREC and Yahoo! Web search test collections will be presented showing the effectiveness of various types of term (in)dependence models.
Brief bio:
Donald Metzler is a Research Scientist in the Search and Computational Advertising group at Yahoo! Research. He obtained his Ph.D. from the University of Massachusetts in 2007. He is an active member of the information retrieval and web search communities, having served on the program committees of SIGIR, ECIR, HLT, EMNLP, WSDM, WWW, and ICML. He has published over 35 research papers, has 13 patents pending, and is the co-author of Search Engines: Information Retrieval in Practice. His research interests include information retrieval, web search, computational advertising, and applications of machine learning to large-scale text problems.
|
| 20 Nov 2009 | Marco Pennacchiotti (Yahoo! Research) |
Entity Extraction via Ensemble Semantics
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: In this talk I will present Ensemble Semantics (ES), a new general framework for information extraction developed at Yahoo!, that combines multiple sources of information and extractors. The ES framework is based on the hypothesis that although distributional and pattern-based extraction algorithms are complementary, they do not exhaust the semantic space; other sources of evidence can be leveraged to better combine them. In this presentation, I will focus on a specific implementation of ES for the task of entity extraction. I will report experimental results showing large gains in performance, by combining state-of-the-art distributional and pattern-based systems with a large set of features from a document webcrawl, one year of query logs, and a snapshot of Wikipedia. I will also present an analysis of feature correlations and interactions showing the value of the different feature sets. I will conclude by discussing some issues that can affect the overall performance of entity extraction algorithms. |
| 23 Oct 2009 | Steve DeNeefe |
Tree Adjoining Machine Translation (thesis proposal practice talk)
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Tree Adjoining Grammars have well-known advantages but are typically considered too difficult for practical systems. We propose that, when done right, adjoining improves translation quality without becoming computationally intractable. Using adjoining to model optionality allows general translation patterns to be learned without the clutter of endless variations of optional material. The appropriate modifiers can later be spliced in as needed to translate details. In this proposal, we describe challenges encountered by phrase-based and syntax-based machine translation (MT) systems today, and present an in-depth, quantitative comparison of both models. Then, we describe a novel model for statistical MT which addresses these challenges using a Synchronous Tree Adjoining Grammar. We introduce a method of converting these grammars to a weakly equivalent tree transducer for decoding. And we present a method for learning the rules and associated probabilities of this grammar from aligned tree/string training data. Finally, our initial results show that adjoining already delivers an end-to-end improvement of +0.8 BLEU over a baseline statistical syntax-based MT model on a medium-scale Arabic/English MT task. Furthermore, we demonstrate it is a competitive entry in the Urdu-English track of the 2009 NIST MT evaluation. We then propose improvements to the model, decoding, and extraction that promise to allow this new, linguistically-motivated MT model to surpass its syntax-based and phrase-based cousins in a wide range of scenarios and language pairs.
|
| 21 Oct 2009 | Douglas W. Oard (Maryland) |
Who 'Dat? Identity resolution in large email collections
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: Automated techniques that can support the human activities of search and sense-making in large email collections are of increasing importance for a broad range of uses, including historical scholarship, law enforcement and intelligence applications, and lawyers involved in "e-discovery" incident to civil litigation. In this talk, I'll briefly describe some of the work to date on searching large email collections, and then for most of the talk I will focus on the more challenging task of support for sense-making. Specifically, I'll describe joint work with Tamer Elsayed to automatically resolve the identity of people who are mentioned ambiguously (e.g., just by first name) in a collection of email from a failed corporation (Enron). Our results indicate that for people who are well represented in the collection we can use a generative model to guess the right identity about 80% of the time, and for others we are right about half the time. I'll conclude the talk with a few remarks on our next directions for techniques, evaluation, and additional types of collections to which similar ideas might be applied. About the Speaker:
Douglas Oard is an Associate Professor at the University of Maryland,
College Park, with joint appointments in the College of Information
Studies and the Institute for Advanced Computer Studies; he is on
sabbatical at Berkeley's iSchool for the Fall 2009 semester. Dr. Oard
earned his Ph.D. in Electrical Engineering from the University of
Maryland, and his research interests center around the use of emerging
technologies to support information seeking by end users. His recent work
has focused on interactive techniques for cross-language information
retrieval and techniques for search and sense-making in conversational
media. Additional information is available at
http://www.glue.umd.edu/~oard/.
|
| 09 Oct 2009 | Nandakishore Kambhatla (IBM India) |
Extracting Social Networks and Biographical Facts from Conversational Speech Transcripts
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: We present a general framework for automatically extracting social networks and biographical facts from conversational speech. Our approach relies on fusing the output produced by multiple information extraction modules, including entity recognition and detection, relation detection, and event detection modules. We describe the specific features and algorithmic refinements effective for conversational speech. These cumulatively increase the performance of social network extraction from 0.06 to 0.30 for the development set, and from 0.06 to 0.28 for the test set, as measured by f-measure on the ties within a network. The same framework can be applied to other genres of text -- we have built an automatic biography generation system for general domain text using the same approach. -- Brief Bio: Nanda Kambhatla has nearly 17 years of research experience in the areas of Natural Language Processing (NLP), text mining, information extraction, dialog systems, and machine learning. He holds 6 U.S patents and has authored over 30 publications in books, journals, and conferences in these areas. Nanda holds a B.Tech in Computer Science and Engineering from the Institute of Technology, Benaras Hindu University, India, and a Ph.D in Computer Science and Engineering from the Oregon Graduate Institute of Science & Technology, Oregon, USA. Currently, Nanda is the manager of the Data Analytics Group at IBM's India Research Lab (IRL), Bangalore. The group is focused on research on machine translation, Natural Language Processing, text analysis and machine learning techniques for developing analytics solutions to help IBM's services divisions. Most recently, Nanda was the manager of the Statistical Text Analytics Group at IBM's T.J. 
Watson Research Center, the Watson co-chair of the Natural Language Processing PIC, and the task PI for the Language Exploitation Environment (LEE) subtask for the DARPA GALE project. He has been leading the development of information extraction tools/products and his team has achieved top-tier results in successive Automatic Content Extraction (ACE) evaluations conducted by NIST for extracting entities, events and relations from text from multiple sources, in multiple languages and genres. Earlier in his career, Nanda worked on natural language web-based and spoken dialog systems at IBM. Before joining IBM, he worked on information retrieval and filtering algorithms as a senior research scientist at WiseWire Corporation, Pittsburgh and on image compression algorithms while working as a postdoctoral fellow under Prof. Simon Haykin at McMaster University, Canada.
Nanda's research interests are focused on NLP and technology solutions for creating, storing, searching, and processing large volumes of unstructured data (text, audio, video, etc.) and specifically on applications of statistical learning algorithms to these tasks.
|
| 11 Sep 2009 | David Chiang |
Tutorial on HPC
Time: 3:00 pm - 4:00 pm Location: 11th Floor Large Conference Room [1135] Abstract: This tutorial will be a short introduction to using the Linux cluster at USC's High-Performance Computing (HPC) facility. Topics will include: (1) basics of starting jobs on the cluster using Torque/PBS, (2) dealing with common problems like jobs not starting or spontaneously dying, (3) maximizing the performance of your jobs (both yours and other people's), e.g., using the correct filesystem and tuning it for better speed, (4) embarrassingly parallel processing and poor-man's workflows. It will NOT cover Hadoop, MPI, or real workflow management tools like Condor.
|
| 28 Aug 2009 | Adam Pauls (UC Berkeley) Michael Auli (Edinburgh) |
Intern Final Talks
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Tree-to-String Alignment Models Machine translation systems typically rely on some form of alignment as a preprocessing step. Typically, these alignments take the form of word-to-word alignments. In this talk, we will introduce several models aimed at aligning foreign words to either English words or nodes in the English parse tree. Such word-to-node alignments offer several potential advantages over traditional word-to-word alignments. Firstly, since the extraction process for some syntactic systems explicitly considers the English trees, we expect that also considering the trees at alignment time will produce alignments that better suit the extraction process. Secondly, aligning foreign function words to English tree nodes can admit highly desirable syntactic transfer rules which cannot be directly expressed as word-to-word alignments. Finally, word-to-node alignments can effectively model many-to-one alignments. We present four models of increasing complexity and show preliminary results for each model.
|
| 27 Aug 2009 | Erica Greene (Haverford) Paramveer Dhillon (Penn) |
Intern Final Talks
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: TALK 1: Erica Greene Title: A Statistical Foray into Poetry Abstract: Although the analysis and generation of poetry is often considered an exclusively human task, we have taken some initial steps to automate the process. We build a series of finite state transducers to analyze poetic meter and train them on a handmade corpus of poetry. We then use these trained transducers to generate poetry. Specifically, we focus on generating sonnets and limericks. ------------------------------------------ TALK 2: Paramveer Dhillon Title: Learning to simplify target language for MT + Unsupervised log-linear models for Word Alignment Abstract: We consider the Machine Translation task for the Chinese-English language pair, where English is the target language. English contains many redundancies relative to Chinese: it has capitalization (the first word of each sentence is capitalized) and inflectional morphology (noun and verb endings), neither of which is present in Chinese. Because of these redundancies, we end up learning that a single Chinese word "tamen" translates to both "They" and "they", and that another Chinese word translates to "run", "runs" and "running". We present generative models which learn to "cluster" the target language vocabulary by removing these redundancies, namely capitalization and morphological variation. We show results on how this "clustering" affects the translation quality in end-to-end MT experiments.
In the last part of the talk, I will talk about using unsupervised
log-linear (discriminative) models for improving word alignments. There
are very few precedents for using discriminative models for word
alignment in totally unsupervised settings. (Taskar et al. '05) and
(Lacoste-Julien et al. '06) used maximum weight bipartite matching in a
"nearly" unsupervised setting and (Blunsom et al. '06) used CRFs for
supervised word alignment. We use log-linear models in totally
unsupervised settings to do word alignments. Specifically, we use
Contrastive Estimation (Smith et al. '05) to shift the probability
mass to the correct set of alignments from a well-chosen
"neighborhood" of those alignments. In the end I will show some
preliminary word alignment results using our approach.
|
| 26 Aug 2009 | Sujith Ravi |
Natural Language Decipherment: Solving Problems in Natural Language Processing without Labeled Data (Thesis Proposal practice talk)
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: A wide variety of problems in NLP require parallel data to train supervised models to perform different tasks. For example, in machine translation (where the task is to translate between two languages automatically) parallel data containing source/target language sentence pairs is required to train various models which can then be used to translate new sentences or documents. The dependency on parallel data for many of these NLP tasks limits their applications to specific domains, or language pairs for which a lot of training data is readily available. On the other hand, collecting parallel data for new domains, language pairs, etc. is a costly as well as time-intensive operation. For such tasks, the development of novel unsupervised approaches which require only non-parallel data for training can enable their application to new domains and potentially broaden the impact and benefits of NLP research to wider areas. A similar problem has been tackled by cryptographers and archaeologists in a different context, for "decipherment" purposes. During the 1940's and 1950's, mathematicians and scientists worked on code-breaking operations, which spurred the development of many research ideas for modern computer science. For such problems, it is unrealistic to assume the availability of parallel data relating the ciphertext and plaintext, yet cryptographers and archaeologists have attempted to solve such tasks using various decipherment techniques along with other non-parallel sources of information.
In this thesis proposal practice talk, I will show how we combine the two ideas (decipherment and unsupervised learning for NLP problems) together and present a unified decipherment-based approach for modeling a wide range of problems in NLP. Instead of relying on parallel data, I propose to use alternate sources of linguistic knowledge and large quantities of readily available monolingual data to induce strong bilingual connections in problems such as machine transliteration and translation. The talk will describe how various NLP problems such as unsupervised part-of-speech tagging, word alignment, transliteration, and machine translation can be formulated as decipherment tasks. I will present decipherment algorithms for tackling many of these problems and show that it is possible to achieve good results for many problems of interest in NLP without using any parallel data at all.
|
| 21 Aug 2009 | Liang Huang |
Bilingually-Constrained (Monolingual) Shift-Reduce Parsing
Time: 3:00 pm - 4:15 pm Location: 4th Floor Conference Room Abstract: Jointly parsing two languages has been shown to improve accuracies on either or both sides. However, its search space is much bigger than the monolingual case, forcing existing approaches to employ complicated modeling and crude approximations. Here we propose a much simpler alternative, bilingually-constrained monolingual parsing, where a source-language parser learns to exploit reorderings as additional observations, without bothering to build the target-side tree as well. We show specifically how to enhance a shift-reduce dependency parser to use alignment features to resolve shift-reduce conflicts. Experiments on the bilingual portion of Chinese Treebank show that, with just 3 bilingual features, we can improve parsing accuracies by 0.6% for both English and Chinese, with negligible (~6%) efficiency overhead, thus much faster than biparsing.
http://www.cis.upenn.edu/~lhuang3/biparsing.pdf
|
| 24 Jul 2009 | Adam Pauls (UC Berkeley) Ulf Hermjakob |
Practice talks for EMNLP
Time: 3:00 pm - 4:15 pm Location: 11 Large Abstract: K-Best A* Parsing (Adam Pauls) A* parsing makes 1-best search efficient by suppressing unlikely 1-best items. Existing k-best extraction methods can efficiently search for top derivations, but only after an exhaustive 1-best pass. We present a unified algorithm for k-best A* parsing which preserves the efficiency of k-best extraction while giving the speed-ups of A* methods. Our algorithm produces optimal k-best parses under the same conditions required for optimality in a 1-best A* parser. Empirically, optimal k-best lists can be extracted significantly faster than with other approaches, over a range of grammar types. ------------------------------------------ Improved Word Alignment with Statistics and Linguistic Heuristics (Ulf Hermjakob) We present a method to align words in a bitext that combines elements of a traditional statistical approach with linguistic knowledge. We demonstrate this approach for Arabic-English, using an alignment lexicon produced by a statistical word aligner, as well as linguistic resources ranging from an English parser to heuristic alignment rules for function words. These linguistic heuristics have been generalized from a development corpus of 100 parallel sentences. Our aligner, UALIGN, outperforms both the commonly used GIZA++ aligner and the state-of-the-art LEAF aligner on F-measure and produces superior scores in end-to-end statistical machine translation, +1.3 BLEU points over GIZA++, and +0.7 over LEAF.
|
| 23 Jul 2009 | Mark Hopkins (Language Weaver) |
Cube Pruning as Heuristic Search (Practice talk for EMNLP)
Time: 3:00 pm - 3:45 pm Location: 11 Large Abstract: Cube pruning is a fast inexact method for generating the items of a beam decoder. Here we show that cube pruning is essentially equivalent to A* search on a specific search space with specific heuristics. We use this insight to develop faster and exact variants of cube pruning.
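As a rough illustration of the grid exploration that cube pruning performs, the sketch below combines two cost-sorted lists best-first with a priority queue. It is not the A* formulation from the talk; the function name `cube_prune` and the `interaction` parameter (a stand-in for a non-additive score such as a language model) are assumptions made for this example. With the default purely additive cost the enumeration is exact; with a non-monotonic `interaction` it becomes the usual inexact heuristic.

```python
import heapq

def cube_prune(left, right, k, interaction=lambda a, b: 0.0):
    """Return up to k combinations of two cost-sorted lists, explored best-first.

    left, right: lists of (cost, label), sorted by increasing cost.
    interaction: hypothetical extra cost for pairing two labels
                 (assumption: stands in for e.g. a language-model score).
    """
    if not left or not right:
        return []

    def cost(i, j):
        return left[i][0] + right[j][0] + interaction(left[i][1], right[j][1])

    heap = [(cost(0, 0), 0, 0)]   # start at the corner of the grid
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        c, i, j = heapq.heappop(heap)
        out.append((c, left[i][1], right[j][1]))
        # push the two grid neighbours of the item just popped
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (cost(ni, nj), ni, nj))
    return out
```

Only the neighbours of already-popped cells are ever scored, so the number of cost evaluations grows with k rather than with the full size of the grid.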
|
| 17 Jul 2009 | Paramveer Dhillon (Penn) |
Transfer Learning for WSD & Non-local constraints for Named Entity Recognition
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk will be divided into two parts. In the first part I will talk about using Transfer Learning techniques to improve the task of Word Sense Disambiguation (WSD). Usually in supervised WSD, we suffer from a paucity of labeled data, as some words occur infrequently in the data and it's very difficult to get enough labeled data for these words. In such cases it is very difficult to build high accuracy supervised learning models for these words. So, we propose an approach called TransFeat (based on the MDL principle) which "transfers information" from similar words in the form of a feature relevance prior to get improved accuracies on these rare words. Besides this, our experiments show that we also get a decent improvement in accuracy for words that have more labeled data available. TransFeat gives accuracies that are in the worst case comparable to state-of-the-art on ONTONOTES and SENSEVAL-2 datasets.
In the second part of the talk I will talk about incorporating
non-local constraints in Named Entity Recognition (NER) systems. The
main idea is that some linguistic constraints (e.g. every occurrence
of the word "Einstein" in the data should have the tag PER,
i.e. person) are very useful and can give improved performance, but
they are non-local and hence intractable: they cannot be
efficiently modeled using state-of-the-art sequence modeling methods
like CRFs. People have used Skip-chain CRFs (with Loopy
BP) (Sutton and McCallum '04) and Gibbs Sampling (Finkel and Manning
'05) to enforce these non-local constraints, but these turn out to be
really inefficient and custom-tailored to one particular kind of
constraint, such as consistency constraints of the type mentioned
above. We propose a constrained version of EM in which a general set
of constraints (not limited to consistency constraints!) can be
incorporated into the model. In the end I will show some results of
this approach on CoNLL 03 English and CoNLL 02 Spanish NER shared tasks.
|
| 16 Jul 2009 | Yang Liu (ICT China) |
Weighted Alignment Matrices for Statistical Machine Translation
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: Current statistical machine translation systems usually extract rules from bilingual corpora annotated with 1-best alignments. They are prone to learn noisy rules due to alignment mistakes. We propose a new structure called weighted alignment matrix to encode all possible alignments for a parallel text compactly. The key idea is to assign a probability to each word pair to indicate how well they are aligned. We design new algorithms for extracting phrase pairs from weighted alignment matrices and estimating their probabilities. Our experiments on multiple language pairs show that using weighted matrices achieves consistent improvements over using n-best lists in significantly less extraction time. About the speaker: Yang Liu is an Assistant Researcher at Institute of Computing Technology (ICT), Chinese Academy of Sciences. He received his PhD degree in Computer Science from ICT in 2007. His major research interests include statistical machine translation and Chinese information processing. He has been working on syntax-based modeling, word alignment, and system combination. His paper on tree-to-string translation won the Meritorious Asian NLP Paper Award of COLING/ACL 2006. He has served as a reviewer for TALIP, TSLP, JNLE, ACL, EMNLP, AMTA, and SSST.
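To make the idea of a per-word-pair probability from the abstract above concrete, here is a minimal sketch. The abstract does not say how the matrix is estimated; this example assumes, purely for illustration, that link probabilities are obtained by normalizing the scores of an n-best list of alignments. The function name `weighted_alignment_matrix` and the input format are hypothetical.

```python
from collections import defaultdict

def weighted_alignment_matrix(nbest):
    """Estimate a probability for each (source index, target index) link.

    nbest: list of (score, links) pairs, where score > 0 and links is a set
           of (src_idx, tgt_idx) tuples. Assumption: scores are normalized
           over the n-best list to act as alignment probabilities.
    """
    total = sum(score for score, _ in nbest)
    matrix = defaultdict(float)
    for score, links in nbest:
        for link in links:
            # a link's weight is the probability mass of alignments containing it
            matrix[link] += score / total
    return dict(matrix)
```

Links shared by every candidate alignment receive weight 1.0, while links that appear only in low-scoring candidates receive small weights instead of being discarded outright.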
|
| 15 Jul 2009 | Yang Liu (ICT China) |
An Overview of Tree-to-String Translation Models
Time: 4:00 pm - 5:00 pm Location: 11 Large Abstract: Recent research on statistical machine translation has led to the rapid development of syntax-based translation models, which exploit syntactic information to direct translation. In this talk, I will give an overview of tree-to-string translation models, one of the state-of-the-art syntax-based models. In a tree-to-string model, the source side is a phrase structure parse tree and the target side is a string. This talk includes the following topics: (1) tree-based tree-to-string model, (2) tree-sequence based tree-to-string model, (3) forest-based tree-to-string model, and (4) context-aware tree-to-string model. Experimental results show that the forest-based tree-to-string system outperforms Hiero significantly on Chinese-to-English translation. About the speaker: Yang Liu is an Assistant Researcher at Institute of Computing Technology (ICT), Chinese Academy of Sciences. He received his PhD degree in Computer Science from ICT in 2007. His major research interests include statistical machine translation and Chinese information processing. He has been working on syntax-based modeling, word alignment, and system combination. His paper on tree-to-string translation won the Meritorious Asian NLP Paper Award of COLING/ACL 2006. He has served as a reviewer for TALIP, TSLP, JNLE, ACL, EMNLP, AMTA, and SSST.
|
| 10 Jul 2009 | Kevin Knight |
Excerpts from ACL-09 Tutorial on "Topics in Machine Translation"
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Philipp Koehn and I will do a machine translation tutorial at ACL. Instead of an introductory tutorial, we'll do short 15-minute segments on various hot topics in MT research. For the ISI NL seminar, I'll present 3 or 4 of those topics, determined by audience vote. |
| 26 Jun 2009 | Steve DeNeefe |
Synchronous Tree Adjoining Machine Translation (Practice talk for EMNLP)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Tree Adjoining Grammars have well-known advantages, but are typically considered too difficult for practical systems. We demonstrate that, when done right, adjoining improves translation quality without becoming computationally intractable. Using adjoining to model optionality allows general translation patterns to be learned without the clutter of endless variations of optional material, with extra information spliced in as needed.
In this paper, we describe a novel method for learning a type of
Synchronous Tree Adjoining Grammar and associated probabilities from
aligned tree/string training data. We introduce a method of
converting these grammars to a weakly equivalent tree transducer for
efficient decoding. Finally, we show that adjoining results in an
end-to-end improvement of +0.8 BLEU over a baseline statistical
syntax-based MT model on a large-scale Arabic/English MT task.
|
| 19 Jun 2009 | Adam Pauls (UC Berkeley) |
Hierarchical Search for Parsing (and Machine Translation)
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Both coarse-to-fine and A* parsing use simple grammars to guide search in complex ones. We compare the two approaches in a common, agenda-based framework, demonstrating the tradeoffs and relative strengths of each method. Overall, coarse-to-fine is much faster for moderate levels of search errors, but below a certain threshold A* is superior. In addition, we present the first experiments on hierarchical A* parsing, in which computation of heuristics is itself guided by meta-heuristics. Multi-level hierarchies are helpful in both approaches, but are more effective in the coarse-to-fine case because of accumulated slack in A* heuristics. |
| 29 May 2009 | Marta Recasens Potau (Universitat de Barcelona) |
Learning-based Coreference Resolution for Spanish and Catalan
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The task of coreference resolution identifies those expressions in a text that point to the same discourse entity. Natural language applications such as information extraction, question answering and machine translation can greatly benefit from its output (the different pieces of information in connection with the same entity are linked, pronouns are disambiguated, etc.). The task is extremely complex since a number of knowledge sources come into play, from morphology to discourse structure and world knowledge. In this talk I present the results of my PhD research up to now, including the development of two 400k-word corpora for Spanish and Catalan (AnCora) annotated at various levels (morphology, syntax, semantics, pragmatics), a 100k-word corpus for English, and a series of experiments towards building a learning-based coreference resolution system. More specifically, I'll discuss issues concerning the definition of the annotation scheme, the selection of features for machine learning, the effect of sample selection, and I'll introduce CISTELL, the new learning-approach we propose for coreference resolution. |
| 22 May 2009 | Victoria Fossum Dirk Hovy |
Practice talks for NAACL HLT
Time: 3:00 pm - 4:00 pm Location: 11th flr CR Abstract: Combining Constituent Parsers (Victoria Fossum: 3:00pm -- 3:30pm) Combining the 1-best output of multiple parsers via parse selection or parse hybridization improves f-score over the best individual parser (Henderson and Brill, 1999; Sagae and Lavie, 2006). We propose three ways to improve upon existing methods for parser combination. --------------------------------------------------------- Disambiguation of Preposition Sense Using Linguistically Motivated Features (Dirk Hovy: 3:30pm -- 4:00pm)
Classifying polysemous words into their proper sense classes is
potentially useful to any NLP application that needs to extract
information from text or build a semantic representation of the
textual information. Like instances of other word classes, many
prepositions are ambiguous, carrying different semantic meanings
(including notions of instrumental, accompaniment, location, etc.)
In this paper, we present a supervised classification approach for
disambiguation of preposition senses. We use the SemEval 2007
Preposition Sense Disambiguation datasets to evaluate our system and
compare its results to those of the systems participating in the
workshop. We derived linguistically motivated features from both sides
of the preposition. Instead of restricting these to a fixed window
size, we utilized the phrase structure. Testing with five different
classifiers, we can report an increased accuracy (76.4%) that
outperforms the best system in the SemEval task.
|
| 15 May 2009 | David Chiang |
Practice talks for NAACL HLT
Time: 3:00 pm - 4:00 pm Location: 4th flr CR Abstract: 11,001 New Features for Statistical Machine Translation (David Chiang) - Winner of Best Paper Award at NAACL/HLT 2009
We use the Margin Infused Relaxed Algorithm of Crammer et al. to add a
large number of new features to two machine translation systems: the
Hiero hierarchical phrase-based translation system and our
syntax-based translation system. On a large-scale Chinese-English
translation task, we obtain statistically significant improvements of
+1.5 BLEU and +1.1 BLEU, respectively. We analyze the impact of the new features and the performance of the learning algorithm.
|
| 14 May 2009 | Sujith Ravi |
Practice talks for NAACL HLT
Time: 3:00 pm - 4:00 pm Location: 4th flr CR Abstract: Talk-1: Learning Phoneme Mappings for Transliteration without Parallel Data We present a method for performing machine transliteration without any parallel resources. We frame the transliteration task as a decipherment problem and show that it is possible to learn cross-language phoneme mapping tables using only monolingual resources. We compare various methods and evaluate their accuracies on a standard name transliteration task. This is joint work with Kevin Knight. ---------------------------------------------------- Talk-2: A New Objective Function for Word Alignment We develop a new objective function for word alignment that measures the size of the bilingual dictionary induced by an alignment. A word alignment that results in a small dictionary is preferred over one that results in a large dictionary. In order to search for the alignment that minimizes this objective, we cast the problem as one of integer linear programming. We then extend our objective function to align corpora at the sub-word level, which we demonstrate on a small Turkish-English corpus. This is joint work with Tugba Bodrumlu and Kevin Knight.
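The objective in Talk-2 can be illustrated with a small scoring function. This sketch only evaluates the dictionary size induced by a given alignment; it does not perform the integer linear programming search described in the abstract. The function name `dictionary_size` and the data layout are assumptions made for this example.

```python
def dictionary_size(bitext, alignments):
    """Count distinct (source word, target word) pairs induced by an alignment.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    alignments: one set of (src_idx, tgt_idx) links per sentence pair.
    """
    entries = set()
    for (src, tgt), links in zip(bitext, alignments):
        for i, j in links:
            entries.add((src[i], tgt[j]))
    return len(entries)

bitext = [
    (["la", "casa"], ["the", "house"]),
    (["la", "mesa"], ["the", "table"]),
]
# Links "la" to "the" in both sentences, so the dictionary stays small.
consistent = [{(0, 0), (1, 1)}, {(0, 0), (1, 1)}]
# Crosses the links in the second sentence, adding extra dictionary entries.
inconsistent = [{(0, 0), (1, 1)}, {(0, 1), (1, 0)}]
```

Under this objective the `consistent` alignment is preferred, because reusing the same word pair across sentences keeps the induced dictionary smaller.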
|
| 08 May 2009 | Andrew Kehler (UCSD) |
Coherence and the (Psycho-) Linguistics of Pronoun Interpretation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: More than three decades of research has sought to uncover the principles that determine how hearers interpret pronouns in context. This work has focused predominantly on identifying so-called 'preferences' or 'heuristics' that hearers utilize based on linguistic properties of antecedent expressions. This focus is a departure from the type of approach outlined in Hobbs (1979), which argues that pronoun interpretation is driven predominantly by semantics, world knowledge, and inference, with particular reference to how these are used to establish the coherence of discourses. In this talk, I report on new experimental evidence in support of a coherence-driven analysis, and describe how the analysis can accommodate a range of previous findings suggestive of conflicting preferences and biases. Case studies of four commonly-cited preferences are described, specifically (i) the parallel grammatical role preference (e.g., Smyth 1994), (ii) thematic role preferences (e.g., Stevenson et al. 1994), (iii) implicit causality biases (e.g., Caramazza et al. 1977), and (iv) the subject assignment strategy (e.g., Crawley et al. 1990). In each case, the experimental results offer an explanation of what the underlying source of the bias is, and predict in what contexts evidence for it will surface. These results suggest that pronoun interpretation is incrementally influenced in part by the probabilistic expectations that hearers have about how the discourse will be coherently continued. They are also argued to leave various myths by the roadside, e.g., that pronoun interpretation can be profitably thought of as a 'search and match' procedure, and that coherence relations need not be controlled for in experimental stimuli. This talk includes joint work with Laura Kertz, Hannah Rohde, and Jeffrey Elman.
|
| 17 Apr 2009 | Rahul Bhagat |
Learning Paraphrases from Text (Ph.D. Defense practice talk)
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Paraphrases are textual expressions that convey the same meaning using different surface forms. Capturing the variability of language, they play an important role in many natural language applications including question answering, machine translation, and multi-document summarization. In linguistics, paraphrases are characterized by approximate conceptual equivalence. Since no automated semantic interpretation systems available today can identify conceptual equivalence, paraphrases are difficult to acquire without human effort. The aim of this thesis is to develop methods for automatically acquiring and filtering phrase-level paraphrases using a monolingual corpus. Noting that the real world uses far more quasi-paraphrases than the logically equivalent ones, we first present a general typology of quasi-paraphrases together with their relative frequencies, which is, to our knowledge, the first such typology. We then present a method for automatically learning the contexts in which quasi-paraphrases obtained from a corpus are mutually replaceable. Knowing that quasi-paraphrases are often inexact because they contain semantic implications which can be directional, we present an algorithm called LEDIR to learn the directionality of quasi-paraphrases. Since semantic classes play a crucial role in our work, we also investigate the use of a semi-supervised clustering algorithm for learning semantic classes. We next investigate the task of learning surface paraphrases, i.e., paraphrases that do not require the use of any syntactic interpretation. Since one would need a very large corpus to find enough surface variations, we use a very large but unprocessed corpus of 150GB (25 billion words) obtained from Google News to do this learning. We show that these paraphrases can be used to learn surface patterns for relation extraction. Finally, we use paraphrases to learn patterns for domain-specific information extraction.
Thus, in this thesis we define quasi-paraphrases, present methods to learn them from a corpus, and show that quasi-paraphrases are useful for information extraction.
|
| 27 Mar 2009 | David Chiang |
Tutorial on Hadoop
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Hadoop is an open-source implementation of the Map/Reduce framework introduced by Google Research. It is a simple abstraction for describing parallelizable algorithms that admits very efficient execution: in one case, one of my (poorly implemented) algorithms was improved from a typical runtime of 72 hours to 3 hours. I will give a short introduction to Hadoop that is highly colored by my experiences with it and the likely experiences of other natural language processing researchers at ISI. I will show how to run Hadoop on HPC, how to use Hadoop Streaming (which allows implementation in any language you choose), and how to define Map/Reduce algorithms for a few incarnations of a typical NLP task, relative-frequency estimation of a large probability distribution. Input from others who are more experienced with Hadoop than I am is welcome! |
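The relative-frequency estimation task mentioned in the abstract above can be sketched as a mapper and a reducer in the style of Hadoop Streaming, where records are tab-separated text lines and the reducer sees its input sorted by key. This is an illustrative sketch, not the code from the tutorial: the choice of bigrams as the distribution, the function names, and the `run_local` helper (which imitates the shuffle step with an in-memory sort) are all assumptions.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'context<TAB>word<TAB>1' record per bigram in the input."""
    for line in lines:
        tokens = line.split()
        for prev, word in zip(tokens, tokens[1:]):
            yield f"{prev}\t{word}\t1"

def reducer(sorted_records):
    """Consume records sorted by context; emit p(word | context) estimates."""
    parsed = (rec.split("\t") for rec in sorted_records)
    for context, group in groupby(parsed, key=lambda fields: fields[0]):
        counts = {}
        for _, word, n in group:
            counts[word] = counts.get(word, 0) + int(n)
        total = sum(counts.values())
        for word in sorted(counts):
            yield f"{context}\t{word}\t{counts[word] / total}"

def run_local(lines):
    """Stand-in for the shuffle phase: sort mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

Note that this reducer keeps all counts for a single context in memory at once, which is fine for a sketch but may need rethinking for very frequent contexts in a large corpus.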
| 19 Mar 2009 | Rutu Mulkar |
Discovering Causal and Temporal Relations in Biomedical Texts (practice talk for AAAI Spring Symposium)
Time: 2:00 pm - 2:30 pm Location: 4th floor CR Abstract: In previous work on "Learning by Reading" we successfully extracted entities, states and events from technical natural language descriptions of processes. The research described here is aimed at the automatic discovery of causal and temporal ordering relations among states and events, specifically, among molecular and other events in biomedical articles. We have annotated causal and temporal relations in articles on the cell cycle, and we discuss our annotation guidelines and the issue of inter-annotator agreement. We then describe the natural language parsing and the inference system we use to extract these relations. We have created axioms manually for this system, focusing on temporal, causal and aspectual information and we have used semi-automatic means to augment these axioms. We have evaluated the performance of this system, and the results are promising. |
| 06 Mar 2009 | Andreas Maletti |
Minimizing Deterministic Weighted Tree Automata
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Weighted tree automata are equivalent to weighted tree grammars, which can be used, for example, to easily model weighted context-free grammars. In contrast to context-free grammars, tree automata work directly on a tree representation and not on strings. We will introduce weighted tree automata and review the important results on minimization of them. For example, it is known that deterministic devices over commutative semifields (commutative semirings with multiplicative inverses) can be effectively minimized. In the main part of the talk, we present the first efficient algorithm for this minimization. If the operations can be performed in constant time, then our algorithm constructs an equivalent minimal (with respect to the number of states) deterministic automaton in time linear in the maximal rank of the input symbols, the number of (useful) transitions, and the number of states of the input automaton.
|
| 27 Feb 2009 | Carlos Busso (USC) |
Multimodal Processing of Human Behavior in Intelligent Instrumented Spaces: A Focus on Expressive Human Communication
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Advances in technologies to capture and process multimedia signals are enabling new opportunities for understanding and modeling human behavior, and designing new human-centered applications. Intelligent environments equipped with a range of audio-visual sensors provide suitable means for automatically monitoring and tracking the behavior, strategies and engagement of the participants in multiperson interactions such as meetings, at various levels of interest. We describe a case study of a "Smartroom" being developed at USC in which high-level features are calculated from active speaker segmentations, automatically annotated by our system, to infer the interaction dynamics between the participants. The results show that it is possible to accurately estimate in real-time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion. Additionally, we describe analysis of human expressive behavior that can be afforded by such audio-visual data. We describe an analysis of the interrelation between facial gestures and speech using a multimodal approach. Using a controlled setting, motion capture technology was used to simultaneously acquire speech and detailed facial information. Our results indicate that the verbal and non-verbal channels of human communication are internally and intricately connected. The interplay is observed across the different communication channels such as various aspects of speech, facial expressions, and movements of the hands, head and body, and is greatly affected by the linguistic and emotional content of the message being communicated. As a result of the analysis, applications in automatic emotion recognition and synthesis of expressive communication are presented. [This research was supported in part by funds from the NSF, NIH, and the Department of the Army]
|
| 13 Feb 2009 | Joseph Tepperman (Signal Analysis and Interpretation Laboratory, USC) |
Estimating Subjective Judgments of Speech on Multiple Levels
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: People make explicit subjective judgments of speech when doing things like tutoring students in a foreign language, or testing a child's reading skills. On what do we base these judgments, and how can they be made automatically? The "quality" of speech does not exist on any one scale alone, and is not limited strictly to pronunciation - it is manifested through a multiplicity of simultaneous and interacting cues of various sizes. In this talk I'll discuss modeling strategies for categorical pronunciation on several scales, cognitive models for estimating student knowledge demonstrated through speech, and applications in the fields of education and speech synthesis.
|
| 30 Jan 2009 | Kevin Knight |
Sixty Years of Statistical Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This high-level survey will describe the results of statistical machine translation (SMT) research since 1948. Part of the survey will cover the explosion of work in the past few years that has resulted from intense interest on the part of scientists, funders, and industry. We will also examine the roots of SMT in World War II decipherment activities. Some of the concepts from that era have become core to the field, while others still remain to be picked up. |
| 23 Jan 2009 | Roger Levy (UCSD) |
Noise and memory in rational human language comprehension
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Considering the adversity of the conditions under which linguistic communication takes place in everyday life---ambiguity of the signal, environmental competition for our attention, speaker error, our limited memory, and so forth---it is perhaps remarkable that we are as successful at it as we are. Perhaps the leading explanation of this success is that (a) the linguistic signal is redundant, (b) diverse information sources are generally available that can help us infer the intended message (or something close enough) when comprehending an utterance, and (c) we use these diverse information sources very quickly and to the fullest extent possible. This explanation can be thought of as treating language comprehension as a rational, evidential process. Nevertheless, there are a number of prominent phenomena reported in the sentence processing literature that remain clear puzzles for the rational approach. In this talk I address three such phenomena, whose common underlying thread is an apparent failure to use information available in a sentence appropriately in global or incremental inferences about the correct interpretation of a sentence. I argue that the apparent puzzle posed by these phenomena for models of rational sentence comprehension may derive from the failure of existing models to appropriately account for the environmental and cognitive constraints---namely, noisy input and limited memory---under which comprehension takes place. I present two new probabilistic models of language comprehension under noisy input and limited memory, and show that these models lead to solutions to the above puzzles. More generally, these models suggest how appropriately accounting for environmental and cognitive constraints can lead to a more nuanced and ultimately more satisfactory picture of key aspects of human cognition. |
| 17 Dec 2008 | Liang Huang (UPenn => Google Research) |
Tree-based and Forest-based Translation
Time: 3:00 pm - 4:00 pm Location: 4th Floor CR Abstract: What is in common, and what is different, between translating from English to Chinese and compiling C++ into machine code? In this talk I will first introduce a tree-based (aka syntax-directed) paradigm for machine translation, inspired by both human translators and compilers. In this paradigm, a source language sentence is first parsed into a syntactic tree, which is then recursively converted into a target language sentence via tree-to-string transformation rules. Since the translation process is driven by the syntax, this approach resembles the classical "syntax-directed translation" method in compiling theory. However, natural languages are crucially different from programming languages in that they are fundamentally ambiguous. So we don't (and will probably never) have perfect parsers, and parsing errors adversely affect translation quality. To alleviate this problem, an obvious idea is to use the top-k parses, rather than a single 1-best, but this only helps a little bit due to the limited scope of the k-best list. We instead propose a "forest-based approach", which translates a packed forest encoding *exponentially* many parses in a compact (polynomial) space by sharing common subtrees. Large-scale experiments showed very significant improvements (over the 1-best baseline) in terms of translation quality, which outperforms the best reported systems to date. More interestingly, translating a forest of millions of trees is even faster than translating on top-30 individual trees thanks to dynamic programming. This talk includes joint work with Kevin Knight and Aravind Joshi (first part), and with Haitao Mi and Qun Liu (second/third parts).
Short Bio: Liang Huang recently completed his PhD study at the University of Pennsylvania, co-supervised by Aravind Joshi and Kevin Knight (USC/ISI). He is mainly interested in the theoretical aspects of computational linguistics, in particular, efficient algorithms in parsing and machine translation, generic dynamic programming, and formal properties of synchronous grammars. His thesis develops a set of "forest-based methods" that have been applied to many problems in NLP including k-best parsing, forest rescoring and reranking, and forest-based translation. His awards include an Outstanding Paper Award at ACL 2008, and a University Teaching Award at Penn in 2005.
http://www.cis.upenn.edu/~lhuang3/
|
| 07 Nov 2008 | Daniel Marcu |
The best/worst Speech Recognition, Language Modeling, and Machine Translation ideas
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: A group of 60 researchers have been asked to comment on what they perceive to be (1) the most important contributions in the fields of speech recognition, language modeling, and machine translation; (2) past ideas that failed to lead to substantial improvements; and (3) contributions that are most likely to have a material impact in the future.
This talk summarizes the perceptions and trends identified in the collection of answers provided by the researchers.
|
| 17 Oct 2008 | Jens Voeckler |
Parsing XRS with(out) regular expressions
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: If you ever needed to extract information, e.g. LHS, RHS words, features, etc., from an XRS rule, this talk is for you. Over the years, a variety of regular expressions have been used to obtain data from XRS rules. However, in light of recent pipeline efforts, the copy-n-paste culture led to expressions that were sometimes too complex for the task at hand, unnecessarily slowing down processing steps, or too trivial to work correctly on boundary cases. A unified effort by Steve, David, Wei, Michael and Jens culminated in the NLPRules module for Perl. While the talk centers on the Perl module, and some surprising benchmark results, any language supporting libpcre (Perl Compatible Regular Expressions) will benefit from the insights, and from knowing the right regular expression for the task at hand.
|
| 14 Oct 2008 | Victoria Fossum + David Chiang |
Practice talks for AMTA/EMNLP
Time: 3:00 pm - 4:15 pm Location: 11 Large Abstract: Using Bilingual Chinese-English Word Alignments to Resolve PP-Attachment Ambiguity in English (practice talk for AMTA) Errors in English parse trees impact the quality of syntax-based MT systems trained using those parses. Frequent sources of error for English parsers include PP-attachment ambiguity, NP-bracketing ambiguity, and coordination ambiguity. Not all ambiguities are preserved across languages. We examine a common type of ambiguity in English that is not preserved in Chinese: given a sequence "VP NP PP", should the PP be attached to the main verb, or to the object noun phrase? We present a discriminative method for exploiting bilingual Chinese-English word alignments to resolve this ambiguity in English. On a heldout test set of Chinese-English parallel sentences, our method achieves 86.3% accuracy on this PP-attachment disambiguation task, an improvement of 4% over the accuracy of the baseline Collins parser (82.3%). Online Large-Margin Training of Syntactic and Structural Translation Features (practice talk for EMNLP) Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik's soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. 
Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 Bleu on a subset of the NIST 2006 Arabic-English evaluation data. (Joint work with Yuval Marton and Philip Resnik)
|
| 10 Oct 2008 | Sujith Ravi + Steve DeNeefe |
Practice talks for AMTA/EMNLP
Time: 3:00 pm - 4:15 pm Location: 11 Large Abstract: Automatic Prediction of Parser Accuracy (practice talk for EMNLP) Statistical parsers have become increasingly accurate, to the point where they are useful in many natural language applications. However, estimating parsing accuracy on a wide variety of domains and genres is still a challenge in the absence of gold-standard parse trees. We propose a technique that automatically takes into account certain characteristics of the domains of interest, and accurately predicts parser performance on data from these new domains. As a result, we have a cheap (no annotation involved) and effective recipe for measuring the performance of a statistical parser on any given domain. (Joint work with Kevin Knight and Radu Soricut)
Overcoming Vocabulary Sparsity in MT Using Lattices (practice talk for AMTA) Source languages with complex word formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation. (Joint work with Ulf Hermjakob and Kevin Knight)
|
| 26 Sep 2008 | Eugene Charniak (Brown University) |
EM Works for Pronoun-Anaphora Resolution
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: EM (the Expectation Maximization Algorithm) is a well known technique for unsupervised learning (where one does not have any hand labeled solutions available, but instead one must learn from the raw text). Unfortunately EM is known to fail to find good solutions in many (most?) applications on which it is tried. In this talk we present some recent work on using EM to learn how to resolve pronoun-anaphora: determining that "the dog" is the antecedent of "he" and "his" in "When Sally fed the dog he wagged his tail". For this application EM works strikingly well, determining tens of thousands of parameters and resulting in a program that probably produces state of the art results, although because this is preliminary work, and pronoun-anaphora has no standard evaluation metrics, this is just a guess.
About the Speaker: Eugene Charniak is Professor of Computer Science and Cognitive Science at Brown University. He received an A.B. degree in Physics from University of Chicago and a Ph.D. from M.I.T. in Computer Science. He has published four books: Computational Semantics, with Yorick Wilks (1976); Artificial Intelligence Programming (now in a second edition) with Chris Riesbeck, Drew McDermott, and James Meehan (1980, 1987); Introduction to Artificial Intelligence with Drew McDermott (1985); and Statistical Language Learning (1993). He is a Fellow of the American Association for Artificial Intelligence and was previously a Councilor of the organization. His research has always been in the area of language understanding or technologies which relate to it, such as knowledge representation, reasoning under uncertainty, and learning. Over the last few years he has been interested in statistical techniques for language understanding. His research in this area has included work in the subareas of part-of-speech tagging, probabilistic context-free grammar induction, and, more recently, syntactic disambiguation through word statistics, efficient syntactic parsing, and lexical resource acquisition through statistical means.
|
| 19 Sep 2008 | Fei Sha (USC) |
Large margin based parameter estimation for hidden Markov models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: In many application domains, we face the task of characterizing the distribution of continuous random variables. For instance, in automatic speech recognition (ASR), these variables are acoustic properties of speech signals. For such tasks, Gaussian mixture models (GMMs) are widely used as a very effective density estimator. Particularly, in the context of ASR, they are embedded in continuous-density hidden Markov models (CD-HMMs) to yield emission probabilities, i.e., the likelihoods of acoustic observations conditioned on hidden states such as phonemes. Meanwhile, the transition probabilities in CD-HMMs attempt to capture temporal properties of speech signals. Similar modeling choices arise in other applications, for instance, in activity recognition. Various techniques have been developed to estimate the parameters of CD-HMMs. In particular, discriminative techniques such as conditional maximum likelihood and minimum classification error have attracted significant research attention. When carefully and skillfully implemented, they often lead to lower error rates (in speech recognition) than traditional techniques of maximum likelihood estimation. In this talk, I will describe a new discriminative technique that is based on the principle of large margin, a key framework in many machine learning algorithms including support vector machines and boosting. The new technique differs from previous discriminative methods for ASR in the goal of margin maximization. In particular, in our large margin training of CD-HMMs, model parameters are optimized to maximize the gap (or the margin) between correct and incorrect classifications. I will present an extensive empirical evaluation of our approach on two benchmark problems in speech recognition: phonetic classification and recognition on the TIMIT speech database. 
In both tasks, large margin systems obtain significantly better performance than systems trained by maximum likelihood estimation or competing discriminative frameworks. An in-depth analysis also reveals some interesting features of our approach, which contribute to the superior performance. Towards the end of the talk, I will discuss briefly the connection of our work to the structured prediction problems in the machine learning community. I will also discuss the future direction of this line of work and other application potentials.
|
| 22 Aug 2008 | Amittai Axelrod (UW) |
Intern Final Talk: Structural constraints for efficient decoding.
Time: 3:45 pm - 4:15 pm Location: 11 Large Abstract: String-to-tree machine translation decoders are effective but very slow, especially compared to other decoding approaches. We explore various methods to identify constraints on the search space, with the aim of improving the efficiency of the syntax-based decoder. |
| 22 Aug 2008 | Catalin Tirnauca (Univ. Rovira i Virgili) |
Intern Final Talk: On the Consistency of Probabilistic Context-Free Grammars
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Probabilistic context-free grammars can describe probability distributions over strings, i.e., the sum of probabilities of all generated strings is 1. This condition is often called consistency. It has applications in fields of natural language processing such as probabilistic parsing (disambiguate by picking the parse with the highest score), or speech recognition (rank hypotheses returned by a speech recognizer).
The talk is a survey of some of the previous results. We investigate how we can determine if a probabilistic context-free grammar is consistent, and if such a test can always be done. Also, we study a method, namely normalization, which guarantees consistent probabilistic context-free grammars. Moreover, we mention briefly some techniques that train probabilistic context-free grammars and guarantee consistency.
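As a small illustration of the consistency condition (not taken from the talk), consider the textbook grammar S -> S S with probability p and S -> a with probability 1 - p. The total probability of all finite derivations is the smallest non-negative solution q of q = p*q^2 + (1 - p), which can be approximated by fixed-point iteration starting from 0. The function name `terminating_mass` below is hypothetical, and the sketch covers only this one-nonterminal grammar.

```python
def terminating_mass(p: float, iterations: int = 500) -> float:
    """Approximate the total probability of all finite derivations of
    the grammar  S -> S S (prob. p) | a (prob. 1 - p).

    Iterating q <- p*q^2 + (1 - p) from q = 0 converges to the smallest
    non-negative fixed point, i.e. the mass assigned to finite strings.
    """
    q = 0.0
    for _ in range(iterations):
        q = p * q * q + (1.0 - p)
    return q


if __name__ == "__main__":
    for p in (0.4, 0.6):
        mass = terminating_mass(p)
        verdict = "consistent" if abs(mass - 1.0) < 1e-9 else "inconsistent"
        print(f"p={p}: mass of finite strings ~ {mass:.6f} ({verdict})")
```

With p = 0.4 the mass converges to 1, so the grammar is consistent; with p = 0.6 it converges to (1 - p)/p = 2/3, so a third of the probability is lost to derivations that never terminate.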
|
| 20 Aug 2008 | John DeNero (Berkeley) |
Intern Final Talk: Minimum Risk Decoding over Forests
Time: 3:45 pm - 4:15 pm Location: 11 Small Abstract: Minimum Bayes risk (MBR) decoding improves the output of machine translation systems by selecting a translation that matches a large proportion of the k-best hypotheses of a system. We extend this idea to apply to packed forests by selecting an output sentence that matches a large proportion of all hypotheses in the pruned forest of derivations from a syntax-based translation system. |
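The k-best form of MBR decoding described above can be sketched as follows. This is only an illustration under stated assumptions: the similarity function `overlap` (clipped unigram precision) and the toy hypotheses and probabilities are made up, real systems typically use a BLEU-like gain, and the forest-based extension from the talk is not shown.

```python
from collections import Counter


def overlap(hyp: str, ref: str) -> float:
    """Hypothetical similarity: fraction of hyp's words that also occur
    in ref, with counts clipped (unigram precision)."""
    hyp_words = hyp.split()
    if not hyp_words:
        return 0.0
    ref_counts = Counter(ref.split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(hyp_words).items())
    return matched / len(hyp_words)


def mbr_select(kbest):
    """kbest: list of (translation, posterior probability) pairs.

    Returns the translation with the highest expected similarity to all
    hypotheses in the list, i.e. the one with the lowest expected loss.
    """
    def expected_gain(candidate):
        return sum(prob * overlap(candidate, other) for other, prob in kbest)

    return max((hyp for hyp, _ in kbest), key=expected_gain)


if __name__ == "__main__":
    kbest = [
        ("the man is going home", 0.36),  # 1-best, but an outlier
        ("he went home", 0.33),
        ("he went back home", 0.31),
    ]
    print(mbr_select(kbest))
```

In this toy list the 1-best hypothesis shares little with the rest, so the selection moves to "he went home", which agrees with most of the probability mass.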
| 20 Aug 2008 | Kyle Gorman (Penn) |
Intern Final Talk: The Entropy of English given French
Time: 3:00 pm - 3:30 pm Location: 11 Small Abstract: The fundamental task in statistical machine translation (SMT) is to characterize the probability of a target sentence given its source sentence; for translating French into English, P(e | f). By applying Bayes' rule, we derive the fundamental theorem of SMT: choose the e maximizing P(e) P(f | e). Advances in SMT come from improving estimations of these two terms, or from more efficient ways of searching for optimal solutions (Brown et al. 1993).
In the case of language modeling, Shannon (1949) and Brown et al. (1992) identified upper and lower bounds for the per-character entropy of English, H(e), for humans and machines, respectively. We ask the same question for SMT, H(e | f), comparing the results for human translators and a simple machine baseline based on IBM Model 1. These numbers are the upper and lower bounds for SMT systems trained on parallel data.
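The noisy-channel decision rule mentioned in the abstract, picking the e that maximizes P(e) P(f | e), can be sketched on a toy candidate set. All probabilities and sentences below are made up for illustration, and the names `LANGUAGE_MODEL`, `TRANSLATION_MODEL` and `decode` are hypothetical.

```python
import math

# Toy, made-up probabilities for the French input "la maison est petite".
LANGUAGE_MODEL = {        # P(e): how fluent is the English candidate?
    "the house is small": 0.6,
    "small the house is": 0.1,
    "the home is little": 0.3,
}
TRANSLATION_MODEL = {     # P(f | e): how well does e explain the French?
    "the house is small": 0.5,
    "small the house is": 0.5,
    "the home is little": 0.2,
}


def decode(candidates):
    """Return the candidate e maximizing log P(e) + log P(f | e)."""
    return max(
        candidates,
        key=lambda e: math.log(LANGUAGE_MODEL[e]) + math.log(TRANSLATION_MODEL[e]),
    )


if __name__ == "__main__":
    print(decode(list(LANGUAGE_MODEL)))
```

The two word-order variants are equally good under the toy translation model, and the language model term is what separates them.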
|
| 18 Jul 2008 | Sujith Ravi |
Deciphering Ciphers Optimally Using Only Minimal Knowledge of the Source Language
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will be talking about deciphering letter-substitution ciphers *optimally* using only minimal knowledge (bigrams, trigrams, etc.) of the source language, instead of relying on large look-up dictionaries. We also plan to show how our empirical results compare with Shannon's predictions on the equivocation curves and unicity distance measure. |
| 11 Jul 2008 | Jon May |
Thesis Proposal Practice Talk: A Weighted Tree Transducer Toolkit for Syntactic Natural Language Processing Models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Solutions for many natural language processing problems such as speech recognition, transliteration, and translation have been described as weighted finite-state transducer cascades. The transducer formalism is very useful for researchers, not only for its ability to expose the deep similarities between seemingly disparate models, but also because expressing models in this formalism allows for rapid implementation of real, data-driven systems. Finite-state toolkits can interpret and process transducer chains using generic algorithms and many real-world systems have been built using these toolkits. Current research in NLP makes use of syntax-rich models that are poorly suited to extant transducer toolkits, which process linear input and output. Tree transducers can handle these models, and a weighted tree transducer toolkit with appropriate generic algorithms will lead to the sort of gains in syntax-based modeling that were achieved with string transducer toolkits. In this thesis proposal practice talk I will briefly trace the history of finite-state transducers and automata as they relate to natural language processing and the evolution of formalisms and the toolkits that support them, leading up to motivation for the design and creation of Tiburon, the toolkit referenced in this talk's title. I will describe previous, current, and future work on Tiburon's algorithms and the effectiveness of both algorithms and software at cleanly representing syntax-based NLP models from the literature and at constructing and evaluating novel models. |
| 13 Jun 2008 | Ellen Riloff |
Effective Information Extraction with Relevant Regions and Semantic Affinity Patterns
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will briefly overview the landscape of event-oriented information extraction (IE) systems and explain why it is especially challenging to learn IE systems without annotated training data. Then I will describe one attempt to do so by decoupling the tasks of finding relevant text regions and applying extraction patterns. First, a self-trained relevant sentence classifier identifies relevant regions in documents. Second, a "semantic affinity" measure identifies domain-relevant extraction patterns. We further distinguish between "primary" patterns and "secondary" patterns and apply the patterns selectively in the relevant regions. This approach is weakly supervised, requiring only a few seed patterns plus relevant and irrelevant (but unannotated) documents for training. The resulting IE system achieves reasonably good performance, despite the fact that the relevant region classifier leaves a lot to be desired. |
| 06 Jun 2008 | Tom Murray (USC) |
Knowledge as a Constraint on Uncertainty for Unsupervised Classification
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk investigates the use of domain knowledge to constrain and improve the unsupervised learning of a classifier, by placing limits or biases on the possible hypotheses for each input. Theoretically, we view the contribution of the knowledge source as a reduction in the uncertainty of the model's decisions, quantified by the resulting conditional entropy of the label distribution given the input corpus. Evaluating on the simple case of an unsupervised HMM tagger, we find surprising levels of improvement from little knowledge, with more stable and efficient training convergence and label assignment, and a high degree of correlation between classification entropy and model performance. We conclude that, while we should always seek better generic models and techniques, for applications in an unsupervised setting, knowledge may still be key. |
| 30 May 2008 | Steve DeNeefe |
BLEU Sway Issues: one way to get statistical significance, two ways to get a better score, and three ways to thwart them
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: BLEU is the de facto standard for evaluation and development of statistical machine translation systems. We describe three real-world situations involving comparisons between different versions of the same systems where one can obtain improvements in BLEU scores that are questionable or even absurd. We propose a very conservative modification to BLEU that addresses these issues while improving correlation with human judgements, then explore some deeper modifications that alleviate the problems further. |
| 16 May 2008 | David Newman (UCI) |
Theory and Applications of Topic Modeling
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Topic models, a class of Bayesian probabilistic models for discrete data, have recently gained popularity in applications ranging from document modeling to computer vision. Since the introduction of Latent Dirichlet Allocation (LDA) in 2003, there have been numerous extensions to this archetype. I will review the theory behind LDA, and discuss subsequent models, including (some of): Correlated Topic Model, Dynamic Topic Model, Hierarchical Topic Model, Special Words Topic Model, Hierarchical Dirichlet Process Model, Pachinko Allocation Machine, Topics and Syntax Model, Bi-LDA, Author-Topic Model, Supervised Topic Model, Spatial LDA, etc. |
| 09 May 2008 | John DeNero (Berkeley) |
Inference in phrase alignment models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Models that align phrases instead of words offer an appealing alternative to the standard relative frequency estimates of phrase translation probabilities. But, while some effective word alignment models (Model 1, Model 2 & HMM) can be estimated tractably with EM, phrase alignment models cannot. I'll talk about how to show that estimation and inference under these models is intractable. Then, I'll present two useful approximation techniques. First, I'll talk about how to cast phrase alignment search as an integer linear programming (ILP) problem and find the optimal alignment reliably and quickly with off-the-shelf ILP software. Some applications of this technique include training phrase alignment models and interpreting the output of word alignment models. Second, we'll look at how to estimate translation probabilities under a phrase alignment model using a Gibbs sampling procedure. The sampler has some nice asymptotic convergence properties and also seems to produce good results in practice. I'll walk through the different models we've trained and how they performed.
Time permitting, I'll also talk about some of the ways in which we could potentially extend this work to syntactic MT.
|
| 02 May 2008 | Zornitsa Kozareva |
Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a novel approach to weakly supervised semantic class learning from the web, using a single powerful hyponym pattern combined with graph structures, which capture two properties associated with pattern-based extractions: popularity and productivity. Intuitively, a candidate is popular if it was discovered many times by other instances in the hyponym pattern. A candidate is productive if it frequently leads to the discovery of other instances. Together, these two measures capture not only frequency of occurrence, but also cross-checking that the candidate occurs both near the class name and near other class members. We developed two algorithms that begin with just a class name and one seed instance and then automatically generate a ranked list of new class instances. We conducted experiments on four semantic classes and consistently achieved high accuracies. |
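The two graph properties described above can be illustrated with a minimal sketch. It assumes that each extraction is recorded as a (source, found) pair and approximates popularity by in-degree and productivity by out-degree; the actual scoring functions used in the talk may differ, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def popularity_and_productivity(discoveries):
    """discoveries: iterable of (source, found) pairs, meaning that
    instantiating the hyponym pattern with `source` extracted `found`.

    Returns {instance: (popularity, productivity)}, where popularity is
    the number of distinct instances that discovered it (in-degree) and
    productivity is the number of distinct instances it discovered
    (out-degree).
    """
    in_neighbors = defaultdict(set)
    out_neighbors = defaultdict(set)
    for source, found in discoveries:
        if source != found:  # ignore self-loops
            out_neighbors[source].add(found)
            in_neighbors[found].add(source)
    nodes = set(in_neighbors) | set(out_neighbors)
    return {n: (len(in_neighbors[n]), len(out_neighbors[n])) for n in nodes}


if __name__ == "__main__":
    edges = [
        ("lion", "tiger"), ("lion", "leopard"),
        ("tiger", "lion"), ("tiger", "leopard"),
        ("leopard", "lion"),
    ]
    for name, (pop, prod) in sorted(popularity_and_productivity(edges).items()):
        print(f"{name}: popularity={pop}, productivity={prod}")
```

Ranking candidates by such scores is one simple way to turn the linkage graph into the ranked list of class instances mentioned in the abstract.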
| 25 Apr 2008 | David Chiang |
Tutorial: Randomized data structures for large statistical NLP models
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Randomized algorithms are those which use randomness to achieve efficient performance with a bounded probability of error; typically, the bound is adjustable and the performance depends on the bound. Randomized data structures, likewise, use randomness to achieve efficient storage with a bounded probability of error. I will give an overview of the use of such data structures, namely, Bloom filters and "Bloomier" filters, for storing very large n-gram language models, and will discuss possibilities for using randomized data structures for other purposes as well. |
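A minimal Bloom filter sketch shows the storage trade-off described above: membership queries never miss an inserted item, but may occasionally report an item that was never inserted. The class below is an illustration only, not the implementation discussed in the tutorial; the bit-array size, the number of hash functions, and the use of salted SHA-256 are arbitrary assumptions. It stores set membership (e.g. which n-grams were seen), not the counts or probabilities a full language model needs.

```python
import hashlib


class BloomFilter:
    """Set-membership structure with false positives but no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter(num_bits=10_000, num_hashes=3)
    for ngram in ("the cat sat", "cat sat on", "sat on the"):
        seen.add(ngram)
    print("cat sat on" in seen)  # inserted items are always reported as present
```

Making the bit array larger lowers the false-positive rate at the cost of memory, which is the adjustable error bound the abstract refers to.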
| 18 Apr 2008 | Rahul Bhagat |
Learning Paraphrases from Text
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Paraphrases are textual expressions that convey the same meaning using different words. They capture variability, which is a common phenomenon in language. Given this, paraphrases have been shown to be useful in many natural language applications like Question-Answering, Machine Translation, Summarization and Information Retrieval. In this talk, I'll discuss the phenomenon of paraphrasing and focus on methods for automatically acquiring paraphrases from text. |
| 11 Apr 2008 | Jon May |
Syntactic Re-Alignment Models for Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a method for improving word alignment for statistical syntax-based machine translation that employs a syntactically informed alignment model closer to the translation model than commonly-used word alignment models. This leads to extraction of more useful linguistic patterns and improved BLEU scores on translation experiments in Chinese and Arabic. |
| 04 Apr 2008 | Ulf Hermjakob |
Name Translation in Statistical Machine Translation: Learning When to Transliterate
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present a method to transliterate names in the framework of end-to-end statistical machine translation. The system is trained to learn when to transliterate.
For Arabic to English MT, we developed and trained a transliterator on a bitext of 7 million sentences and Google's English terabyte ngrams and achieved better name translation accuracy than 3 out of 4 professional translators. The talk also includes a discussion of challenges in name translation evaluation.
|
| 25 Mar 2008 | Jason Riesa |
Tutorial on Arabic Orthography
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: This tutorial is intended to provide attendees with working knowledge of the Arabic writing system. No previous experience with Arabic is required. At the end of this tutorial you should be able to read and segment individual Arabic characters, read common ligatures, identify possible affixes on stems, and understand the various lexical normalizations used in Arabic text preprocessing. The focus will be on the formal writing system in printed text for Modern Standard Arabic, although handwriting will be briefly discussed. |
| 18 Jan 2008 | Victoria Fossum |
Using Syntax to Improve Word Alignment Precision for Syntactic Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Automatically word-aligning a parallel bitext in the source and target languages constitutes the first stage of most statistical machine translation pipelines. Automatic word alignment is error-prone, and produces many incorrect links. Incorrect links that violate syntactic correspondences interfere with the extraction of string-to-tree transducer rules for syntactic machine translation. We present an algorithm for identifying and deleting incorrect word alignment links, using features of the extracted rules. We obtain gains in both alignment quality and translation quality in Chinese-English and Arabic-English translation experiments, relative to a GIZA++ union baseline. |
| 11 Jan 2008 | Kevin Knight |
How to Make EM Do What You Want
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I'll talk about some unsupervised learning experiments -- how I was satisfied with the initial results, how I became very dissatisfied, and how I became (somewhat) satisfied again. |
| 14 Dec 2007 | Marieke van Erp |
MITCH: Mining for Information in Texts from the Cultural Heritage
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Naturalis, the Dutch National Museum of Natural History, harbours one of the largest treasures of the world: the key specimens of millions of animals found throughout the world through centuries of biological expeditions. While the depot where the animals are stored is a technical marvel, Noah's ark of the 21st century, it is hard to search through it. Research in taxonomy, the evolution of life and biodiversity revolves around the specimens in the depot. The main keys to accessing the depot are (mostly) handwritten expedition logs and registration books, which are currently being photographed and keyed in to be stored in searchable digital archives. Such digital logs already enable a kind of "Biogoogle" search, but actual research questions are more complicated ("how did this kind of frog develop over the last century in the Amazon rainforests?"), and demand more intelligent handling. This is where the MITCH project comes in. The goal of MITCH is to turn the field logs and registration books into a populated semantic network, in which concepts such as animal specimens are related to all other concepts that define them: where, when, under which circumstances and by whom were they found, who described them first in the academic literature, who prepared them for storage in the Naturalis depot, which registration number was assigned to them, etc. This means that all textual descriptions of a specimen need to be parsed into exactly these concepts and their relations. All of this needs to be done at a scale that goes far beyond the human capacity, as tens of thousands of digitized but unanalysed textual records are waiting for semantic analysis. This necessitates the use of state-of-the-art machine learning methods that learn from examples automatically.
The project addresses its goals on three levels. The basic level is the development and application of automatic data cleaning and markup tools. On top of this, semi-structured textual material, such as fieldbook logs and scientific papers, is semi-automatically converted to a searchable knowledge base. Search results are visualised by displaying maps and specimen photos. The conversion phase assumes the active intervention of domain experts, such as collection managers, to correct and steer the automatic extraction procedure. At the top level, information resources are cross-linked using a domain ontology, populating a semantic network that can be hooked up to any other standardised cultural heritage knowledge base or to a search engine.
|
| 02 Nov 2007 | Bill Rounds (Michigan and Stanford) |
Constructions, Constraints, Transducers, and TAGs: A unifying view through Feature Logic
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The value of mathematical formalisms for speech recognition, language generation, and machine translation has long been recognized. Not so much work, though, has been spent reconciling these formalisms with linguistic theories. In this talk I'll propose a theoretical descriptive mechanism based on feature logic, which is central to construction and constraint-based linguistic theories like construction grammar and HPSG, and which can be used to view tree transducers and tree-adjoining grammars as giving rise to a construction-based framework. |
| 19 Oct 2007 | Slav Petrov (Berkeley) |
Learning and Inference for Hierarchically Split PCFGs
Time: 10:30 am - 11:30 am Location: 11 Large Abstract: Treebank parsing can be seen as the search for an optimally refined grammar consistent with a coarse training treebank. We describe a method in which a minimal grammar is hierarchically refined using EM to give accurate, compact grammars. The resulting grammars are extremely compact compared to other high-performance parsers, yet the parser gives the best published accuracies on several languages, as well as the best generative parsing numbers in English. In addition, we give an associated coarse-to-fine inference scheme which vastly improves inference time with no loss in test set accuracy. |
| 17 Oct 2007 | Jon Patrick (Univ. of Sydney) |
Enhancement Technologies for ICU Information Systems
Time: 3:30 pm - 4:30 pm Location: 11 Large Abstract: The School of Information Technologies at the University of Sydney has had a 3-year partnership with the Intensive Care Unit at the Royal Prince Alfred Hospital, Sydney. In that time they have managed 8 joint projects aimed at producing software solutions that enhance productivity in the Unit and, in some cases, enable entirely new functionalities in their information systems. The principal motivation for the research is the processing of the narratives in clinical notes, but concomitant problems in information systems have also been tackled, and the combination of the two disciplines has led to the two related processing systems to be described in this presentation.
- Ward Rounds Information Systems (WRIS) & Handovers - The WRIS is designed to support the work of all clinical staff in their ward rounds activities. The system, when activated, automatically populates from the resident clinical database a pro forma report with the most recent relevant data about the patient, such as vital signs, pathology reports, and other diagnostic measurements, presented as a web page. The clinical staff then write their progress notes into the web page, which converts the text to SNOMED CT codes and other relevant concepts and entities. The clinician is given the opportunity to change any analyses done by the processor. This clinician-approved data is loaded to the patient record. The essential elements of this system, that is, computing an extract of the patient record, accepting narrative input, and analysing the text for coding, are a productivity gain in themselves, but more importantly, they also constitute the beginning of a hospital-wide Handovers System for use throughout each step in the patient journey. This system is being tested at the RPAH ICU in readiness for ward usage. The impact of this system in improving the quality and safety of handovers has the potential to be very significant.
- Clinical Data Analytics Language (CDAL) - General purpose access to data from clinical information systems, beyond retrieval for point of care work, is needed for many aspects of the hospital's work, particularly for clinical research, logistics & operational planning, and auditing patient safety. Most current clinical systems only provide access to data identified in standard reports, with no flexibility to make ad hoc enquiries or to pursue new directions of enquiry. The clinical data analytics language developed enables the expression, in a restricted natural language, of any question that can be answered from the data in the database. A prototype of the language has been developed for the CareVue information system used in the ICU at the Royal Prince Alfred Hospital. It provides for the use of local medical dialects, SNOMED CT terminology including all forms of collective expressions in SNOMED (e.g. infectious diseases), specification of patient groups, a variety of statistical functions, and constraints over any medical variable, Time, and Location. CDAL is general in that it can be bolted on to any clinical information system and is applicable to any clinical specialisation.
|
| 12 Oct 2007 | David Talbot (Edinburgh) |
Scalable Language Modeling: Breaking the Curse of Dimensionality
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Randomized data structures can help us scale discrete models encountered in NLP. This talk will describe their use in language modeling and present some more general related results. N-gram language models are fundamental to speech recognition and machine translation. Unfortunately, the n-gram parameter space grows exponentially with the dimension of the feature vector. I will describe how randomization can be used to remove the space-dependency of such models on the a priori parameter space. The novel extensions of the Bloom filter that I will present are able to take advantage of the entropy of the distribution of values assigned to feature vectors to save space in a discrete statistical model. I will review some results applying these models to language modeling in machine translation and relate their space-requirements to a novel lower bound on the general problem of querying a map of key/value pairs. No prior knowledge of randomized data structures will be assumed.
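As a rough illustration of the kind of randomized data structure the abstract refers to, the following is a minimal sketch of a plain Bloom filter storing a set of n-grams. It is not the extended structure presented in the talk; the class name `BloomFilter`, the salted SHA-256 hashing scheme, the parameter values, and the toy n-grams are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: may return false positives, never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Assumption: derive k hash positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # A key is reported present only if every one of its bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical toy data: bigrams from a short sentence.
ngrams = ["the cat", "cat sat", "sat on", "on the", "the mat"]
bf = BloomFilter(num_bits=1024, num_hashes=3)
for ng in ngrams:
    bf.add(ng)
```

The memory used is fixed by `num_bits` regardless of how long the stored n-grams are, which is the space property that makes such structures attractive for large language models; the price is a small probability of reporting an unseen n-gram as present.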
|
| 05 Oct 2007 | Sujith Ravi |
Will this parser work with my data? - Predicting Parser Accuracy without Gold-Standard information
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: There are many tools available to the NLP community for natural language parsing (i.e., converting a raw sentence into a parse tree). NLP researchers usually take some "off-the-shelf" parser that has been trained on the Wall Street Journal (WSJ) corpora and apply it to their data. This works in many cases, especially for systems which use data from WSJ or similar corpora. However, in real-life applications, the data may be compiled from many different sources and span different genres, and may not be similar to the WSJ corpora in terms of sentence structure, etc. A particular parser might parse well on some corpora and not so well on others. Choosing the right parser for your data may have an impact on the performance of the NLP system as a whole. But in order to measure the accuracy of any parser for a given corpus, we require a set of gold-standard parse trees corresponding to the sentences within the corpus. Generating a gold-standard set takes a lot of manual work, and in many real-life applications it is not feasible to generate gold-standard parses for large corpora.
We attempted to build a system which can predict the accuracy (in terms of f-measure) of the Charniak parser (a popular parsing tool) on any given sentence corpus. Without using any additional information (i.e., gold-standard parses), our system predicts "how accurately the Charniak parser could parse the given corpus". In order to evaluate our system's predictions on a particular corpus, we compute the "Correlation" measure between the "actual accuracies (using Gold-standard)" and the "predicted accuracies (from our system)" for the given corpus. We tested our system on different corpora and using different methods and will present these results.
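The evaluation described above compares actual and predicted accuracies through a correlation measure. The abstract does not say which correlation coefficient is used; the sketch below assumes Pearson's r, and the function name `pearson` and the accuracy values are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical f-measure values, not results from the talk.
actual = [0.90, 0.85, 0.80, 0.75]
predicted = [0.88, 0.86, 0.79, 0.77]
r = pearson(actual, predicted)
```

A value of `r` close to 1 would indicate that the predicted accuracies rank and scale the sentences or corpora much as the gold-standard evaluation does.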
|
| 29 Aug 2007 | Carmen Heger (Dresden) Michael Bloodgood (Delaware) |
Summer Intern Presentations: Composition of Tree Transducers AND Using the Perceptron Algorithm to Tune Large Numbers of Feature Weights for Syntax-Based Statistical Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Composition of Tree Transducers Since finite-state (string) transducers are not expressive enough for many NLP applications, computational linguists started to investigate tree transducers, for example for the task of machine translation. Quite a lot of successful work has been done on generalizing results from string transducers to tree transducers. But when it comes to composition, the results are not satisfying, because tree transducers are generally not closed under composition. Still, we think that most of the tree transducers used in NLP are composable, which is why we defined the composition problem for two individual transducers instead of for the whole class. During the summer we started with linear nondeleting tree transducers with epsilon rules and worked toward an algorithm to decide, for two such transducers, whether their composition is again in the same class.
Using the Perceptron Algorithm to Tune Large Numbers of Feature Weights for Syntax-Based Statistical Machine Translation
Current state-of-the-art syntax-based statistical machine translation
systems produce many candidate translations out of which the output translation
is selected by taking the argmax over all candidates i of <w,f_i> where w is a
weight vector and f_i is a vector of the feature values for candidate i. The
features used by the system and their corresponding weights have a major impact
on a system's performance. Currently, Minimum Error Rate Training (MERT) is used to
tune the weights of the features. A drawback of this is that it isn't tractable
to tune large numbers of feature weights. I will discuss using the perceptron
algorithm to tune feature weights for statistical machine translation. If I get interesting
results before my talk, I may also discuss new classes of features (potentially very large
numbers of features) that can be used for improving MT performance.
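The selection rule and the idea of perceptron tuning can be sketched as follows. This is a generic structured-perceptron-style update, not necessarily the procedure used in the talk: it assumes that for each training sentence some "oracle" candidate (for example, the one closest to a reference translation) has already been identified, and the names `select`, `perceptron_update`, and `oracle_index` are hypothetical.

```python
def dot(w, f):
    """Inner product <w, f> of a weight vector and a feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f))


def select(w, candidates):
    """Return the index i of the candidate maximizing <w, f_i>."""
    return max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))


def perceptron_update(w, candidates, oracle_index, lr=1.0):
    """One perceptron step: move w toward the oracle's features and away
    from the currently selected candidate's features, if they differ."""
    predicted = select(w, candidates)
    if predicted == oracle_index:
        return list(w)  # already selecting the oracle; no change
    f_oracle = candidates[oracle_index]
    f_pred = candidates[predicted]
    return [wi + lr * (fo - fp) for wi, fo, fp in zip(w, f_oracle, f_pred)]


# Hypothetical toy example with two candidates and two features.
candidates = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 0.0]
w_new = perceptron_update(w, candidates, oracle_index=1)
```

Each update costs time linear in the number of features, which is one reason a perceptron is a natural thing to try when the number of feature weights is large.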
|
| 24 Aug 2007 | Wei Ho (Princeton) Jennifer Gillenwater (Rice) |
Summer Intern Presentations: Noisy Language Models AND Context for Syntax-Based Translation Rules
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: Noisy Language Models The language models used in statistical machine translation are often quite large, requiring significant memory and sometimes pre-processing in order to be utilized effectively. It would be desirable to have more compact representations of language models while minimizing the impact on translation quality. Various quantization methods and lossy storage of language models will be presented. Context for Syntax-Based Translation Rules The rules that a translation system employs should be applicable in many contexts. This ensures that a rich language is expressible with a minimum number of rules. However, when rules that are applicable in too many contexts are combined, they result in nonsensical translations. How can we keep rules general but constrain the context of their use? This summer we explored the approach of constraining the context by conditioning on various neighboring elements of each rule.
|
| 16 Aug 2007 | Anoop Sarkar (Simon Fraser) |
Extensions of Regular Tree Grammars and their relation to Tree Adjoining Grammars
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: There is a hierarchy of generative devices that generate trees: starting with regular tree languages (RTLs), which are contained within context-free tree languages (CFTLs), and so on. The string yield of the RTLs is exactly the set of Context-Free Languages, while the yield of the CFTLs is exactly the set of Indexed Languages. In this talk we introduce Adjoining Tree Languages (ATLs) which sit in between RTLs and CFTLs. The yield of ATGs is exactly the set of Tree-Adjoining Languages. Just like RTGs are stronger than CFGs, ATGs are stronger than TAGs. In addition we will show that the ATG notation simplifies many of the foundational proofs for TAGs including proofs of the closure properties. In particular, ATLs do not use adjunction constraints, and thus are much easier to understand than TAGs. We compare ATGs with previously proposed simplifications of CFTGs, called monadic simple CFTGs, which also have been shown to be weakly equivalent to TAG (i.e. they generate the same set of string languages). We consider the question of whether these two weakly equivalent formalisms are strongly equivalent (i.e. generate exactly the same set of tree languages). Finally, we will show that the standard definition used for probabilistic TAG is (surprisingly) very different from the natural definition of probabilistic ATL. Using an example of PP-attachment ambiguity we show that the two probabilistic models are different from each other. About the speaker: Anoop Sarkar is an assistant professor in the Department of Computing Science at Simon Fraser University. He received his PhD in 2002 from the Department of Computer and Information Science at the University of Pennsylvania, with Prof. Aravind Joshi as his advisor. His research work is on machine learning, especially semi-supervised learning, applied to the processing of natural language and stochastic formal grammars.
Anoop Sarkar's web-page: http://www.cs.sfu.ca/~anoop
|
| 15 Jun 2007 | Donghui Feng |
Extracting Data Records from Unstructured Biomedical Full Text
Time: 11:00 am - 11:30 am Location: 11 Large Abstract: In this paper, we address the problem of extracting data records and their attributes from unstructured biomedical full text. There has been little effort reported on this in the research community. We argue that semantics is important for record extraction or finer-grained language processing tasks. We derive a data record template including semantic language models from unstructured text and represent them with a discourse level Conditional Random Fields (CRF) model. We evaluate the approach from the perspective of Information Extraction and achieve significant improvements on system performance compared with other baseline systems. |
| 15 Jun 2007 | Alex Fraser |
Getting the structure right for word alignment: LEAF
Time: 10:30 am - 11:00 am Location: 11 Large Abstract: Automatic word alignment is the problem of automatically annotating parallel text with translational correspondence. Previous generative word alignment models have made structural assumptions such as the 1-to-1, 1-to-N, or phrase-based consecutive word assumptions, while previous discriminative models have either made one of these assumptions directly or used features derived from a generative model using one of these assumptions. We present a new generative alignment model which avoids these structural limitations, and show that it is effective when trained using both unsupervised and semi-supervised training methods. Experiments show strong improvements in word alignment accuracy, and using the generated alignments in hierarchical and phrasal SMT systems improves the BLEU score. |
| 08 Jun 2007 | Jonathan May |
Bisimulation Minimisation for Weighted Tree Automata
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: We describe existing forward and backward bisimulation minimisation algorithms for nondeterministic automata and extend these algorithms to weighted tree automata. The extended algorithms, which work for all semirings, retain the time complexity of their counterparts for unweighted tree automata for additively cancellative semirings, and are only slightly higher (linear instead of logarithmic in the number of states) on other semirings. We describe the effectiveness of an implementation of these algorithms on a typical task in natural language processing.
This is joint work with Johanna Hogberg, Umea University and Andreas
Maletti, Technische Universitat Dresden.
|
| 08 Jun 2007 | Liang-Chih Yu (Cheng Kung U) |
Topic Analysis for Psychiatric Document Retrieval (Practice Talk for ACL)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Psychiatric document retrieval attempts to help people to efficiently and effectively locate the consultation documents relevant to their depressive problems. Individuals can understand how to alleviate their symptoms according to recommendations in the relevant documents. This work proposes the use of high-level topic information extracted from consultation documents to improve the precision of retrieval results. The topic information adopted herein includes negative life events, depressive symptoms and semantic relations between symptoms, which are beneficial for better understanding of users' queries. Experimental results show that the proposed approach achieves higher precision than the word-based retrieval models, namely the vector space model (VSM) and Okapi model, which adopt word-level information alone. About the speaker: Liang-Chih Yu (http://www.isi.edu/~liangchi) is now a visiting student in the Information Sciences Institute (ISI) of the University of Southern California (USC). My host advisor is Dr. Eduard Hovy. I am also a PhD candidate in the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan. My advisor is Dr. Chung-Hsien Wu. My research interests include natural language processing, text mining, information retrieval, ontology construction, and spoken dialogue systems.
|
| 01 Jun 2007 | Andrew S. Gordon |
Generalizing Semantic Role Annotations Across Syntactically Similar Verbs
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: Large corpora of parsed sentences with semantic role labels (e.g. PropBank) provide training data for use in the creation of high-performance automatic semantic role labeling systems. Despite the size of these corpora, individual verbs (or rolesets) often have only a handful of instances in these corpora, and only a fraction of English verbs have even a single annotation. In this paper, we describe an approach for dealing with this sparse data problem, enabling accurate semantic role labeling for novel verbs (rolesets) with only a single training example. Our approach involves the identification of syntactically similar verbs found in PropBank, the alignment of arguments in their corresponding rolesets, and the use of their corresponding annotations in PropBank as surrogate training data. |
| 01 Jun 2007 | Jingbo Zhu |
Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: In this paper, we analyze the effect of resampling techniques, including under-sampling and over-sampling used in active learning for word sense disambiguation (WSD). Experimental results show that under-sampling causes negative effects on active learning, but over-sampling is a relatively good choice. To alleviate the within-class imbalance problem of over-sampling, we propose a bootstrap-based over-sampling (BootOS) method that works better than ordinary over-sampling in active learning for WSD. Finally, we investigate when to stop active learning, and adopt two strategies, max-confidence and min-error, as stopping conditions for active learning. According to experimental results, we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for stopping conditions. |
| 25 May 2007 | Wei Wang (Language Weaver) |
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: We show that phrase structures in Penn Treebank style parses are not optimal for syntax-based machine translation. We exploit a series of binarization methods to restructure the Penn Treebank style trees such that syntactified phrases smaller than Penn Treebank constituents can be acquired and exploited in translation. We find that employing the EM algorithm to determine the binarization of a parse tree among a set of alternative binarizations gives us the best translation result. |
| 18 May 2007 | Feng Pan |
Computing Semantic Similarity between Skill Statements for Approximate Matching
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (This will be an extended version of the talk for NAACL-HLT 2007. It's based on my summer internship work at IBM T.J. Watson Research Center last year.) The project aimed to address the problems encountered when trying to match available employees to open job positions, based on skill matches. Currently, job search applications, like IBM's Professional Marketplace, only find exact matches. A skill affinity computation is desired to allow searches to be expanded to related/similar skills, and return more potential matches. In this talk, I will explore the problem of computing text similarity between verb phrases describing skilled human behavior for the purpose of finding approximate matches. Four parsers (Charniak's parser, Stanford's parser, IBM XSG slot grammar parser, and Lin's MINIPAR) are evaluated on a corpus of skill statements extracted from an enterprise-wide expertise taxonomy. A similarity measure utilizing common semantic role features extracted from parse trees was found superior to an information-theoretic measure of similarity and comparable to the level of human agreement.
|
| 11 May 2007 | Steve DeNeefe |
What Can Syntax-based MT Learn from Phrase-based MT?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We compare and contrast the strengths and weaknesses of a syntax-based machine translation model with a phrase-based machine translation model on several levels. We briefly describe each model, highlighting points where they differ. We include a quantitative comparison of the phrase pairs that each model has to work with, as well as the reasons why some phrase pairs are not learned by the syntax-based model. We then propose improvements to the syntax-based extraction techniques to capture more phrases. We also compare the translation accuracy for all variations. |
| 04 May 2007 | Sheelagh Carpendale (Calgary) |
Information Visualization and Collaboration
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Consider Donald Norman's quote, "The power of the unaided mind is highly overrated. Without external aids, memory, thought, and reasoning are all constrained. But human intelligence is highly flexible and adaptive, superb at inventing procedures and objects that overcome its own limits. The real powers come from devising external aids that enhance cognitive abilities." (Norman, 1993) Common methods for externalization include making sketches on whatever happens to be handy -- paper napkins, program margins, etc. -- and/or finding a colleague or two to discuss the problem with. It would seem then, that visualization and collaboration are natural possibilities for creating positive cognitive aids. I will discuss our approach to developing interactive information visualizations both to support individuals and small groups of collaborators and briefly describe some of our recent results. About the speaker:
Sheelagh Carpendale holds a Canada Research Chair in Information
Visualization at the University of Calgary. Her research focuses on
the visualization, exploration and manipulation of information;
visualizing such topics as ecological dynamics, uncertainty in
information, social and communication information and investigating
the development of information visualization environments that support
collaboration. Dr. Carpendale's research in information visualization
and interaction design draws on her dual background in Computer
Science (BSc. and Ph.D. Simon Fraser University) and Visual Arts
(Sheridan College, School of Design and Emily Carr, College of Art).
|
| 20 Apr 2007 | Christopher Collins (Toronto) |
Information Visualization to Support Computational Linguistics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We present a survey of recent research into using information visualization to reveal new insights about linguistic data. Our recent work includes using WordNet hyponymy as a basis for document visualization and visualizing the uncertainty in machine translation in an instant messaging chat context. We will present our preliminary findings and prototype visualization for machine translation data resulting from a week of collaboration with ISI researchers. About the speaker: Christopher Collins is a PhD candidate in information visualization and computational linguistics at the University of Toronto. He works with Prof. Gerald Penn and Prof. Sheelagh Carpendale (University of Calgary).
|
| 30 Mar 2007 | Ido Dagan (Bar-Ilan U) |
Textual entailment as a framework for applied semantics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We have recently proposed Recognizing Textual Entailment (RTE) as a generic task that captures major semantic inferences across different natural language processing applications. The talk will first review the motivation and definition of the textual entailment task and the PASCAL RTE-1,2&3 Challenges benchmarks. Then we will demonstrate directions for building textual entailment systems, based on knowledge acquisition and inference, and for utilizing them within concrete applications. Furthermore, we suggest that textual entailment modeling may become a comprehensive framework for applied semantics research. Such a framework introduces useful variants of known semantic problems and highlights important tasks which have hardly been investigated so far at an applied computational level. The semantic modeling perspective will be illustrated in more detail by a case study for an entailment-based variant of word sense disambiguation. About the speaker:
Ido Dagan is a Senior Lecturer at the Department of Computer Science
at Bar Ilan University, Israel. His areas of interest are largely
within empirical NLP, particularly empirical approaches for applied
semantic processing. In the last few years Ido and his colleagues
introduced textual entailment as a generic framework for applied
semantic inference and have organized the first three rounds of the
PASCAL Recognizing Textual Entailment Challenges. Ido received his
Ph.D. from the Technion. He has been a research fellow at the IBM
Haifa Scientific Center and a Member of Technical Staff at AT&T Bell
Laboratories. During 1998-2003 he was co-founder and CTO of
FocusEngine and VP of Technology of LingoMotors.
|
| 23 Mar 2007 | Hermann Helbig (U at Hagen, Germany) |
Multilayered Extended Semantic Networks as a Knowledge Representation Paradigm and Interlingua for Meaning Representation
Time: 3:00 pm - 4:30 pm Location: 4 CR Abstract: The talk gives an overview of Multilayered Extended Semantic Networks (abbreviated MultiNet), which is one of the most comprehensively described knowledge representation paradigms used as a semantic interlingua in large-scale NLP applications and for linguistic investigations into the semantics and pragmatics of natural language. As with other semantic networks, concepts are represented in MultiNet by nodes, and relations between concepts are represented as arcs between these nodes. In addition, every node is classified according to a predefined conceptual ontology forming a hierarchy of sorts, and the nodes are embedded in a multidimensional space of layer attributes and their values. MultiNet provides a set of about 150 standardized relations and functions which are described in a very concise way including an axiomatic apparatus, where the axioms are classified according to predefined types. The representational means of MultiNet claim to fulfill the criteria of universality, homogeneity, and cognitive adequacy. The talk also shows how MultiNet can be used for the semantic representation of different semantic phenomena. To overcome the quantitative barrier in building large knowledge bases and semantically oriented computational lexica, MultiNet is associated with a set of tools including a semantic interpreter NatLink for automatically translating natural language expressions into MultiNet networks, a workbench LIA for the computer lexicographer, and a workbench MWR for the knowledge engineer for managing and graphically manipulating semantic networks. The applications of MultiNet as a semantic interlingua range from natural language interfaces to the Internet and to dedicated databases, over question-answering systems, to systems for automatic knowledge acquisition. About the speaker: Prof. Helbig is head of the chair Intelligent Information and Communication Systems at the University of Hagen, Germany. His main research areas are Knowledge Representation, Semantic Natural Language Processing, and Question-Answering.
|
| 09 Mar 2007 | Kevin Knight |
The Voynich Manuscript
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The medieval Voynich Manuscript has been called "the most mysterious document in the world". Its pages contain bizarre drawings of strange plants and astrological diagrams, as well as an undeciphered script of 20,000 running words, written in a character set that has never been seen elsewhere. Its origin is also controversial, with many theories abounding. I will describe the document, show samples, explain where it may have come from, and present some properties of the text.
This will be more of a history/mystery talk than
a computer science talk.
|
| 26 Jan 2007 | Gerald Penn (Toronto) |
The Quantitative Study of Writing Systems
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: If you understood all of the world's languages, you would still not be able to read many of the texts that you find on the world wide web, because they are written in non-Roman scripts -- often ones that have been arbitrarily encoded for electronic transmission in the absence of an accepted standard. This very modern nuisance reflects a dilemma as ancient as writing itself: the association between a language as it is spoken and its written form has a sort of internal logic to it that we can comprehend, but the conventions are different in every individual case --- even among languages that use the same script, or between scripts used by the same language. This conventional association between language and script, called a writing system, is indeed reminiscent of the Saussurean conception of language itself, a conventional association of meaning and sound, upon which modern linguistic theory is based. Despite linguists' reliance upon writing to present and preserve linguistic data, however, writing systems were a largely forgotten corner of linguistics until the 1960s, when Gelb presented their first classification.
This talk will describe recent work that aims to place the study of
writing systems upon a sound computational and statistical foundation.
While archaeological decipherment may eternally remain the holy grail
of this area of research, it also has applications to speech
synthesis, machine translation, and multilingual document retrieval.
|
| 12 Jan 2007 | Kevin Knight |
Capturing Natural Language Transformations
Time: 2:00 pm - 3:30 pm Location: 11 Large Abstract: Knowledge representation is hard. As natural language scientists and engineers, we'd like something that
- is expressive enough to capture how natural language works
- permits tractable inference
- admits learning algorithms for automatic knowledge acquisition
- leads to modular system construction
This talk will look at knowledge representation for capturing natural language transformations. A lot of what we do falls into this category. Examples of transformations include language translation (French to English), question answering (Question to Answer), transliteration (foreign script to Roman alphabet), summarization (long text to short text), parsing (string to tree), language generation (meaning to string), etc. I'll show various knowledge formats (starting with simple finite-state transducers) and show how they stack up on the 4 criteria above, using theorems and examples. We'll see that different types of tree and string automata lead to good behavior on various subsets of the 4 criteria, but getting 4 out of 4 is still elusive.
This is a Krazy Theory talk -- since this kind of talk should not go
on and on, I promise to finish within 50 minutes.
|
| 05 Jan 2007 | Beata Klebanov (Hebrew U) |
Experimental and Computational Investigation of Lexical Cohesion in Texts
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Lexical cohesion refers to structure created in a text by use of words with related meanings. Apart from its importance in theoretical and applied linguistics, lexical cohesion detection is used in NLP tasks like topic segmentation, extractive summarization, spelling correction, etc. However, the intuitive potential of lexical cohesion for such tasks is often not realized in practice, possibly due to shortcomings of detection algorithms. I will briefly describe an experiment with readers aimed at providing reliable data for a computational investigation of lexical cohesion. We then discuss a number of informative features for cohesion detection, drawing on sources like WordNet, distributional information, free associations, and the structure of information in the text itself. Finally, I report experiments with supervised learning of lexical cohesion. About the speaker:
Beata Beigman Klebanov is a PhD candidate at the Hebrew University of Jerusalem,
Israel, currently a visiting scholar at Northwestern University. Beata's
interests are in experimental, computational and applied research in text
pragmatics.
|
| 15 Dec 2006 | Jerry Hobbs |
When Will Computers Understand Shakespeare?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In this talk I will examine problems encountered in coming to some kind of understanding of one sonnet by Shakespeare (his 64th), ask what it would take to solve these problems computationally, and suggest routes to the solution. The general conclusion is that we are closer to this goal than one might think. Or are we? Bio:
Jerry Hobbs is famous primarily for having an office next to Kevin
Knight's and a parking space next to Ed Hovy's. He has read
everything of Shakespeare's that survives, including his will and
plays of dubious authorship. But that was all a long time ago.
|
| 14 Dec 2006 | Liang Huang (Penn) |
Faster Decoding with Synchronous Grammars and n-gram Language Models
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: A major obstacle in syntax-based machine translation is the prohibitively large search space for decoding with an integrated language model. We develop faster approaches for this problem based on lazy algorithms for k-best parsing. When compared against Chiang's technique of cube pruning, our method runs up to twice as fast without making more search errors or decreasing translation accuracy as measured by BLEU. We demonstrate the effectiveness of the algorithm on a large-scale translation system. Interestingly, these techniques can be applied to speed up bilexical parsing as well, where the (bi-)lexical probabilities can be viewed as n-gram probabilities that cause non-monotonicity. This method fits naturally into the coarse-to-fine grained multi-pass parsing schemes. To push this direction even further, we can generalize cube and lazy cube pruning as generic tools for reducing complicated search spaces, as alternatives to the well-known A* and annealing techniques.
This is joint work with David Chiang (ISI).
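The lazy k-best idea mentioned in the abstract can be illustrated with a minimal sketch, assuming the simplest possible setting: two lists of hypothesis costs, each already sorted, whose combinations are scored by plain addition. The function name `lazy_k_best` and the toy cost model are illustrative assumptions, not the speakers' implementation. With an integrated language model the combined costs are no longer monotone in the grid, which is the non-monotonicity the abstract refers to; this sketch covers only the monotone case.

```python
import heapq


def lazy_k_best(left_costs, right_costs, k):
    """Return the k cheapest combinations (cost, i, j) of two sorted cost lists.

    Both lists must be sorted in ascending order. Only the cells on the
    frontier of the (i, j) grid are ever scored, instead of all pairs.
    """
    if not left_costs or not right_costs or k <= 0:
        return []
    # The cheapest combination is always the top-left corner of the grid.
    frontier = [(left_costs[0] + right_costs[0], 0, 0)]
    seen = {(0, 0)}
    best = []
    while frontier and len(best) < k:
        cost, i, j = heapq.heappop(frontier)
        best.append((cost, i, j))
        # Only the two grid neighbors of a popped cell can become the next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            in_grid = ni < len(left_costs) and nj < len(right_costs)
            if in_grid and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (left_costs[ni] + right_costs[nj], ni, nj))
    return best
```

Because the frontier grows by at most two cells per popped item, the sketch touches far fewer than all pairs when k is small relative to the grid.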
|
| 27 Nov 2006 | Mark Hopkins (Potsdam) |
Towards the Effective Exploitation of Syntax in Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We discuss preliminary work on a possible approach to exploiting syntax in an effective way for machine translation. The driving guideline is to devise a machine translation system that can perform effectively, given a very limited quantity of parsed training data. |
| 17 Nov 2006 | David DeVault (Rutgers) |
Scorekeeping in an Uncertain Language Game
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Practical dialogue systems must exploit context to interpret user utterances correctly. Received views of context and coordination in pragmatic theory equate utterance context with the occurrent subjective states of interlocutors using notions like common knowledge or mutual belief. We argue that these views are not well suited for practical modeling due to the uncertainty and robustness of context dependence in human-human dialogue. We present an alternative characterization of utterance context as objective and normative. On this view, an interlocutor's representation of context reflects private uncertainty about the true objective context as determined by prior speaker meanings. As conversation moves forward, new utterances provide interlocutors with retrospective insight about each other's prior meanings and therefore about what the true context really is. This view reconciles the need for uncertainty with received intuitions about coordination, and can directly inform computational approaches to dialogue. Joint work with Matthew Stone, Rutgers and Rich Thomason, Michigan About the Speaker:
David DeVault is a Ph.D. candidate in the Department of Computer
Science at Rutgers University. He holds a B.S. in Engineering and
Applied Science from the California Institute of Technology and an
M.A. in Philosophy from Rutgers University. David's research aims to
develop techniques to allow computational agents to participate in
flexible task-oriented conversations with human beings. His recent
work has drawn on design challenges encountered in building such an
agent to try to articulate practical, learnable, and theoretically
satisfying representations of context, utterance meaning, and speaker
intention for implemented conversational systems.
|
| 03 Nov 2006 | Jens-Soenke Voeckler |
perl part 2 - advanced magick
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: Since part 1 of the Perl tutorial didn't cover the juicy bits (like a unique function in Perl), based on feedback from participants, I am offering a part 2 "Perl - Advanced Magick" covering: o the slides from roughly page 40 - The Schwartzian Transform - Dissecting a program o What to do, if you do need popen or backticks? o OO Perl - a start o C embedding - definitely only a "start here" o Useful recipes, e.g. interpolating variables in configuration scripts from Perl values.
If there is something you are especially interested in seeing, please
send me an email
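For readers unfamiliar with the Schwartzian Transform listed above, here is a minimal sketch of the underlying decorate-sort-undecorate pattern. It is written in Python rather than Perl purely for illustration, and the helper name `sort_by_expensive_key` is a hypothetical choice, not something taken from the tutorial slides.

```python
def sort_by_expensive_key(items, key_func):
    """Sort items by key_func while calling key_func only once per item."""
    # Decorate: pair each item with its precomputed key. The index breaks
    # ties, so the items themselves are never compared with each other.
    decorated = [(key_func(item), index, item) for index, item in enumerate(items)]
    # Sort: ordinary tuple comparison on (key, index).
    decorated.sort()
    # Undecorate: strip the keys and indices again.
    return [item for _, _, item in decorated]
```

In Python the same effect is normally obtained with `sorted(items, key=key_func)`; the explicit three-step form is shown only because it mirrors the map-sort-map structure of the Perl idiom.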
|
| 23 Oct 2006 | Jens-Soenke Voeckler |
perl - how to use it, not abuse it
Time: 12:00 pm - 1:30 pm Location: 11 Large Abstract: If you speak a little perl, are an occasional perl-scripter, and would like to know more about how to use it as a (p)ortable, (e)fficient, and (r)eadable (l)anguage, you may be interested in my brown bag (read: bring your own) lunch seminar:
I will talk about using Perl in a portable fashion, the environment
it is run in, and how to avoid common mistakes and misconceptions. Perl
offers more than a thousand ways to solve a problem, but some are
more portable or more efficient than others. If time permits, simple
hands-on examples can be tried out during the talk, so power for
laptops will be provided.
|
| 29 Sep 2006 | Ashish Venugopal (CMU) |
Delayed LM Intersection and Left-to-Right N-Best Extraction for Syntax-Based MT
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We begin by describing a set of pruning constraints that are applied in the literature to effectively restrict the search space of synchronous PCFGs intersected with target language model contexts. We apply these constraints to non-binarized grammars with a large number of non-terminals and demonstrate effective parsing within the framework of Wu, 97. We then present a novel parsing approach that avoids language model context intersection during parsing in favor of language model driven n-best list extraction. The parsing step produces a sentence spanning parse forest which is explored in left-to-right target order by the N-Best extraction method. This method avoids lossy pruning during the parsing process, searching a much larger effective parse space than practically possible in the full intersection scenario, and has the important benefit of allowing integration of a high-order language model within the N-Best search process, rather than only in parse re-scoring. We demonstrate the impact of this parsing approach using the SPCFG approach described in Zollmann, Venugopal, Vogel 06, which is similar to Galley et al., 04, and compare performance against full intersection. This is joint work with Andreas Zollmann. About the Speaker: Ashish Venugopal is a Ph.D. candidate at the Language Technologies Institute at Carnegie Mellon University, and holds B.S. (SCS, Univ. Honors) and M.S. degrees from the same institution. He is a Siebel Scholar and has received the annual Graduate Student Teaching Award at Carnegie Mellon. His research focus is on syntax augmented machine translation.
|
| 22 Sep 2006 | Eduard Hovy |
Toward a 'Science' of Annotation: Experiences from OntoNotes
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As machine learning algorithms and their application for NLP become better understood, attention turns toward the production of annotated corpora to which they can be applied. Numerous phenomena present themselves for annotation, including aspects in lexical semantics, discourse, pragmatics, and dialogue. But several questions immediately must be answered: 1. How does one obtain a balanced corpus to annotate? What is a balanced corpus? 2. How does one decide which aspects to annotate? How does one adequately express the theory behind the phenomena in simple annotation steps? 3. Which annotators does one hire? How does one ensure that they are adequately trained? 4. How does one establish a simple, fast, and trustworthy annotation procedure? What interfaces does one build? How does one ensure that the interfaces do not affect the annotation results? 5. How does one evaluate the results? What are the appropriate agreement measures? At which cutoff points should one re-do the annotations? How does one ensure improvement? 6. How should one formulate and store the results? How does one ensure compatibility with other existing resources? How does one make results available for best impact? 7. How does one report the annotation effort and results? How does one actually get a paper on this work published at an important conference? What should the paper contain?
Despite their being so basic, there is almost no established procedure
or standard set of answers to these questions today. In this talk I
discuss some of these aspects, pointing to the lessons learned in the
ongoing OntoNotes project (joint with BBN, the University of Colorado
(PropBank), the University of Pennsylvania (Treebank), and ISI).
|
| 25 Aug 2006 | Jason Riesa |
Minimally Supervised Morphological Segmentation with Applications to Machine Translation
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: Inflected languages in a low-resource setting present a data sparsity problem for statistical machine translation. In this work, we present a minimally supervised algorithm for morpheme segmentation on Arabic dialects which reduces unknown words at translation time by over 50%, total vocabulary size by over 40%, and yields a significant increase in BLEU score over a previous state-of-the-art phrase-based statistical MT system. |
| 25 Aug 2006 | Victoria Fossum (Michigan) |
Improving Precision of Word Alignments Using GHKM Syntax-Based Rule Extraction
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: Noisy word alignments negatively affect the quality of the translation rules extracted by the ISI syntax-based MT system. In the literature, alignment is typically treated as a separate process from subsequent stages in the MT pipeline. By contrast, we allow rule extraction to guide the alignment process. We present an unsupervised algorithm for identifying and removing "bad" links using GHKM syntax-based rule extraction. We show that we can improve upon the precision of GIZA union (measured against a gold standard set of manually aligned Chinese-English sentence pairs), while only decreasing recall slightly.
(Note: This is part of the Summer Intern Series)
|
| 23 Aug 2006 | Joseph Turian (NYU) |
Speeding-up Syntax-based Decoding
Time: 3:30 pm - 4:00 pm Location: 11 Large Abstract: TBA
(Note: This is part of the Summer Intern Series)
|
| 23 Aug 2006 | Oana-Diana Postolache |
Towards combining Searn and Syntax-Based Machine Translation (SBMT)
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: This talk is about modeling the Syntax-Based Machine Translation (SBMT) problem within the Searn (Search & Learn) framework developed by Hal Daume in his PhD thesis. I will present the way we define the states, actions and the search space and how to implement the cost function.
(Note: This is part of the Summer Intern Series)
|
| 18 Aug 2006 | Chenhai Xi |
Name Entity Transliteration Discovery from Large Bilingual Comparable Corpora
Time: 3:00 pm - 3:30 pm Location: 11 Large Abstract: In this summer project, we investigate a scalable method to extract Chinese-English name transliterations from large comparable corpora, which consist of texts in two languages discussing the same or similar topics. We show that the bigram Jaccard coefficient is a good similarity measure for comparing English and Chinese names at the Chinese pronunciation (Pinyin) level. Based on this phonetic similarity score, an efficient randomized algorithm is then used to find name pair candidates from English and Chinese lists. Finally, context information, such as dates, frequency, place and titles, is combined with the phonetic similarity to improve the accuracy of the name pairs list.
(Note: This is part of the Summer Intern Series)
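As a minimal sketch of the similarity score named in the abstract, the bigram Jaccard coefficient of two strings can be computed as the size of the intersection of their character-bigram sets divided by the size of the union. The function names below, and the choice to compare raw lowercase strings, are assumptions for illustration; the abstract does not specify how names are normalized before comparison.

```python
def char_bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| over character bigrams."""
    grams_a, grams_b = char_bigrams(a), char_bigrams(b)
    union = grams_a | grams_b
    if not union:
        # Both strings are shorter than two characters: no evidence either way.
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

For example, "abcd" and "bcde" share two of four distinct bigrams, so `bigram_jaccard("abcd", "bcde")` evaluates to 0.5. In the setting of the talk, one argument would be an English name and the other the Pinyin rendering of a Chinese name.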
|
| 11 Aug 2006 | Idan Szpektor (Bar-Ilan U) |
Textual Entailment: Framework, Learning and Applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Textual Entailment has been proposed recently as a generic framework for modeling semantic variability in many Natural Language Processing applications, such as Question Answering, Information Extraction, Information Retrieval and Document Summarization. The Textual Entailment relationship holds between two text fragments, termed text and hypothesis, if the truth of the hypothesis can be inferred from the text. In this talk, the Textual Entailment framework will be introduced. I'll then present an algorithm for large-scale Web-based acquisition of entailment rules, a type of knowledge needed for robust inference. Finally, I will present an unsupervised Relation Extraction approach based on the Textual Entailment framework. About the speaker: Idan Szpektor is a PhD student under the supervision of Dr. Ido Dagan at Bar Ilan University, Israel. His current research activity is in acquisition of knowledge for textual entailment.
|
| 04 Aug 2006 | Shou-de Lin |
Ph.D. defense practice talk
Time: 3:30 pm - 4:30 pm Location: 11 Large Abstract: This is a practice talk for my Ph.D. defense, which will be held on Aug 24th 3-5pm, SAL 322. An important problem in the area of homeland security and fraud detection is to identify abnormal entities in large datasets. Although there are methods from knowledge discovery and data mining focusing on finding anomalies in numerical datasets, there has been little work aimed at discovering abnormal or suspicious instances in large and complex semantic graphs whose nodes are richly connected with many different types of links. In this talk, I will describe a novel, domain-independent and unsupervised framework to identify such instances. Besides discovering suspicious instances, we believe that to complete the discovery process and to deal with the "curse of false positives", a system has to convince the users by providing explanations for its findings. Therefore, in the second part of the talk I will describe an explanation mechanism to automatically generate human-understandable explanations for the discovered results. Experimental results show that our discovery system outperforms state-of-the-art unsupervised network algorithms used to analyze the 9/11 terrorist network by a large margin. Additionally, a human study we conducted demonstrates that our explanation system, which provides natural language explanations for its findings, allowed human subjects to perform complex data analysis in a much more efficient and accurate manner.
|
| 28 Jul 2006 | Qin Iris Wang (Alberta) |
Improved Large Margin Dependency Parsing via Local Constraints and Laplacian Regularization
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk is about an improved approach for learning dependency parsers from treebank data. Our technique is based on two ideas for improving large margin training in the context of dependency parsing. First, we incorporate local constraints that enforce the correctness of each individual link, rather than just scoring the global parse tree. Second, to cope with sparse data, we smooth the lexical parameters according to their underlying word similarities using Laplacian Regularization. To demonstrate the benefits of our approach, we consider the problem of parsing Chinese treebank data using only lexical features, that is, without part-of-speech tags or grammatical categories. We achieve state of the art performance, improving upon current large margin approaches. Here is the link for the paper: http://www.cs.ualberta.ca/~wqin/papers/depar_margin_conll06.pdf About the speaker:
Qin Iris Wang is a Ph.D. student from the University of Alberta,
working with Dekang Lin and Dale Schuurmans. Her research interests
are in natural language processing and machine learning. Specifically,
she has been working on dependency parsing using both generative and
discriminative methods.
|
| 11 Jul 2006 | Dragos Munteanu + Joseph Turian |
Practice Talks for ACL
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora Dragos Munteanu We present a novel method for extracting parallel sub-sentential fragments from comparable bilingual corpora. Currently, the state of the art in comparable corpus mining is only able to extract full sentence pairs which are judged to be parallel. We advance the state of the art by showing how to obtain useful data even from not-fully-parallel sentences. By analyzing sentence pairs using a signal-processing-inspired approach, we detect which segments of the source sentence are translated into segments of the target sentence, and which are not. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art machine translation system.
Advances in Discriminative Parsing Joseph Turian
The present work advances the accuracy and training speed of
discriminative parsing. Our discriminative parsing method has no
generative component, yet surpasses a generative baseline on constituent
parsing, and does so with minimal linguistic cleverness. Our model can
incorporate arbitrary features of the input and parse state, and performs
feature selection incrementally over an exponential feature space during
training. We demonstrate the flexibility of our approach by testing it
with several parsing strategies and various feature sets.
|
| 30 Jun 2006 | David Chiang and Kevin Knight |
Synchronous Grammars and Tree Transducers
Time: 2:00 pm - 5:00 pm Location: 11 Large Abstract: (Practice tutorial for ACL/COLING 2006) Once upon a time, synchronous grammars and tree transducers were esoteric topics in formal language theory, far removed from the practice of building real, large-scale natural language systems. However, these tools are now rapidly becoming essential for modeling machine translation and other complex language transformations. It has therefore become practical and important to understand the basic properties of tree transformation systems, which we cover in this tutorial.
|
| 23 Jun 2006 | Joseph Turian (NYU) |
Discriminative Training for Large-Scale NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Parsing and translating natural languages can be viewed as structured-prediction problems. We outline the crucial design decisions that must be made to build a machine to solve structured prediction problems, and explain our particular choices for these two large-scale NLP problems. Our approach uses a purely discriminative learning method that scales up well to problems of this size. Unlike currently popular methods, this one does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Accuracy on constituent parsing was at least as good as other comparable methods. To our knowledge, it is the first purely discriminative learning algorithm for translation with tree-structured models. Experiments demonstrate the method's versatility, accuracy, and efficiency.
|
| 26 May 2006 | Radu Soricut and Hal Daume III |
Defense Practice Talks: Generation and Learning
Time: 3:00 pm - 5:00 pm Location: 11 Large Abstract: These are two practice talks for our upcoming thesis defenses. The titles and abstracts are: -------------------------------------------------------------------------- NATURAL LANGUAGE GENERATION FOR TEXT-TO-TEXT APPLICATIONS USING AN INFORMATION-SLIM REPRESENTATION Radu Soricut In this talk, I describe a new natural language generation paradigm, based on direct transformation of textual information into well-formed textual output. I support this language generation paradigm with theoretical contributions in the field of formal languages, new algorithms, empirical results, and software implementations. At the core of this work is a novel representation formalism for probability distributions over finite languages. Due to its convenient representation and computational properties, this formalism supports a wide range of language generation needs, from sentence realization to text planning. Based on this formalism, I describe, implement, and analyze theoretically a family of algorithms that perform language generation using direct transformations of text. These algorithms use stochastic models of language to drive the generation process. I perform extensive empirical evaluations using my implementation of these algorithms. These evaluations show state-of-the-art performance in automatic translation, and significant improvements in state-of-the-art performance in abstractive headline generation and coherent discourse generation.
-------------------------------------------------------------------------- PRACTICAL STRUCTURED LEARNING FOR NATURAL LANGUAGE PROCESSING Hal Daume III
Natural language processing is replete with problems whose outputs are
highly complex and structured. The current state-of-the-art in machine
learning is not yet sufficiently general to be applied to general problems
in NLP. In this thesis, I present Searn (for "search" + "learn"), an
approach to learning for structured outputs that is applicable to the wide
variety of problems encountered in natural language. Searn operates by
transforming structured prediction problems into a collection of
classification problems, to which any standard binary classifier may be
applied. From a theoretical perspective, Searn satisfies a strong
fundamental performance guarantee: given a good classification algorithm,
Searn yields a good structured prediction algorithm. To demonstrate
Searn's general applicability, I present applications in such diverse
areas as automatic document summarization and entity detection and
tracking. In these applications, Searn is empirically shown to achieve
state-of-the-art performance.
|
| 24 May 2006 | Hal Daume III |
Beyond EM: Bayesian Techniques for Human Language Technology Researchers
Time: 9:00 am - 12:00 pm Location: 4th Floor Abstract: This is a practice tutorial for one I am giving at HLT/NAACL one week later. Comments/feedback are very welcome. ---------------------------------------------------------------------- Expectation Maximization (EM) has proved to be a great and useful technique for unsupervised learning problems in speech and language processing. Unfortunately, its range of applications is limited either by intractable E- or M-steps, or by its reliance on the maximum likelihood estimator. The natural language processing community typically resorts to ad-hoc approximation methods to get (some reduced form of) EM to apply to NLP tasks. However, many of the problems that plague EM can be solved with Bayesian methods, which are theoretically better justified. In this tutorial, I discuss Bayesian methods as they can be used in natural language processing. The two primary foci of this tutorial are specifying prior distributions and performing the necessary computations to perform inference in Bayesian models. I focus on unsupervised techniques (for which EM is the obvious choice), but discuss supervised and discriminative techniques at the conclusion with pointers to relevant literature. Depending on one's inference technique of choice, the math required to build Bayesian learning models can be difficult. Compounding this problem is the fact that current written tutorials on Bayesian techniques tend to focus on continuous-valued problems, a poor match for the high-dimensional discrete world of text. This combination makes the cost of entrance to the Bayesian learning literature often too high. The goal of this tutorial is to provide sufficient motivation, intuition and vocabulary mapping so that one can easily understand recent papers in Bayesian learning that are published at conferences like NIPS, and increasingly at ACL.
In addition to the standard tutorial materials (slides), this tutorial is accompanied by a technical report that spells out all the mathematical derivations in great detail, for those who wish to start research projects in this field.
This tutorial should be accessible to anyone with a basic understanding of
statistics. I use a query-focused summarization task as a motivating
running example for the tutorial, which should be of interest to
researchers in natural language processing and in information retrieval.
Additionally, though the tutorial does not focus on speech problems, those
attendees interested in graphical modeling techniques for automatic speech
recognition might also find the tutorial of interest.
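The contrast drawn above between the maximum likelihood estimator and a Bayesian estimate with an explicit prior can be made concrete with a small sketch, assuming the simplest discrete case: word probabilities estimated from counts, with a symmetric Dirichlet prior. The function names and the default `alpha` value are illustrative assumptions and are not taken from the tutorial materials.

```python
def maximum_likelihood(counts, vocabulary):
    """Relative-frequency estimate: unseen words get probability zero."""
    total = sum(counts.get(word, 0) for word in vocabulary)
    return {word: counts.get(word, 0) / total for word in vocabulary}


def dirichlet_posterior_mean(counts, vocabulary, alpha=1.0):
    """Posterior mean of a multinomial under a symmetric Dirichlet(alpha) prior.

    Each word receives alpha pseudo-counts, so no word is assigned zero mass.
    """
    total = sum(counts.get(word, 0) for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {word: (counts.get(word, 0) + alpha) / denominator for word in vocabulary}
```

With counts of 3 for "a" and 1 for "b" over the vocabulary "a", "b", "c", the maximum likelihood estimate gives "c" probability 0, while the posterior mean with `alpha=1.0` gives it 1/7.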
|
| 19 May 2006 | Patrick Pantel |
Espresso: Making Use of Generic Patterns for Mining Relations from Small and Large Corpora
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In the past decade, researchers have explored many approaches to automatically extract large collections of knowledge from text. In this talk, we present Espresso, a weakly-supervised, general-purpose, and broad-coverage algorithm for harvesting binary semantic relations. The main contributions are: i) a method for exploiting generic patterns by filtering incorrect instances using the Web; and ii) a principled measure of pattern and instance reliability enabling the filtering algorithm. We present an empirical comparison of Espresso with various state of the art systems, on different size and genre corpora, on extracting various general and specific relations. Experimental results show that our exploitation of generic patterns substantially increases system recall with small effect on overall precision.
|
| 12 May 2006 | Nick Mote and Donghui Feng |
Pedagogical Contextualization of Language Learner Speech Errors AND Learning to Detect Conversation Focus of Threaded Discussions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: These are two practice talks. ----------------------------------------------------------------------------- FIRST TALK: The traditional approach to diagnosing learner speech errors in Computer Aided Language Learning is to create a linguistic profile of the learner/user. We, however, propose that work must also be done to model the linguistic profile of a typical native listener. Not all errors in second language learner speech are created equal. Different errors sound more "severe" or "harsh" to native speaker ears and should therefore be treated with more emphasis in pedagogical interaction. The Tactical Language Training System (TLTS) is a speech-enabled virtual-reality based computer learning environment designed to teach Arabic spoken communication to American English speakers. This talk addresses the ways the TLTS contextualizes non-native speech errors, and how this contextualization fits in the corrective exchanges between a non-native learner and a pedagogical agent built to model a native listener. The pedagogical system used in TLTS includes: * Automatic Speech Recognition (ASR) models which are built on a combination of both annotated and unannotated non-native speech with native speech data. * A stochastic generative model for errors in learner speech that creates mispronunciation grammars for the ASR * Reweighting of system-perceived mispronunciation severity based on aggregate native speaker judgements of quality pronunciation and intelligibility. * Contextualization of feedback based on lexical and phonetic inventories of the native and non-native languages.
----------------------------------------------------------------------------- SECOND TALK:
We present a novel feature-enriched approach that learns to detect the
conversation focus of threaded discussions by combining NLP analysis and
IR techniques. Using the graph-based algorithm HITS, we integrate
different features such as lexical similarity, poster trustworthiness, and
speech act analysis of human conversations with feature-oriented link
generation functions. It is the first quantitative study to analyze human
conversation focus in the context of online discussions that takes into
account heterogeneous sources of evidence. Experimental results using a
threaded discussion corpus from an undergraduate class show that it
achieves significant performance improvements compared with the baseline
system.
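For reference, the HITS algorithm named in the second abstract can be sketched as follows, under the assumption that each link between posts carries a single non-negative weight. How the talk derives such weights from lexical similarity, poster trustworthiness, or speech acts is not specified here, so the `weight` values, the `hits` function name, and the fixed iteration count are hypothetical.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {node: value / norm for node, value in scores.items()}


def hits(weighted_edges, iterations=50):
    """Run weighted HITS on (source, target, weight) edges.

    Returns two dicts: hub scores and authority scores per node.
    """
    nodes = {node for src, dst, _ in weighted_edges for node in (src, dst)}
    hub = dict.fromkeys(nodes, 1.0)
    authority = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes that link in.
        new_authority = dict.fromkeys(nodes, 0.0)
        for src, dst, weight in weighted_edges:
            new_authority[dst] += weight * hub[src]
        authority = _normalize(new_authority)
        # Hub update: sum the authority scores of nodes that are linked to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst, weight in weighted_edges:
            new_hub[src] += weight * authority[dst]
        hub = _normalize(new_hub)
    return hub, authority
```

In a threaded discussion, one plausible reading is that a reply contributes an edge from the replying post to the post it answers, so that a post with high authority is a candidate for the conversation focus.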
|
| 05 May 2006 | Namhee Kwon |
Recognizing Argument Structures in Texts
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I present our approach to identifying an argument structure, defined as a simple hierarchical structure of claim and reasons. The claim is also classified as "in favor of" or "against" the topic. The experiment is performed on comments from the general public sent to government officials in response to proposed regulations.
|
| 28 Apr 2006 | Feng Pan |
Learning Event Durations from Event Descriptions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Extracting event duration information from texts is potentially very important in applications in which the time course of events is to be extracted from news. For example, whether two events overlap or are in sequence often depends very much on their durations. If a war started yesterday, we can be pretty sure it is still going on today. If a hurricane started last year, we can be sure it is over by now. In the talk, I will first present our work on constructing an annotated corpus for extracting information about the typical durations of events from texts, including the annotation guidelines, the event classes we categorized, the way we use normal distributions to model such vague and implicit temporal information, and how we evaluate inter-annotator agreement. I will then show that machine learning techniques applied to this data yield coarse-grained event duration information, considerably outperforming a baseline and approaching human performance. At the beginning of the talk, I will also give a brief overview of the time ontology (OWL-Time, formerly DAML-Time) we have developed, which is represented in both first-order logic and the OWL web ontology language.
|
| 21 Apr 2006 | Soo-Min Kim |
Identifying and Analyzing Judgment Opinions
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: In this talk, we introduce a methodology for analyzing judgment opinions. We define a judgment opinion as consisting of a valence, a holder, and a topic. We decompose the task of opinion analysis into four parts: 1) recognizing the opinion; 2) identifying the valence; 3) identifying the holder; and 4) identifying the topic. We evaluate our methodology using both intrinsic and extrinsic measures. |
| 14 Apr 2006 | Radu Soricut |
Natural Language Generation for Text-to-Text Applications using an Information-Slim Representation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Although a considerable number of generic Natural Language Generation (NLG) systems have been produced over the years, none of them is usually employed in end-to-end, text-to-text applications such as Machine Translation, Summarization, Question Answering, etc. In this talk, we identify the likely reasons for this state of affairs, and propose WIDL-expressions as a flexible formalism that facilitates the integration of a generic NLG engine within end-to-end language processing applications. WIDL-expressions compactly represent probability distributions over finite sets of candidate realizations, and have optimal algorithms for text realization via interpolation with language model probability distributions. We show the effectiveness of our WIDL-based NLG engine for both sentence realization and document realization tasks. By employing language models that capture sentence-level properties, we perform Machine Translation and Headline Generation at state-of-the-art levels or better. By employing language models that capture document-level properties such as text coherence, we synthesize output for Multi-document Summarization that displays both high content selection performance and increased coherence.
|
| 24 Mar 2006 | Dragos Munteanu |
Automatic creation of parallel corpora
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Parallel texts -- texts that are translations of each other -- are an important resource in many cross-lingual NLP applications, such as lexical acquisition, cross-language IR, and annotation projection. However, their importance is paramount for Statistical Machine Translation (SMT), as they provide the training data from which all the translation knowledge is learned. The state of the art in SMT is advanced enough that, given sufficient parallel data (i.e. a few million words) for any language pair in a given domain, a generic SMT system trained on it will achieve a reasonable translation performance in that domain. The main reason why SMT systems exist only for a handful of languages is that, for most language pairs, parallel training data is simply not available. One way to alleviate this lack of parallel data is to exploit a much richer and more diverse resource: comparable corpora, texts which are not strictly parallel but related. The prototypical example of comparable texts is a pair of news articles in different languages that report on the same event. I will present methods for automatic extraction of parallel data from such corpora. I will show how to detect parallel data at various levels of granularity: parallel documents, parallel sentences, and even parallel sub-sentence fragments. The parallel corpora obtained using these methods help improve translation performance for both resource-scarce language pairs (such as Romanian-English) and resource-rich ones (such as Arabic-English).
|
| 17 Mar 2006 | Jon May |
Tiburon: A Finite State Tree Automata Toolkit
Time: 3:00 pm - 4:30 pm Location: 4th Floor Abstract: In the 1990s, researchers applied their new developments in transducer theory using widely available easy-to-use toolkits for string transducers, and made well-known advances in parsing, machine translation, and other areas. Rapid prototyping via software such as the AT&T toolkit and carmel was useful for proofs of concept and in many cases led to unforeseen developments in novel areas. In the current NLP research environment, tree-based strategies and new models have shown promising results in advancing the state of the art, and recent developments in weighted tree automata theory are enriching the bedrock created 40 years ago, but as yet there is no toolkit available with the necessary capabilities to turn promise into solution.
Tiburon is the first probabilistic tree transducer toolkit. Similar in form
and function to the string-based toolkits of yesteryear, it is designed to
be easy to use, with simple but expressive definitions of tree automata
and a concise set of vital operations that can be used to construct many
useful tree-based NLP projects. Although a work in progress, Tiburon is
already a usable tool with active users between the ages of 6 and 41. I
will describe the current status of the system, demonstrate ease of use
and potential power, and discuss the challenges ahead.
|
| 10 Mar 2006 | Mark Hopkins |
Exploring the Potential of Intractable Parsers
Time: 3:00 pm - 4:30 pm Location: 10th Floor Abstract: We revisit the idea of history-based parsing, and present a history-based parsing framework that strives to be simple, general, and flexible. We also provide a decoder for this probability model that is linear-space, optimal, and anytime. A parser based on this framework, when evaluated on Section 23 of the Penn Treebank, compares favorably with other state-of-the-art approaches, in terms of both accuracy and speed.
|
| 03 Mar 2006 | Liang Huang (Penn) |
Syntax-Directed Translation with Extended Domain of Locality
Time: 3:00 pm - 4:30 pm Location: 11th Floor (Large) Abstract: (note: this is a very tentative title -- comments welcome!) We present a novel extension of syntax-directed translation for statistical MT. Formally speaking, our model is based on tree-to-string transducers that recursively convert a parse-tree in the source-language into a string in the target-language. These transduction rules have multi-level trees on the source-side, giving this system more transformational power due to the extended domain of locality. We also present efficient algorithms for decoding based on dynamic programming. Initial experiments on English-to-Chinese translation show promising results in both speed and translation quality. Joint work with Kevin Knight and Aravind Joshi. Bio:
Liang Huang is a 3rd-year PhD student from the University of Pennsylvania.
He is mainly interested in algorithms and formalisms for parsing and
syntax-based machine translation. His recent work has been on k-best
parsing algorithms (with David Chiang) and synchronous binarization for MT
(with Hao Zhang, Dan Gildea, and Kevin Knight).
|
| 24 Feb 2006 | Hal Daume III |
Search-based Structured Prediction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I present an algorithm, Searn (for "search-learn"), that is designed to solve structured prediction problems: problems whose goal is to learn to predict complex objects such as parts-of-speech, parse trees, translations, etc. Searn functions by "breaking apart" structured prediction problems into classification problems in the process of search. I analyze Searn in the framework of learning reductions and show that good performance on the underlying classification problems implies good search performance. Moreover, Searn is computationally efficient in a superset of the settings where previous algorithms are efficient and is not limited by conditional independence assumptions (as in CRFs). This excessively simple and general algorithm turns out to have excellent state-of-the-art performance.
This is joint work with John Langford (TTI-C) and Daniel Marcu; and, to a
lesser extent, with Drew Bagnell (CMU) and Bianca Zadrozny (IBM TJ
Watson).
|
| 10 Feb 2006 | David Chiang |
Parsing Arabic Dialects
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The Arabic language exhibits diglossia, i.e., the coexistence of two forms of language, a variety with standard orthography and sociopolitical clout which is not natively spoken by anyone (Modern Standard Arabic, MSA) and varieties that are primarily spoken and lack writing standards (Arabic dialects). There are important resources currently available for MSA with much on-going NLP work; for example, there is an Arabic Treebank and several syntactic parsers for MSA. However, Arabic dialect resources and NLP research are still at an infancy stage. I will present work done at the Johns Hopkins CLSP Summer Workshop on parsing of Arabic dialects, in particular, Levantine Arabic. We have experimented with three approaches to leveraging MSA resources to create a parser for Levantine Arabic, as well as methods for induction of MSA-Levantine translation lexicons and a Levantine part-of-speech tagger. Using these methods we obtain error reductions of up to 15% compared with applying an MSA parser directly to Levantine text. Rambow et al. Parsing Arabic Dialects: Final Report. Johns Hopkins University Center for Language and Speech Processing Workshop 2005. http://www.clsp.jhu.edu/ws2005/groups/arabic/documents/finalreport.pdf Chiang et al. Parsing Arabic Dialects. To appear in Proc. EACL 2006.
This is joint work with O. Rambow, M. Diab, N. Habash, R. Hwa, K. Sima'an,
V. Lacey, R. Levy, C. Nichols and S. Shareef.
|
| 03 Feb 2006 | Alex Fraser |
Measuring Word Alignment Quality for Statistical Machine Translation
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Automatic word alignment plays a critical role in statistical machine translation. Unfortunately the relationship between alignment quality and statistical machine translation performance has not been well understood. In the recent literature the alignment task has frequently been decoupled from the translation task, and assumptions have been made about measuring alignment quality for machine translation which, it turns out, are not justified. In particular, none of the tens of papers published over the last five years has shown that significant decreases in Alignment Error Rate (AER) result in significant increases in translation quality. I will explain this state of affairs and present steps towards measuring alignment quality in a way which is predictive of statistical machine translation quality. I will also provide a brief overview of some of my other work on training and search for word alignment.
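For readers unfamiliar with the metric, Alignment Error Rate is conventionally computed against a gold standard whose links are split into sure links S and possible links P (with S contained in P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the hypothesized alignment. The following is a minimal Python sketch of that standard definition; the function name and the toy link sets are illustrative assumptions, not material from the talk.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Links are (source_index, target_index) pairs. By convention the
    sure links are also counted among the possible links.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy data (hypothetical): two sure links, one possible link,
# and a hypothesis that finds one of each plus one spurious link.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
hypothesis_links = {(0, 0), (2, 1), (2, 2)}
aer = alignment_error_rate(hypothesis_links, sure_links, possible_links)
```

A lower AER means closer agreement with the gold alignment; the abstract's point is that lower AER has not been shown to translate into better translation quality.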
|
| 27 Jan 2006 | John Conroy |
Multi-Document Summary Space: What do People Agree is Important?
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: A multi-document summary gives the "gist" of what is contained in a collection of related documents. But how can we define a "gist?" We explore this question by analyzing human written summaries for clusters of document sets. In particular, we estimate the probability that a word will be chosen by a human to be included in a summary. We demonstrate that if this probability model were given by an oracle, then a simple automatic method of summarization can produce extract summaries which are statistically indistinguishable from the human summaries. About the Speaker: John M. Conroy received a B.S. in Mathematics from Saint Joseph's University in 1980 and a Ph.D. in Applied Mathematics from the University of Maryland in 1986. Since then he has been a research staff member for the IDA Center for Computing Sciences in Bowie, MD. His research interest is applications of numerical linear algebra and statistics. He is a member of the Society for Industrial and Applied Mathematics, Institute of Electrical and Electronics Engineers (IEEE), and the Association for Computational Linguistics.
|
| 26 Jan 2006 | Tim Chklovski |
GrainPile: Deriving Quantitative Overviews of Free Text Assessments on the Web
Time: 1:00 pm - 2:00 pm Location: 4th floor Abstract: Many research efforts are addressing the problem of enabling automatic summarization of opinions and assessments stated on the web in product reviews, discussion forums, and blogs. One key difficulty is that relevant assessments scattered throughout web pages are obscured by variations in natural language. In this paper, we focus on a novel aspect of enabling aggregations of assessments of the degree to which a given property holds for a given entity (for instance, how touristy is Boston). We present GrainPile, a user interface for extracting from the web, aggregating and quantifying degree assessments of unconstrained topics. The interface provides a variety of functions: a) identification of dimensions of comparison (properties) relevant to a particular entity or set of entities, b) comparisons of like entities on user-specified properties (for example, which university is more prestigious, Yale or Cornell), c) tracing the derived opinions back to their sources (so that the reasons for the opinions can be found). A central contribution in GrainPile is the evaluated demonstration of the feasibility of mapping the recognized expressions (such as fairly, very, extremely, and so on) to a common scale of numerical values and aggregating across all the extracted assessments to derive an overall assessment of degree. GrainPile’s novel assessment and aggregation of degree expressions is shown to strongly outperform an interpretation-free, co-occurrence based method. Full paper: http://www.isi.edu/~timc/papers/IUI06-grainpile-chkl.pdf
|
| 16 Dec 2005 | Jonathan May |
A Better N-Best List - Practical Determinization of Weighted Finite Tree Automata
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Ranked lists of output trees from syntactic statistical NLP applications frequently contain multiple repeated entries. This redundancy leads to misrepresentation of tree weight and reduced information for debugging and tuning purposes. It is chiefly due to nondeterminism in the weighted automata that produce the results. I will introduce an algorithm that determinizes such automata while preserving proper weights, returning the sum of the weight of all multiply derived trees. I will also report results of the application of the algorithm to machine translation and Data Oriented Parsing.
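The effect of the redundancy can be illustrated without any automaton machinery. The sketch below is not the determinization algorithm from the talk; it only shows, on a hypothetical n-best list with invented weights, how summing the weights of repeated trees can change which tree ranks first.

```python
from collections import defaultdict


def collapse_nbest(derivations):
    """Merge repeated trees in an n-best list by summing their weights.

    `derivations` is a list of (tree_string, weight) pairs; the same
    tree may appear several times because distinct derivations yield it.
    """
    totals = defaultdict(float)
    for tree, weight in derivations:
        totals[tree] += weight
    # Highest total weight first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical list: the top entry wins only because the weight of
# the other tree is split across two derivations.
nbest = [
    ("(S (A a) (B b))", 0.30),
    ("(S (C a b))", 0.25),
    ("(S (C a b))", 0.20),
    ("(S (A a) (B b))", 0.05),
]
ranked = collapse_nbest(nbest)
```

Post-hoc merging like this only sees the trees that made it into the list, which is one reason to determinize the automaton itself rather than its output.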
|
| 30 Sep 2005 | David Chiang |
Some Computational Complexity Results for Synchronous Context-Free Grammars
Time: 3:00 pm - 4:30 pm Location: 4 Large Abstract: (This is a practice talk for a paper by Giorgio Satta and Enoch Peserico) This paper investigates some computational problems associated with probabilistic translation models that have recently been adopted in the literature on machine translation. These models can be viewed as pairs of probabilistic context-free grammars working in a `synchronous' way. Two hardness results for the class NP are reported, along with an exponential time lower-bound for certain classes of algorithms that are currently used in the literature.
|
| 29 Sep 2005 | Tim Chklovski |
Previews of my talks for K-CAP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The topics & approximate start times: (3:00 sharp) My 7-10 min bit for panel discussion on "Manual vs. Automated Knowledge Acquisition" Will touch on web extraction vs. learning from volunteers -- strengths and weaknesses, new thoughts on synergies (3:15) Designing Intelligent Acquisition Interfaces for Collecting World Knowledge from Web Contributors (paper by Timothy Chklovski, Yolanda Gil)
(3:55) Collecting Paraphrase Corpora from Volunteer Contributors (paper by
Timothy Chklovski)
|
| 26 Aug 2005 | Fossum, Huang and Zhang |
Summer Student Presentations
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: 3:00pm Victoria Fossum (Michigan) Exploring the Continuum between Phrase-based and Syntax-based Machine Translation State-of-the-art statistical machine translation systems use lexical phrases as the basic unit of translation. Phrase-based systems can capture those aspects of translation that are sensitive to local context. Syntax-based systems, on the other hand, make use of linguistically motivated syntactic structure, can capture long-distance dependencies and reorderings, and offer greater generalization in translation rules. However, their performance lags that of phrase-based systems. Hierarchical phrase-based translation, introduced by [Chiang 05], provides an elegant framework for exploring the continuum between phrase-based and syntax-based translation. This system combines the "formal machinery" of syntax-based systems without any "linguistic commitment" to a particular syntactic structure [Chiang 05]. I will present results from my re-implementation of Chiang's hierarchical phrase-based system, and (if time permits) compare those results with the following systems on Chinese-English translation: ISI's phrase-based system, and ISI's syntax-based system. Between now and December 2005, I plan to incrementally explore the space between phrase-based and syntax-based systems by augmenting these hierarchical phrase-based rules with richer syntactic annotation.
3:30pm Liang Huang (Penn) and Hao Zhang (Rochester) Efficient Integration of n-gram Language Models with Syntax-based Decoding We first give an overview of the ISI syntax-based MT system which is based on tree-to-string (xRs) translation rules. The biggest problem at this stage is the inefficiency of the integration of n-gram models. Without n-gram models, the xRs translation rules can be easily binarized with respect to the foreign language to ensure cubic-time decoding. With n-gram models, however, binarization without considering both languages will lead to exponential complexity.
Inspired by Inversion Transduction Grammar (ITG) (Wu, 97), we will focus
on the so-called ITG binarizable rules which account for over 99% of the
whole rule set. A simple linear-time algorithm will be presented to do the
binarization. Decoding with ITG-like rules is of low polynomial complexity
in both time and space. We will discuss experimental results on both
efficiency and accuracy of decoding with the new binarization. If time
permits, we will also present the "hook trick" (inspired by (Eisner and
Satta, 99)) to even further reduce the polynomial complexity of the
decoding process.
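As a rough illustration of what "ITG binarizable" means, a rule's reordering can be viewed as a permutation of its nonterminals, and one well-known way to test binarizability is a linear-time shift-reduce pass that merges adjacent spans. Whether this matches the algorithm presented in the talk is an assumption; the function name below is hypothetical.

```python
def is_itg_binarizable(permutation):
    """Shift-reduce test on a non-empty permutation of 1..n.

    Each stack item is a (low, high) span of target positions. Two
    neighbouring items are merged whenever their spans are contiguous,
    in either straight or inverted order. The permutation is
    binarizable iff everything reduces to a single span.
    """
    stack = []
    for position in permutation:
        stack.append((position, position))  # shift
        while len(stack) >= 2:
            (lo1, hi1), (lo2, hi2) = stack[-2], stack[-1]
            if hi1 + 1 == lo2 or hi2 + 1 == lo1:
                # reduce: the two spans form one contiguous block
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    return len(stack) == 1
```

Each element is shifted once and each reduction shrinks the stack, so the pass is linear in the length of the permutation. The two classic non-binarizable patterns are (2, 4, 1, 3) and (3, 1, 4, 2).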
|
| 24 Aug 2005 | Hopkins, Riesa, and Nakov |
Summer Student Presentations
Time: 3:30 pm - 5:00 pm Location: 11 Large Abstract: 3:30pm Mark Hopkins (UCLA) Tree Sequence Automata: A Unifying Framework for Tree Relation Formalisms There exist a wide variety of competing formalisms for representing a language of ordered tree pairs. These include (bottom-up and top-down) tree transducers, synchronous tree-substitution grammars (STSGs), synchronous tree-adjoining grammars (STAGs), and inversion transduction grammars (ITGs). Since these formalisms have all developed independently of one another, it is difficult to compare their respective representational power. This work seeks to make this task simpler by viewing these formalisms as instances of a general unifying formalism, which we call tree sequence automata (TSA). By casting these different formalisms in a single framework, we can compare them directly by studying the specific subclass of TSA that they fall into. 4:00pm Jason Riesa (Johns Hopkins) A case study in building a cost-effective speech-to-speech machine translation system with sparse resources: English - Iraqi Arabic The Arabic spoken dialect of Iraq is a language deprived of the vast resources that researchers enjoy when working with its written counterpart, Modern Standard Arabic (MSA). The Iraqi Arabic lexicon and grammar are also sufficiently distinct so that the use of existing tools or corpora for MSA yields little or no positive effect on machine translation output quality. One can see that building a machine translation system normally dependent on a large parallel corpus is a particularly difficult task when given just a 37,000 line translated parallel text based on transcribed speech. This talk will explore the constraints involved in working with this type of data, how we endeavored to mitigate such problems as a non-standard orthography and a highly inflected grammar, and propose a cost-effective way for dealing with such projects in the future. 4:30pm Preslav Nakov (UC Berkeley) Multilingual Word Alignment
Recently there has been a growing number of available multilingual
parallel texts. One such source is the European Union, which publishes its
official documents in the official languages of all member states
(sometimes also in the languages of the candidates). Another source is
the United Nations. These corpora are a great source of training data for
machine translation between new language pairs. But they also offer the
opportunity to obtain better pairwise word alignments by looking at
multiple languages in parallel. In this talk I will present my research as
a summer intern at ISI on getting better French (Fr) to English (En) word
alignments using an additional language (Xx). First, I will introduce two
heuristics which start with pairwise alignments between Fr-Xx, En-Xx and
Fr-En and then combine them probabilistically (in a linear model) or
graph-theoretically (by looking at in- and out-degrees for each word).
Then I will present two Model1 inspired alignment models: (a) from "Fr and
Xx" to En; and (b) from Fr to "En and Xx".
|
| 05 Aug 2005 | Jan Hajic (Charles U) |
The Family of Prague Dependency Treebanks
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: The Prague Dependency Treebank project is aimed at a linguistically complex, multi-tier annotation of relatively large amounts of naturally occurring sentences of natural language. There are four tiers at present: the basic token tier (level 0), and the morphological, surface-syntactic, and semantic (called "tectogrammatics") tiers. The syntactic and tectogrammatic tiers are based on a richly labelled dependency representation principle. So far, the project produced three corpora: the Czech-language-only Prague Dependency Treebank, the Prague Czech-English Dependency Treebank and the Prague Arabic Dependency Treebank. In the talk, the principles of the Prague Dependency Treebank linguistic annotation scheme will be presented. Some technical details will also be discussed, as well as some of the tools developed both for the manual annotation itself and for corpus-based NLP of Czech, English and Arabic.
|
| 05 Aug 2005 | Doug Oard (Maryland) |
The CLEF Cross-Language Speech Retrieval Test Collection
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Test collections for information retrieval tasks have traditionally assumed that what we are searching for are documents (e.g., Web pages, news stories, or academic documents). Most information that is generated is, however, not originally generated as part of a document, but rather as what we might refer to as "conversational media" (e.g., email, speech, or instant messaging). In this talk, I'll describe the creation of two test collections for conversational media, an email collection being created in the TREC Enterprise Search track and a spoken word test collection for the Cross-Language Evaluation Forum (CLEF). I'll spend most of the talk describing the details of the CLEF test collection, illustrating the issues with some of the results that we have obtained from our experiments with that collection. I'll conclude with a few remarks about the implications of what we are learning for DARPA's new GALE program. This is joint work with Charles University, the IBM TJ Watson Research Center, the Johns Hopkins University, the Survivors of the Shoah Visual History Foundation, and the University of West Bohemia.
About the speaker:
Douglas Oard is an Associate Professor at the University of Maryland,
College Park, with a joint appointment in the College of Information
Studies and the Institute for Advanced Computer Studies. He holds a Ph.D.
in Electrical Engineering from the University of Maryland, and his
research interests center around the use of emerging technologies to
support information seeking by end users. In 2002 and 2003, Doug spent a
year in paradise here at USC-ISI. His recent work has focused on
interactive techniques for cross-language information retrieval and on
searching conversational text and speech. Additional information is
available at http://www.glue.umd.edu/~oard/.
|
| 15 Jul 2005 | Victoria Li Fossum (Michigan) |
Inducing POS Taggers by Projecting from Multiple Source Languages
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (Yarowsky et al., 2001) present an algorithm for bootstrapping a POS tagger for an arbitrary target language, using an existing POS tagger for a source language and a parallel corpus in the source and target languages. The source text is annotated with the POS tagger; the parallel corpus is word-aligned; the POS tags are "projected" from source to target language; and finally smoothing is performed before training a POS tagger for the target language on the projected annotations.
I will talk about my work (jointly with my advisor, Steve Abney, at U. of
Michigan) in which we extend this algorithm by projecting from multiple
source languages onto a target language, then combining the outputs to
compute a consensus POS tagger. Our hypothesis is that systematic
transfer errors from different source-target pairs can be reduced by using
multiple source languages. I will present experimental results for three
different source languages (English, German, and Spanish), and two
different target languages (French and Czech). Our results indicate that
using multiple source languages improves performance.
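The projection and combination steps can be sketched roughly as follows. This is a minimal illustration that assumes simple majority voting for the consensus and omits the smoothing and tagger training described above; the function names, tags, and alignments are hypothetical and are not taken from the talk.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_length):
    """Copy POS tags from source tokens to aligned target tokens.

    `alignment` is a list of (source_index, target_index) links.
    Unaligned target tokens get None; the first link to a target wins.
    """
    projected = [None] * target_length
    for src, tgt in alignment:
        if projected[tgt] is None:
            projected[tgt] = source_tags[src]
    return projected


def consensus_tags(projections):
    """Majority vote per target token across several projections."""
    consensus = []
    for votes in zip(*projections):
        counts = Counter(tag for tag in votes if tag is not None)
        consensus.append(counts.most_common(1)[0][0] if counts else None)
    return consensus


# Hypothetical three-token target sentence tagged via three sources.
from_english = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
from_german = project_tags(["DET", "VERB", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
from_spanish = project_tags(["DET", "NOUN"], [(0, 0), (1, 1)], 3)
combined = consensus_tags([from_english, from_german, from_spanish])
```

Here the projection from the second source mislabels the middle token, and the vote across sources recovers the majority tag, which is the intuition behind the hypothesis stated in the abstract.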
|
| 07 Jul 2005 | Radu Soricut |
Natural Language Generation for Text-to-Text Applications Using an Information-Slim Representation
Time: 3:00 pm - 4:30 pm Location: 11 Small Abstract: Text-to-text applications -- Machine Translation, Summarization, Question Answering -- do not usually involve generic Natural Language Generation (NLG) systems in their generation components, but rather use application-specific algorithms. The main reason for this state of affairs is that virtually all the formalisms used by current generic NLG systems require information that cannot be reliably extracted from unrestricted text. This thesis proposal is about meeting the demand for natural language generation in the context of text-to-text applications. I introduce a new representation formalism (WIDL-expressions), propose generation algorithms that operate on representations specific to this formalism, and discuss a generic sentence realization framework for text-to-text applications. The generation mechanism is based on algorithms for intersecting WIDL-expressions with probabilistic language models. I present both theoretical and empirical results concerning the correctness and efficiency of these algorithms. I also discuss the practical aspects arising from implementing this generation mechanism. In a concrete application of the proposed generation mechanisms, I present an end-to-end Machine Translation application. I also discuss another possible application for Automated Summarization, namely automated headline generation.
|
| 06 Jul 2005 | Alessandro Moschitti (Rome) |
Kernel Methods for Semantic Role Labeling
Time: 2:00 pm - 3:30 pm Location: 11 Large Abstract: Automatic Natural Language applications often require the processing of structured data. Traditional machine learning approaches attempt to represent structured syntactic/semantic objects by means of flat feature representations, i.e. attribute-value vectors. This raises two problems: 1. There is no well defined theoretical motivation for such a feature model. Structural properties may not fit in any flat feature representation. 2. To define effective flat features, a deep knowledge about the linguistic phenomenon is required. Kernel methods for Natural Language Processing aim to solve both the above problems as kernel functions can be used to define similarities between linguistic objects without explicitly defining the target feature space. In this way, a linguistic phenomenon can be modeled at a more abstract level where the modeling is easier. Such a property is extremely useful when the representation of linguistic phenomena is still not well understood. For example, the feature design of semantic role labeling appears to be quite complex since several and non-definitive feature sets have been proposed. As a viable alternative to manual feature design, kernel methods propose two steps: (1) they generate all substructures of the target syntactic/semantic structures and (2) they let the learning algorithm (e.g. Support Vector Machines) select the most relevant substructures. In this talk, we (1) introduce the PropBank and FrameNet predicate argument structures, (2) present the standard approaches to the automatic labeling of semantic roles and (3) show advanced semantic role labeling models based on kernel methods. About the speaker: Alessandro Moschitti is a researcher at the Computer Science Department of the University of Rome "Tor Vergata". In 1998 he received his master's degree in Computer Science at the University of Rome "La Sapienza".
In 2003 he finished his PhD in Computer Science at "Tor Vergata" University. Between 2002 and 2004 he worked as an associate researcher in the University of Texas at Dallas. His research interests concern machine learning approaches for Natural Language Processing and Information Retrieval. His deep expertise relates to automated text categorization and semantic role labeling. Recently, he has devised new kernels which enable Support Vector and other kernel-based machines to carry out advanced semantic processing.
|
| 23 Jun 2005 | Michael Fleischman (MIT) |
Intentional Context in Situated Language Learning
Time: 10:30 am - 12:00 pm Location: 11 Small Abstract: Natural language interfaces designed for agents that interact with users in shared environments (e.g. training simulators, videogames) must incorporate knowledge about the users' context in order to address the many ambiguities of situated language use. We introduce a model of situated language acquisition that operates in two phases. First, intentional context is represented and inferred from user actions using probabilistic context free grammars. Then, utterances are mapped onto this representation in a noisy channel framework. The acquisition model is trained on unconstrained speech collected from subjects playing an interactive game, and tested using an understanding task. Discussion of results focuses on the implications both for theoretical models of cognition and for natural language applications in shared environments.
|
| 22 Jun 2005 | Mitsunori Matsushita |
Lumisight Table: A Face-to-face Collaboration Support System That Optimizes Direction of Projected Information to Each Stakeholder
Time: 11:00 am - 12:00 pm Location: 11 Large Abstract: (This talk occurs in the morning on the same day as the Bayesian tutorial.) The goal of our research is to support cooperative work performed by stakeholders sitting around a table. To support such cooperation, various table-based systems with a shared electronic display on the tabletop have been developed. These systems, however, suffer the common problem of not recognizing shared information such as text and images equally because the orientation of their view angle is not favorable. To solve this problem, we propose the Lumisight Table. This is a system capable of displaying personalized information to each required direction on one horizontal screen simultaneously by multiplexing them and of capturing stakeholders' gestures to manipulate the information. About the Speaker: Mitsunori Matsushita is a research scientist at NTT Communication Science Labs., Nippon Telegraph and Telephone Corporation (NTT). He received B.E., M.E., and Dr.E. degrees from Osaka University, in 1993, 1995 and 2003 respectively. In 1995, he joined NTT, and has been engaged in research on natural language understanding, information visualization, and interaction design.
|
| 22 Jun 2005 | Hal Daume III |
Beyond EM: Bayesian Techniques for NLP Researchers
Time: 1:00 pm - 4:30 pm Location: 11 Large Abstract: EM has proved to be a great and useful technique for unsupervised learning problems in natural language. Unfortunately, it cannot solve every problem out there, either because the E-step is intractable, the M-step is intractable or both. Typically our community resorts to a Viterbi approximation in this case, which really isn't very justified and can easily diverge from our expectations (no pun intended). Moreover, EM -- like all maximum likelihood methods -- suffers from a need for ad-hoc and undesirable smoothing. All of these problems -- intractable E- or M-steps, the Viterbi approximation, and the annoyance of smoothing -- are solved by using Bayesian methods. Moreover, from a theoretical point of view, the Bayesian paradigm is much more foundationally well justified than the frequentist use of estimators (such as the maximum likelihood estimator), at some cost in computation (though not as much as you might believe). In this tutorial, I will discuss Bayesian methods as they can be used in natural language processing. The first half will be background (some of which you probably won't have seen, some of which you probably will have seen, but which will probably be presented in a different way than you're used to) including graphical models, EM, priors and pro- (and con-) Bayesian arguments. The second half of the tutorial will focus on solving complex inference problems, essentially building on what we've seen from EM. I'll cover MAP (*not* Bayesian -- if you can't tell me why, then you should come to the tutorial!), summing, Monte Carlo, MCMC, Laplace, variational and expectation propagation. Time permitting, I will briefly discuss Bayesian discriminative models (basically what a Bayesian uses instead of SVMs), non-parametric (infinite) models and Bayesian decision theory, all of which make use of the inference techniques we will have already covered.
This tutorial is intended to be largely self contained, though I will expect that you know what probabilities are, what distributions are and the standard manipulations of conditional/joint distributions. Familiarity with EM would be helpful, but I'll cover this topic in some depth since it will be important for understanding the rest of the tutorial. I hope -- though this never really seems to come to fruition -- that this will be a semi-interactive talk and I will attempt to adjust according to what people are interested in and what is putting people to sleep. (see http://www.isi.edu/~hdaume/bayesnlp/ for more information)
|
| 20 Jun 2005 | Birte Loenneker (Hamburg) |
Between Story Generation and Natural Language Generation
Time: 10:00 am - 11:30 am Location: 11 Small Abstract: Narratology analyzes the discursive structure of narratives as finalized products of human invention, such as novels, short stories, or fairy tales. Those narratives are rendered in a given surface form; Narratology focuses on narratives in natural language. Narratologists assume that each narrative surface representation is associated with a neutral, abstract event sequence, the "Story" (histoire, sjuzhet). The abstractness of Story is illustrated by the fact that the same Story can be realized in different surface texts. By discursive structure or "Discourse" (discours, fabula), narratologists mean the relation between an abstract Story and its concrete expression in a sequential text. For example, if the chronological order of the Story is not respected in its textual recount, we are dealing with the Discourse parameter of order. Other Discourse parameters include the frequency with which Story events are evoked, the point of view from which they are narrated (perceived, evaluated,...), or framed narratives with several narrative levels. The Story Generator Algorithms project at the University of Hamburg evaluated several existing Story Generators with respect to their discursive abilities. It became obvious that most Story Generators concentrate on creating a coherent and chronological abstract Story, which is directly mapped onto natural language. This results in a predominance of 1:1 relations between Story and surface, and in most cases corresponds to a default or zero instantiation of Discourse parameters. As a consequence, Story Generator outputs tend to be very explicit and straightforward, and are likely to be perceived as uniform and boring. Narratological expert knowledge might be useful to future enhanced Story Generators and to Natural Language Generation systems dealing with narrative. One of the aims of Computational Narratology is to model that expert knowledge.
Ideally, narratological knowledge will be integrated into a Narratological Structurer, as a processing component of an advanced system that creates narratives. In such a system, the Narratological Structurer will be the interface between a Story Generator and subsequent Natural Language Generation modules. The talk also presents examples of the knowledge that is being modelled.
About the Speaker: Birte Lönneker graduated from the University of Hamburg, Germany, with a degree in French with Finno-Ugristics (Finnish) and Business Administration. Since then, her main fields of publication have been Cognitive Linguistics and electronic resources for Natural Language Processing, with special focus on frames and metaphors, as well as electronic dictionaries, corpora, and recently part-of-speech tagging. Her PhD on Concept Frames and Relations, also published as a book in 2003, was co-supervised at the Institute for Romance Languages and at the Department of Informatics in Hamburg. For her Slovenian-German online dictionary, Birte Lönneker was twice awarded the EURALEX Laurence Urdang Award. From 2002 to 2004, she received various research grants for Slovenia, where she worked in the Corpus Laboratory of the Institute of Slovenian Language. Since 2004, Birte Lönneker has carried out research on Story Generator Algorithms within the Narratology Research Group Hamburg. She is also a board member of the German Cognitive Linguistics Association.
|
| 17 Jun 2005 | Gully Burns |
The neuroscience laboratory as a knowledge factory: challenges, approaches and tools
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: As a discipline of biology, the field of neuroscience suffers greatly from information overload, non-standardization and complexity. In the absence of a mathematical theoretical structure for the subject, scientists use their own ad-hoc methods of collating and synthesizing information from both the primary literature and their own data. In order to eventually formalize and accelerate the development of theoretical approaches in the subject, we are combining an Electronic Laboratory Notebook (ELN) with asset management of the primary research literature to construct a knowledge engineering framework based around the organizational unit of a neuroscience laboratory. This project, called "NeuroScholar" (http://www.neuroscholar.org/), is open-source, and is being tested and used in the laboratories of Prof. Larry Swanson and Prof. Alan Watts at USC. In each laboratory, the system will operate on top of a "laboratory corpus" of knowledge resources (data files, full-text PDF files, etc.) that summarizes the relevant knowledge for that laboratory. Not only will this collection provide a valuable resource for the members of the laboratory, it will also provide a platform for natural language processing and knowledge engineering to answer formally defined research questions. The Society for Neuroscience's annual meeting attracts over 30,000 attendees, who collectively form a potential user base for this software. I will talk about the ideas underlying the project, the current implementation of NeuroScholar, developments from collaboration with the natural language group at ISI and possible collaborations for the future.
|
| 13 Jun 2005 | Hal Daume III |
Search, Learning and Features (my thesis proposal proposal)
Time: 10:30 am - 12:00 pm Location: 11 Small Abstract: I'm going to talk about what I've been working on recently. My thesis proposal is something having to do with the interaction of search, learning and features in supervised natural language problems. I will be focusing on the task of coreference, since it is a well-studied problem, yet nevertheless not really solved and quite difficult. It is also a great pedagogical example for why we should care about something *other* than standard Markov random fields for structured prediction, since, for the coreference problem (and pretty much every other "real" natural language problem) inference in such models is intractable.
The contents of this talk will be roughly 40% from a paper I have at ICML
this year on efficient, accurate supervised learning techniques for
structured prediction (and why I feel inclined to make the very
controversial statement that supervised learning for NLP problems is
solved); it will be roughly 40% about an application of this technique to
the coreference resolution problem and an exploration of the feature space
for solving this problem (submitted to HLT); and it will be roughly 20%
about looking forward to what I want to accomplish in the remainder of my
thesis, not covered by the first 80%.
|
| 10 Jun 2005 | Liang Huang (Penn) |
Better k-best Parsing, Hypergraphs and Dynamic Programming
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: We discuss the relevance of k-best parsing to recent applications in natural language parsing, and develop algorithms that substantially improve on previously-used algorithms with respect to efficiency, scalability, and accuracy. We demonstrate these algorithms in experiments on Bikel's implementation of Collins' lexicalized PCFG model, and on a synchronous CFG based decoder for statistical machine translation. We show in particular how the improved output of our algorithms has the potential to improve results from parse reranking systems and other applications. In this talk, I will demonstrate the convergence of several popular parsing formalisms (weighted deduction, shared forest, semiring) under the powerful hypergraph formalism. If time permits, I will also show how generic Dynamic Programming can be formalised as hypergraph searching. Joint work with David Chiang (University of Maryland)
|
| 08 Jun 2005 | Hao Zhang (Rochester) |
Lexicalization and A* Searching for Inversion Transduction Grammar
Time: 3:00 pm - 4:30 pm Location: 4th floor Abstract: The Inversion Transduction Grammar (ITG) of Dekai Wu generates a synchronous parse tree for a given pair of sentences in two languages. By allowing inversion of the order of children at any level of the synchronous parse tree, ITG can do recursive, systematic word reordering. We made a version of ITG where the nonterminals are lexicalized by word pairs and the inversions are dependent on the so-lexicalized nonterminals. We found that after lexicalization, the Alignment Error Rate (AER) against the gold standard is reduced for short sentences. ITG parsing has high polynomial complexity. We proposed a pruning technique that utilizes IBM Model 1 to estimate the inside and outside probability of a bitext cell. Taking a step further, we applied A* parsing, previously used for monolingual parsing, to ITG. I will talk about the heuristic estimates we used for A* parsing for Viterbi alignment selection and decoding.
|
| 27 May 2005 | Radu Soricut |
Towards Developing Generation Algorithms for Text-to-Text
Time: 3:00 pm - 4:30 pm Location: 11 Small Abstract: We describe a new sentence realization framework for text-to-text applications. This framework uses IDL-expressions as a representation formalism, and a generation mechanism based on algorithms for intersecting IDL-expressions with probabilistic language models. We present both theoretical and empirical results concerning the correctness and efficiency of these algorithms.
|
| 13 May 2005 | Ed Stabler (UCLA) |
Natural Logic
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I will describe some recent work on "natural logics", logics for languages that are more similar to human languages than traditional first order predicate logic, giving particular attention to questions about what the syntax encodes about semantic relations among sentences. On everyone's view, some but not all entailments are syntactically encoded (in a sense that I will define precisely), but, beyond this starting point, controversy starts almost immediately. Considering some particular examples, I will sketch methods for addressing some of the basic questions.
|
| 22 Apr 2005 | Deepak Ravichandran |
Working with Large Corpus, High speed clustering and its applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I am going to be talking about work that I have been doing over the past 6-9 months. This includes randomized algorithms and their application to two NLP problems: noun clustering and noun-pair clustering. I will also comment on my experience of working with very large amounts of real natural language text. (This includes processing and working with data available from the web. This corpus is not the standard newspaper text that we are so used to in the NLP community.) This talk will also cover a large part of my thesis work. |
| 08 Apr 2005 | Jamie Callan (CMU) |
Search Engines for HLT Applications
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 25 Mar 2005 | Dagen Wang |
Metalinguistic feature study for spontaneous speech in human computer interaction
Time: 3:00 pm - 4:30 pm Location: 11 Large (THIS HAS CHANGED!!!) Abstract: Speech is a crucial component in human computer interaction. While tremendous progress has been made in automatic speech recognition, speech transcription -- which is the output of automatic speech recognition -- is far from providing all the information that one could retrieve from speech. For example, prominence, pause, rhythm, and rate of speech all carry important information in speech and are crucial in speech perception. Inclusion of such information can facilitate better machine recognition and understanding of speech.
In this talk, we will introduce our research efforts and results in speech
rate, prominence, disfluency and utterance boundary detection. We will
also show some interesting applications utilizing these features in
natural language understanding and dialog management.
|
| 18 Mar 2005 | Ed Hovy |
Methodologies of ontology content construction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk is the second of three tutorial lectures on ontologies. It first shows some details of various Upper Ontologies: ResearchCYC, SUMO, DOLCE, and the Penman Upper Model. It then discusses the problem of creating content for the 'Middle Model' zone of ontologies, and outlines a methodology for moving from words to word senses to concepts. It concludes by describing ISI's Omega ontology and showing how Omega has been used in annotation projects to support semantic labeling of texts. Please bring a pen or pencil and some paper; there is a small exercise!
|
| 18 Feb 2005 | Inderjeet Mani (Georgetown) |
TBA
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 14 Feb 2005 | Tim Chklovski |
Collecting Broad-Coverage Knowledge Bases from Volunteers
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: (Note that this is a MONDAY!)
|
| 11 Feb 2005 | Hae-Chang Rim |
Unsupervised Word Sense Disambiguation Using Wordnet Relatives
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: |
| 28 Jan 2005 | Yutaka Sasaki (ATR) |
Research Activities in Speech Translation at ATR/QA as Question-Biased Term Extraction
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk has two parts. In the first part, I will introduce research activities in Speech-to-Speech Translation at ATR, including ongoing research on statistical machine translation. In the second part, I will present a new approach to QA named Question-Biased Term Extraction (QBTE). QBTE directly extracts answers as terms biased by the question. To confirm the feasibility of our QBTE approach, we conducted experiments on the CRL QA Data based on 10-fold cross validation, using Maximum Entropy Models as an ML technique. Experimental results showed that the trained system achieved approximately 0.35 in MRR and 50% in TOP5 accuracy. This part is an English version of my presentation given in Japanese at IPSJ SIGNL-163 in 2004. If time allows, I would like to introduce the NTCIR-5 (2004/2005) Cross-Lingual QA task (CLQA) that I am going to organize. About the speaker: Yutaka Sasaki received his Ph.D. in Engineering from the University of Tsukuba, Japan in 2000 for his work on generating Information Extraction rules with hierarchically sorted Inductive Logic Programming. He joined NTT Laboratories in 1988. Since then, he has been involved in research in rule-based CAI, inductive logic programming, Information Extraction, and Question Answering. From 1995 to 1996, he spent one year at Simon Fraser University, Canada as a visiting researcher. From 1999, he led a subgroup to develop the first practical Japanese Question Answering System SAIQA. Then, he applied SVMs to automatically construct the QA system SAIQA-II from QA and NE data. In June 2004, he moved to ATR Spoken Language Translation Research Laboratories. Currently, he is the head of the Department of Natural Language Processing. He is also an organizer of the NTCIR-5 Cross-Lingual Question Answering Task.
|
| 17 Dec 2004 | Nicola Ueffing |
Word-Level Confidence Measures for SMT
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This talk will address the problem of assessing the correctness of MT output at the word level. I will give an overview of word confidence measures for SMT. Different variants of word posterior probabilities that can be directly used as confidence measures will be presented. Their connection with the Bayes decision rule and the underlying error measure will be shown. An experimental comparison of different word confidence measures will be presented on a translation task consisting of technical manuals. Additionally, I will show how word confidence measures can be applied in an interactive SMT system. This system predicts translations, taking parts of the sentence into account that have already been accepted or typed by the user. Through the use of confidence measures, the performance of the prediction engine can be improved.
About the Speaker: Nicola Ueffing is a graduate research assistant at the group for "Human Language Technology and Pattern Recognition" (Lehrstuhl fuer Informatik VI) at RWTH Aachen University. She received her diploma in mathematics from RWTH Aachen University in 2000. Her research topic is statistical machine translation, focusing on confidence measures for SMT. In 2003, she was a member of the team working on "Confidence Estimation for SMT" at the CLSP workshop at JHU.
|
| 10 Dec 2004 | Nick Mote |
Developing a Language Model for Second Language Learner Speech
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: ISI's Tactical Language Project is a system designed to teach Americans how to speak Arabic through a video game environment. We've taken an FPS engine (Unreal 2003) and redid the graphics so it looks like you're in a typical Lebanese village. We took away the guns, added speech recognition, and set the player in the middle of it all. The theory is that if you learn well in a classroom, you'll perform well in a classroom, but if you learn well in a pseudo-naturalistic environment, you'll perform better in real life. In a pedagogical context, speech recognition is a hard thing: we're trying to recover signal from noisy language-learner speech, with all of its mispronunciations, disfluencies, and grammatical errors. Language understanding is hopeless unless you have a good approximation of what kinds of mistakes learners make, and you can build a system to anticipate them. Suppose an English language learner says "Water". Is he asking you for water? Is he telling you there's a puddle in front of you? Is he saying his name is "Walter", but with horrible pronunciation? There's a lot of ambiguity involved. In order to disambiguate, we need to look at the speech signal itself, the utterance's context, the learner's past language performance, details about the learner's mother language as it relates to English, and so on. Only then can we hope to guess what the learner is actually trying to say. And then, of course, once we've made a good guess at the learner's speech intentions, what do we do about it? How do we correct him? How do we balance the consideration of inherent qualities of learner motivation, language errors, learning objectives, and possibly low-confidence speech recognition, as we generate good pedagogical feedback? This is NLP (primarily statistical) with a bit of pedagogy theory and linguistic (SLA and phonology) theory sprinkled in.
|
| 19 Nov 2004 | Chin-Yew Lin |
After TIDES, What's Left? - Finding Basic Elements
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As DARPA's TIDES (Translingual Information Detection, Extraction, and Summarization) program is coming to an end, I will give a summary of what we have learned from TIDES in summarization and a brief overview of our current effort in developing automatic evaluation methods that go beyond surface n-gram matching. Topics to be covered:
(1) Summary of DUCs 2001 - 2004
(2) Automatic Evaluations in Summarization and MT
(3) Basic Elements - New Efforts in Summarization at ISI
|
| 15 Nov 2004 | Thiago Pardo |
Unsupervised learning of verb argument structures
Time: 3:00 pm - 4:30 pm Location: 8th floor multipurpose room (#849) -- NOT the conference room Abstract: In this talk, I'll present the investigation I have been carrying out at ISI lately under Daniel Marcu's supervision. Following the noisy-channel framework, we propose a statistical model for learning the argument structures of verbs automatically. We show that we are able to learn both lexicalized and generalized structures and achieve good results, relying only on basic NLP tools like a POS tagger and a named-entity recognizer. We also present a comparison of the structures we learn with the predicted ones in PropBank.
|
| 12 Nov 2004 | Dragomir Radev |
Words, links, and patterns: novel representations for Web-scale text mining
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Textual data is everywhere, in email and scientific papers, in online newspapers and e-commerce sites. The Web contains more than 200 terabytes of text, not even counting the contents of dynamic textual databases. This enormous source of knowledge is seriously underexploited. Textual documents on the Web are very hard to model computationally: they are mostly unstructured, time-dependent, collectively authored, multilingual, and of uneven importance. Traditional grammar-based techniques don't scale up to address such problems. Novel representations and analytical tools are needed. I will discuss several current projects at Michigan related to text mining from a variety of genres. Depending on the amount of time, I will talk about (a) lexical centrality for multidocument summarization, (b) syntax-based sentence alignment, (c) graph-based classification, (d) lexical models of Web growth, and (e) mining protein interactions from scientific papers. As it turns out, the right representations, when complemented with traditional NLP and IR techniques, turn many of these into instances of better-studied problems in areas such as social networks, statistical mechanics, sequence analysis, and computational phylogenetics.
About the Speaker: Dragomir R. Radev is Assistant Professor of Information, Electrical Engineering and Computer Science, and Linguistics at the University of Michigan, Ann Arbor. He leads the CLAIR (Computational Linguistics And Information Retrieval) group, which currently includes 12 undergraduate and graduate students. Dragomir holds a Ph.D. in Computer Science from Columbia University. Before joining Michigan, he was a Research Staff Member at IBM's TJ Watson Research Center in Hawthorne, NY. He is the author of more than 45 papers on information retrieval, text summarization, graph models of the Web, question answering, machine translation, text generation, and information extraction. Dr. Radev's current research on probabilistic and link-based methods for exploiting very large textual repositories, representing and acquiring knowledge of genome regulation, and semantic entity and relation extraction from Web-scale text document collections is supported by NSF and NIH. Dragomir serves on the HLT-NAACL advisory committee, was recently reelected as treasurer of NAACL, is a member of the editorial boards of JAIR and Information Retrieval, and is a four-time finalist at the ACM international programming finals (as contestant in 1993 and as coach in 1995-1997). Dragomir received a graduate teaching award at Columbia and, recently, the U. of Michigan award for Outstanding Research Mentorship (UROP).
|
| 05 Nov 2004 | Mary Wood (Manchester) |
A Human-Computer Collaborative Approach to Computer Aided Assessment
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The ABC (Assess by Computer) system has been developed and used in the School of Computer Science at the University of Manchester for formative and (principally) summative assessment at undergraduate and postgraduate level. We believe that fully automatic marking of constructed answers - especially free text answers - is not a sensible aim. Instead - drawing on parallels in the history of machine translation - we take a "human-computer collaborative" approach, in which the system does what it can to support the efficiency and consistency of the human marker, who keeps the final judgement. Our current work focuses on what are generally referred to as "short text answers" as contrasted with "essays". However, we prefer to contrast "factual" with "discursive" answers, and speculate that the former may be amenable to simple statistical techniques, while the latter require more sophisticated natural language analysis. I will show some examples of real exam data and the techniques we are using and developing to handle them.
|
| 22 Oct 2004 | Jerry Hobbs |
Like Now: Two Explorations in Deep Lexical Semantics
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: As part of an effort to encode the commonsense knowledge we need in natural language understanding, I have been looking at several very common words and their uses in diverse corpora, and asking what we have to know to understand this word in this context. In this talk, I will describe the investigations of the uses of two words -- the adverb "now" and the preposition "like". One might think that "now" simply expresses a temporal property of an event. But in fact in almost every instance, it is used to point up a contrast -- "This is true now. Something else was true then." It is thus more of a relation than a property. I will describe several categories of such relations. Another question of interest about "now" is "How long a period is the word "now" describing in its various uses?": "I'm typing an abstract now" vs. "We travel by automobile now." I suggest some categories of knowledge that need to be encoded to answer this question. When we successfully understand "A is like B", we have figured out some property that A and B have in common. How can we find that property computationally? In the data I looked at, in 80% of the instances, the property is explicit in the nearby text, and I will talk about how we can identify it. For the remainder I examine the knowledge we would need in order to infer the common property.
|
| 24 Sep 2004 | Hal Daume III |
Domain Adaptation in Maximum Entropy Models
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I will present some preliminary results on the problem of domain adaptation in maximum entropy models, specifically in the case when there is a large amount of "out of domain" data, and only a very small amount of "in domain" data. The model and algorithms I present are based on the technique of conditional Expectation Maximization (CEM) and allow for relatively fast optimization of these models. Preliminary results on some tasks are quite promising.
|
| 17 Sep 2004 | Various |
About Syntax Fest 2004 (Part II)
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This summer we held a three-month workshop on syntax-driven machine translation, in which we learned syntactic transformations automatically from Chinese/English translated corpora and applied them to translate new text. We'll give a progress report!
|
| 10 Sep 2004 | Various |
About Syntax Fest 2004 (Part I)
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: This summer we held a three-month workshop on syntax-driven machine translation, in which we learned syntactic transformations automatically from Chinese/English translated corpora and applied them to translate new text. We'll give a progress report!
|
| 16 Aug 2004 | Patrick Pantel & Tim Chklovski |
VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations
Time: 2:00 pm - 3:30 pm Location: 11 Large Abstract: Broad-coverage repositories of semantic relations between verbs could benefit many NLP tasks. We present a semi-automatic method for extracting fine-grained semantic relations between verbs. We detect similarity, strength, antonymy, enablement, and temporal happens-before relations between pairs of strongly associated verbs using lexico-syntactic patterns over the Web. On a set of 29,165 strongly associated verb pairs, our extraction algorithm yielded 65.5% accuracy. We provide the resource, called VerbOcean, for download at http://semantics.isi.edu/ocean/. We will also discuss current work on disambiguating the verbs in the network as well as refining the semantic relations using path analysis.
|
| 13 Aug 2004 | Deepak Ravichandran |
Randomized algorithms and their application to NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The last decade has seen a plethora of papers in NLP devoted to Machine Learning algorithms. However, most of these papers have devoted their effort exclusively to improving system performance on the accuracy axis. Most of the sophisticated NLP algorithms are extremely slow and do not scale up easily when applied to large amounts of data. I will talk about the importance of randomized algorithms and their potential for speeding up some NLP algorithms. This talk will be a survey of some recent advances in Theoretical Computer Science/Math seen from an NLP point of view. I am not going to present any results, but I am hoping that this talk will clarify my thinking process, get me feedback from people and help me collaborate with others.
|
| 09 Aug 2004 | Justin Busch, Hai Huang, Jens Stephan & Chen-kang Yang |
CL Student Presentations
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract:
Justin Busch: Weight and Semantic Class Issues in Japanese Noun Phrase Ordering
Many current designs for automatic parsers learn probabilities for the relative frequencies of parts-of-speech and syntactic rules, and this has proven to be generally reliable. In spite of the ubiquity of probabilistic techniques for parsing, however, little attention has been given to the linguistic significance of the probabilistic data and what it might say about human performance. Hawkins proposes a general theory of grammaticalization based on the minimization of syntactic domains. Given that a sentence of any language will contain at least one noun phrase, one verb, and possibly additional noun phrases and prepositional phrases, "minimize domains" suggests that these phrases will order themselves according to whichever pattern requires the least effort to recognize the higher syntactic structure of the sentence. These effects are directly measurable through corpus statistics, and can be interpreted as potential heuristics for probabilistic parsers. In this study, we examine Japanese data from the Kyoto Treebank and test Hawkins' predictions for noun phrase ordering by noun phrase weight as well as by generic semantic types. The discussion will focus primarily on how accurately Hawkins' predictions are reflected in the corpus statistics, and will conclude with observations about how they might be applied to the decision mechanisms of probabilistic parsers.
Hai Huang: TBA
Jens Stephan: Evaluation and Visualization of a Dialogue System
Evaluations have become a necessary standard for almost any type of research. However, there are many areas where there is no common agreement on how to evaluate, which is the case for the complex problem of evaluating dialogue systems.
The evaluation of the multi-party, multi-modal dialogue system MRE provides a good example of what questions are important for such an evaluation, how to actually do the evaluation, and finally how to make special problems of the system visible so that the evaluation results can be used to improve the system's performance. After a brief introduction to the MRE domain and architecture, I will break the task down into a set of general evaluation questions. From there I will explain what kinds of metrics and visualizations are suited to answer those questions and what kind of data is needed, as well as how that data was obtained. Along the way, examples of actual system problems and performance will be presented. The topics of data formatting and visualization will receive some special attention through an introduction to the MRE Evaluation Toolkit as well as the corpus it operates on.
Chen-kang Yang: Using the Omega Ontology to Determine Selectional Restrictions for Word Sense Disambiguation
Word sense disambiguation is fundamental for language processing. Though
purely statistical methods are effective for this task, they neglect the
syntactic and semantic aspects. In this study, we adopt a hybrid approach
by applying an unsupervised machine learning method to learn verbs'
selectional restrictions on their subjects/objects. The system then uses
these learned selectional restrictions for word sense disambiguation of
the subjects/objects. Instead of words, the training data contains
ontological taxonomy hierarchies that are retrieved from the Omega
ontology. Unlike other similar systems, we are able to automatically find
the best match among classes from different levels of the ontology. This
provides us more flexibility and is closer to human instinct. Our system
performs better than other similar systems, though it still needs
cooperating methods for better results.
|
| 06 Aug 2004 | Hae-Chang Rim |
Information Retrieval using Word Senses: Root Sense Tagging Approach
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Information retrieval using word senses is emerging as a good research challenge in semantic information retrieval. In this presentation, I am going to propose a new method for using word senses in information retrieval: the root sense tagging method. This method assigns coarse-grained word senses defined in WordNet to query terms and document terms in an unsupervised way, using co-occurrence information constructed automatically. The sense tagger is crude, but performs consistent disambiguation by considering only the single most informative word as evidence to disambiguate the target word. We also allow multiple-sense assignment to alleviate the problem caused by incorrect disambiguation. Experimental results on a large-scale TREC collection show that the proposed approach succeeds in improving retrieval effectiveness, while most of the previous work failed to improve performance even on small text collections. The proposed method also shows promising results when it is combined with pseudo relevance feedback and a state-of-the-art retrieval function such as BM25.
|
| 16 Jul 2004 | Hal Daume III and Radu Soricut |
Practice Talks for ACL (+workshops)
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 09 Jul 2004 | Kevin Knight |
Survey of Trees and Grammars
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: I'll give a survey of trees and grammars, at least the parts that seem most relevant to ongoing work at ISI. This will be a theory talk. I'll start with context-free grammars, which were developed in the 1950s, and cover other tree-generating systems. I'll also talk about tree-transforming systems. |
| 02 Jul 2004 | Hal Daume III |
A Phrase-Based HMM Approach to Document/Abstract Alignment
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: I will present work that extends the standard hidden Markov model to a version that can emit multiple symbols in a single time step. Using this model, we are able to automatically create phrase-to-phrase mappings in an alignment process. I've applied this model to the task of creating alignments between documents and their human-written abstracts, yielding an overall alignment F-score of 0.548, a significant improvement on the best results to date of 0.363. These results are published in an EMNLP paper this year, but the talk will be an extended version of the talk I will give there (namely, I will discuss the mechanics of the extended HMM in more detail in this seminar).
|
| 25 Jun 2004 | Dan Gildea |
Syntactic Supervision and Tree-Based Alignment
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Tree-based probability models of translation have been proposed to take advantage of parse trees on one, both, or neither side of a parallel corpus. I will present comparative results for these three approaches for the task of word alignment on Chinese-English and French-English data, as well as some analysis of what is going on behind the numbers.
|
| 21 Jun 2004 | Emil Ettelaie |
Speech-to-Speech Translation: A Phrase Classification Approach
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: This talk will be about automatic speech-to-speech translation. In our system, a doctor speaks one language, the patient speaks another language, and the machine translates their utterances from one language to the other. The talk will be followed by a demo of our system. One approach we have been successful with is phrase classification, i.e., classifying a noisy speech-recognized utterance into one of many meaning categories. Phrase classification is computationally cheap and can provide high-quality translations for in-domain utterances almost instantaneously. Speed is important for speech translation, where processing delay is a great concern. In this talk, different aspects of building a classification-based speech translator are discussed. Following an overview of automatic speech-to-speech translation and its challenges, a comparison of different classification methods is presented and data collection techniques for that application are introduced.
|
| 17 Jun 2004 | Marcello Federico |
Statistical Machine Translation at ITC-irst
Time: 3:00 pm - 4:30 pm Location: 4th Floor Abstract: My presentation will give an overview of recent activities on Chinese-English SMT carried out at ITC-irst (Trento, Italy). After an overview of the complete architecture of our system, I will focus on progress made in Chinese word segmentation, phrase-based modeling and decoding, log-linear modeling and minimum error training, and language model adaptation. Experimental results will be provided in terms of BLEU and NIST scores on two translation tasks: basic traveling expressions and news reports, adopted respectively by the C-STAR consortium and for the 2002 and 2003 NIST MT evaluation campaigns. Bio: Marcello Federico has been a permanent researcher at ITC-irst since 1991. During 1998-2003, he led the "Multilingual natural speech technologies" (MUNST) research line at ITC-irst. Since 2004, he has been head of the "Cross-language information processing" (Hermes) research line. His interests include automatic speech recognition, statistical language modeling, information retrieval, and machine translation.
|
| 24 May 2004 | Philipp Koehn |
Challenges in Statistical Machine Translation
Time: 4:00 pm - 5:00 pm Location: 11 Large Abstract: In recent years, a standard model in statistical machine translation has emerged, which is based on translating sequences of words (so-called "phrases") at a time. I will describe this model and how to train and decode with it, but the focus of this talk will be how to address the challenges to advance and move beyond the model: my thesis work on noun phrase translation, making use of syntax, and better modeling, such as discriminative training. Bio: Philipp Koehn is the author of papers on natural language processing, machine translation, and machine learning. He received his PhD from the University of Southern California in 2003 (advisor: Kevin Knight), and is currently employed as a postdoc at the Massachusetts Institute of Technology, working with Michael Collins. He has worked at AT&T Laboratories on text-to-speech systems, and at WhizBang! Labs on text categorization.
|
| 21 May 2004 | Tom Murray and Rahul Bhagat |
Statistical Learning for Dialogue System and A Community of Words
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Natural Language Understanding: A Fast and Accurate Statistical Learning Approach for Dialogue Systems Natural Language Understanding (NLU) is an essential module of a good dialogue system. To achieve satisfactory performance levels, real-time dialogue systems need the NLU module to be both fast and accurate. Finite State Model (FSM) based systems are fast and accurate but lack robustness and flexibility. Statistical Learning Model (SLM) based systems are robust and flexible but lack accuracy and are often slow. In this talk, I am going to talk about an SLM-based NLU approach for dialogue utterances that is both accurate and fast. The system has high accuracy and produces frames in real time. A Community of Words: Understanding Social Relationships from E-mail A corpus of e-mail messages presents a number of challenges for NLP techniques, with its nearly unconstrained structure and vocabulary, mistyped words and ungrammatical sentences, and extensive contextual information that is never explicitly stated. Yet, the intrinsically social nature of such communication provides an opportunity to study not just a bag of words, but also the relationships, competencies, and activities behind them. This talk presents work with Eduard Hovy as part of the MKIDS project.
|
| 30 Apr 2004 | Liang Zhou |
Automating the Building of Summarization Systems
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Summarization requires one to identify the internal structure of information and to bring that to the surface both operationally and organizationally. How does one put this theory into practice and build real summarization systems? How do the systems built on this idea perform?
|
| 28 Apr 2004 | Dragos Munteanu, Radu Soricut and Hal Daume III |
Practice Talks for HLT/NAACL
Time: 3:00 pm - 5:00 pm Location: 11 Large Abstract: TBA
|
| 23 Apr 2004 | Hal Daume III |
A Tree-Position Kernel for Document Compression
Time: 3:00 pm - 4:00 pm Location: 10 Large Abstract: I'll describe our entry into the DUC 2004 automatic document summarization competition. We competed only in the single document, headline generation task. Our system is based on a novel kernel dubbed the tree position kernel, combined with two other well-known kernels. Our system performs well on white-box evaluations, but does very poorly in the overall DUC evaluation. C'est la vie. |
| 16 Apr 2004 | Rada Mihalcea (UNT) |
Graph-based Ranking Algorithms for Language Processing
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: Although we live in a predominantly statistical world, there are still many language processing applications that long for accurate representations of text meaning. Even applications that found partial solutions in statistical modeling, including information retrieval, machine translation, or automatic summarization, are likely to get a significant boost from deeper text understanding. In this talk, I will present an innovative method for automatic extraction of conceptual graphs as a means to represent text meaning. The method relies on a novel adaptation of graph-based ranking algorithms - traditionally (and successfully) used in citation analysis, Web page ranking, and social networks. I will show how such algorithms can be adapted to semantic networks, resulting in an efficient unsupervised method for resolving the semantic ambiguity of all words in open text, and identifying relations between entities in the text. I will also outline a number of applications that are enabled by this representation, including keyphrase extraction, domain classification, and extractive summarization.
BIO: Rada Mihalcea is an Assistant Professor of Computer Science at
University of North Texas. Her research interests are in lexical
semantics, minimally supervised natural language learning, and
multilingual natural language processing. She is currently involved in a
number of research projects, including word sense disambiguation, shallow
semantic parsing, (non-traditional) methods for building annotated corpora
with volunteer contributions over the Web, word alignment for language
pairs with scarce resources, and graph-based ranking algorithms for
language processing. Her research is supported by NSF and the state of
Texas.
|
| 13 Apr 2004 | Jill Burstein (ETS) |
Automated Essay Evaluation: From NLP research through deployment as a business
Time: 3:00 pm - 4:30 pm Location: 4 Large Abstract: Automated essay scoring was initially motivated by its potential cost savings for large-scale writing assessments. However, as automated essay scoring became more widely available and accepted, teachers and assessment experts realized that the potential of the technology could go way beyond just essay scoring. Over the past five years or so, there has been rapid development and commercial deployment of automated essay evaluation for both large-scale assessment and classroom instruction. A number of factors contribute to an essay score, including varying sentence structure, grammatical correctness, appropriate word choice, errors in spelling and punctuation, use of transitional words/phrases, and organization and development. Instructional software capabilities exist that provide essay scores and evaluations of student essay writing in all of these domains. The foundation of automated essay evaluation software is rooted in NLP research. This talk will walk through the development of the Criterion(SM), e-rater, and Critique writing analysis tools, automated essay evaluation software developed at Educational Testing Service, from NLP research through deployment as a business. (Preview of an HLT/NAACL-2004 Invited Speaker Presentation) Jill Burstein, Educational Testing Service, Princeton, NJ
|
| 09 Apr 2004 | Eduard Hovy |
Three (and a half?) Trends: The Future of NLP
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: An interesting (disturbing?) new trend is beginning to manifest itself in NLP, one that is focused on performance and hence very attractive in the context of inter-system competitive evaluations such as TREC and DUC, but one that does not provide much insight about language or NLP methods to the researcher interested in these topics. This addition of a new paradigm to NLP has implications for all of us.
|
| 02 Apr 2004 | Stephan Vogel |
The CMU Statistical Machine Translation System
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The presentation will give an overview of the SMT activities at the Language Technologies Institute, Carnegie Mellon University, in large-vocabulary text translation tasks, especially Chinese-English and Arabic-English, as well as in limited-domain speech-to-speech translation tasks. The CMU SMT system is, like most modern statistical MT systems, based on phrase translation. Several approaches have been developed to extract the phrase pairs from parallel corpora, and current research investigates different scoring approaches for these translation pairs. Details of the decoder, especially on hypothesis recombination, pruning, and efficient n-best list generation, will be given. Recently, the SMT system has been extended to use partial translations generated by example-based and grammar-based translation systems, thereby performing multi-engine machine translation. Bio: Stephan Vogel is a researcher at the Language Technologies Institute, Carnegie Mellon University, where he heads the statistical machine translation team. He received a Diploma in Physics from Philipps University Marburg, Germany, and a Master of Philosophy from the University of Cambridge, England. After working for a number of years on the history of science, he turned to computer science, especially natural language processing. Before coming to CMU, he worked for several years at the Technical University of Aachen on statistical machine translation, and also in the Interactive Systems Lab at the University of Karlsruhe.
|
| 26 Mar 2004 | Shlomo Argamon |
On Writing, Our Selves: Explorations in Stylistic Text Categorization
Time: 1:30 pm - 3:00 pm Location: 11 Large Abstract: This talk will survey results of several recent projects we have been undertaking in automated text categorization based upon the style, rather than the topic, of the documents. I will describe a general text-categorization framework using machine learning, along with general principles for choosing stylistically relevant sets of features for learning effective classification models. Applications of these methods include determining author gender and text genre in published books and articles, authorship attribution of email messages, and analysis of language use in different scientific fields. In many cases, the models that are learned also give some insight into the respective styles being distinguished, which I will also discuss. Bio: Shlomo Argamon is an associate professor at the Illinois Institute of Technology, Chicago.
|
| 25 Mar 2004 | Jon Patrick (U. of Sydney) |
ScamSeek: Capturing Financial Scams at the Coalface by Language Technology
Time: 10:30 am - 12:00 pm Location: 11 Large Abstract: The Scamseek project aims to build a surveillance tool for identifying financial scams on the Internet by performing document classification of Internet pages. There are three principal types of documents of concern: those that give financial advice by unregistered advisors, unlawful investment schemes, and share ramping. The first phase of the project has been completed and a working system, known as ScamAlert, has been installed at the Australian Securities and Investment Commission (ASIC). The independent audit of the performance of the system proved satisfactory, with a result for precision of .75, recall .43, and F=.54, along with identification of 4 scams misclassified by the client. Significant improvement in recall is foreshadowed in the 2nd phase of the project. The results are satisfying in the context of the structure of the data, where the density of scam documents is about 1.8% of the total corpus. The good performance of the operational system is ascribed to the combination of using a strong linguistic model of language (Systemic Functional Linguistics) to define the scam documents in parallel with a rich statistical analysis of the structure of non-scam documents and scam look-alikes. A large amount of the experimental program has concentrated on understanding and exploiting the interaction between the linguistically described aspects of the documents and the statistical properties. Each type of data has been used to inform and modify the usage of the other. The operational aspects of the project have proven to be as challenging as the research objectives. The project has a budget of $2.2M over 15 months. It has been managed so as to create a balance in resources between the needs of both the research objectives and the engineering objectives. Software development has concentrated on three aspects. 
First, to produce an environment for the strong directive management of computational linguistics experiments; second, to create tools that support the linguists in their manual analysis; and third, to apply best-practice software engineering principles to ensure a clean automated rollout of the production system for ASIC.
The contributing partners in the Scamseek project are The Capital Markets
Co-operative Research Centre (CMCRC), ASIC, the University of Sydney and
Macquarie University.
|
| 12 Mar 2004 | Deepak Ravichandran |
About My Thesis Proposal
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: TBA
|
| 20 Feb 2004 | Hal Daume III |
Some Results in Automatic Evaluation for Summarization and MT
Time: 3:00 pm - 4:00 pm Location: 4 Large Abstract: I will be presenting some recent results of mine regarding the possibility of automatic evaluation in summarization. I will discuss both my own findings and those of people here and at Columbia, and attempt to explain in a principled fashion why there are disparate opinions on the plausibility of performing automatic evaluation in this task. I will discuss my (perhaps pessimistic) views on the plausibility of doing any sort of evaluation of summarization, automatic or otherwise. The results and experimental setups developed in connection with summarization will be extended to machine translation. I will review possible reasons why metrics such as BLEU have experienced significantly more success in machine translation than in summarization. I will also connect the evaluation criteria developed in the context of summarization to machine translation, and discuss the automation of these methods. In short: I'll talk about why I've been doing so much data elicitation recently. This will be a highly informal seminar and participation is highly encouraged.
|
| 06 Feb 2004 | Mark Hopkins |
What's in a Translation Rule?
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We propose a theory that gives formal semantics to word-level alignments defined over parallel corpora. We use our theory to introduce a linear algorithm that can be used to derive from word-aligned, parallel corpora the minimal set of syntactically motivated transformation rules that explain human translation data. (joint work with Michel Galley, Kevin Knight, and Daniel Marcu)
|
| 30 Jan 2004 | Paul Kingsbury (Penn) |
PropBank: the next stage of Treebank and Inducing a Chronology of the Pali Canon
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: PropBank: the next stage of Treebank Natural-language engineers the world over are coming to a consensus that a degree of semantic knowledge is a necessary addition to purely structural representations of language. This talk describes the Propbank project at Penn, which provides a complete shallow semantic parse of the Treebank II corpus. Inducing a Chronology of the Pali Canon: Works such as Kroch (1989), Taylor (1994) and Han (2000) have demonstrated that syntactic change can be described mathematically as the competition between innovating and archaic formations. This paper demonstrates how this same mathematical description can be turned around to predict the date of a historical text. The Middle Indic period showed dramatic change in the morphological system, such as the collapse of the past-tense verbal system. Whereas Sanskrit had three competing formations, each with multiple possible morphological realizations, Pali (a Middle Indo-Aryan language) had only a single formation, based mostly on the sigmatic aorist although many archaic nonsigmatic aorists are also attested. The proportions of the archaic and innovative forms can be easily calculated for each text in the Pali Canon and these proportions used to assign an approximate date for each text. The accuracy of the method can be assessed qualitatively by comparing the derived chronology to chronologies based on various non-linguistic criteria, or quantitatively by comparing the derived chronology to a known dating scheme. For the latter it is necessary to turn to a different dataset, such as that describing the rise of do-support in Early Modern English, as described in Ellegard (1953) and Kroch (1989). Bio: Paul Kingsbury graduated summa cum laude in linguistics from Ohio State University in 1993 with a thesis on "Some sources for L-words in Sanskrit". 
He subsequently entered the University of Pennsylvania to study historical linguistics and Sanskrit, but (like most historical students) was diverted to computational issues. He joined the Propbank project in 2000 and soon thereafter engineered a major rethinking of the methods and goals of the project, in order to make the annotation linguistically meaningful. He completed his doctorate in 2002 with a thesis entitled 'The Chronology of the Pali Canon: the case of the aorist'.
|
| 16 Jan 2004 | John Prager (IBM) |
Using Constraints to Improve Question-Answering Accuracy
Time: 2:00 pm - 3:00 pm Location: 11 Large Abstract: Leading Question-Answering systems employ a variety of means to boost the accuracy of their answers. Such methods include redundancy (getting the same answer from multiple documents/sources), deeper parsing of questions and texts (hence improving the accuracy of confidence measures), inferencing (proving the answer from information in texts plus background knowledge) and sanity-checking (verifying that answers are consistent with known facts). To our knowledge, however, no QA system deliberately asks additional questions in order to derive constraints on the answers to the original questions.
We present in this talk the method of QA-by-Dossier-with-Constraints (QDC).
This is an extension of the simpler method of QA-by-Dossier, in which
definitional questions ("Who/what is X") are addressed by asking a set of
questions about anticipated properties of X. In QDC, the collection of
Dossier candidate answers, along with possibly other answers to questions
asked expressly for this purpose, are subjected to satisfying a set of
naturally-arising constraints. For example, for a "Who is X" question, the
system will ask about birth, accomplishment and death dates, which, if they
exist, must occur in that order, and also obey other constraints such as
lifespan. Temporal, spatial and kinship relationships seem to be
particularly amenable to this treatment, but it would seem that almost any
"factoid" question can benefit from QDC. We will discuss the setting-up
and application of constraint networks, and talk about how (and whether) to
develop the constraint sets automatically. We will demonstrate several
applications of QDC, and present one evaluation in which the F-measure for
a set of questions improved with QDC from .39 to .69.
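
The ordering and lifespan constraints described above can be illustrated with a small sketch. It is not the speaker's system: the brute-force enumeration, the function names, the 120-year lifespan bound, and the candidate years and confidence scores are all assumptions made for illustration, whereas the abstract speaks of constraint networks.

```python
from itertools import product


def consistent(birth, accomplishment, death, max_lifespan=120):
    """True if the three years occur in order and fit a plausible lifespan.

    max_lifespan is an assumed bound, not a value from the talk.
    """
    return birth <= accomplishment <= death and death - birth <= max_lifespan


def filter_dossier(births, accomplishments, deaths, max_lifespan=120):
    """Enumerate all combinations of candidate answers, keep those that
    satisfy the constraints, and rank them by the product of their
    (hypothetical) confidence scores, highest first."""
    combos = []
    for (b, cb), (a, ca), (d, cd) in product(births, accomplishments, deaths):
        if consistent(b, a, d, max_lifespan):
            combos.append(((b, a, d), cb * ca * cd))
    return sorted(combos, key=lambda c: c[1], reverse=True)


# Hypothetical (year, confidence) candidates returned for a "Who is X" question.
births = [(1809, 0.6), (1865, 0.4)]
accomplishments = [(1863, 0.7), (1776, 0.3)]
deaths = [(1865, 0.8), (1809, 0.2)]

ranked = filter_dossier(births, accomplishments, deaths)
print(ranked[0][0])  # the only combination that survives the constraints
```

Enumerating every combination is only workable for a handful of candidates per question; it is used here because it makes the effect of the constraints easy to see.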
|
| 19 Dec 2003 | Robert Krovetz (Ask Jeeves) |
More than One Sense Per Discourse
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: Previous research has indicated that when a polysemous word appears two or more times in a discourse, it is extremely likely that all of its occurrences will share the same sense (Gale et al. 92). However, those results were based on a coarse-grained distinction between senses (e.g., "sentence" in the sense of a 'prison sentence' vs. a 'grammatical sentence'). I conducted an analysis of multiple senses within two sense-tagged corpora, Semcor and DSO. These corpora used WordNet for their sense inventory. I found significantly more occurrences of multiple senses per discourse than reported in (Gale et al. 92) (33% instead of 4%). I also found classes of ambiguous words in which as many as 45% of the senses in the class co-occur within a document. I will discuss the implications of these results for the task of word-sense tagging and for the way in which senses should be represented. |
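
As a rough illustration of the kind of statistic discussed in this abstract, the sketch below counts how often a word that occurs at least twice in a document carries more than one sense tag. The exact counting procedure used in the talk is not given, so the definition here, the function name, and the toy data are assumptions; a real analysis would also restrict the count to words that are polysemous in the sense inventory.

```python
from collections import defaultdict


def multi_sense_rate(tagged_docs):
    """tagged_docs: list of documents, each a list of (word, sense) pairs.

    Returns the fraction of (document, word) cases, among words occurring
    at least twice in a document, that appear with more than one sense.
    """
    repeated = 0
    multi = 0
    for doc in tagged_docs:
        senses = defaultdict(set)
        counts = defaultdict(int)
        for word, sense in doc:
            senses[word].add(sense)
            counts[word] += 1
        for word, n in counts.items():
            if n >= 2:
                repeated += 1
                if len(senses[word]) > 1:
                    multi += 1
    return multi / repeated if repeated else 0.0


# Toy sense-tagged documents (hypothetical sense labels).
docs = [
    [("sentence", "prison"), ("sentence", "grammar"),
     ("bank", "river"), ("bank", "river")],
    [("sentence", "grammar"), ("sentence", "grammar"), ("bank", "money")],
]
rate = multi_sense_rate(docs)
print(f"{rate:.2f}")
```

In the toy data, three (document, word) cases have a repeated word and one of them shows two senses, so the rate is one third.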
| 25 Nov 2003 | Hang Li (MSR Beijing) |
Using Bilingual Data to Mine and Rank Translations
Time: 10:30 am - 12:00 pm Location: 11th Floor Large Abstract: In this talk, I will introduce some of the technologies which we have developed in the project on an English reading assistant system called English Reading Wizard. The technologies include a method for mining translations from the web (non-parallel corpora), a method for word translation disambiguation based on bootstrapping, which is called Bilingual Bootstrapping, and a general method of bootstrapping, which is called Collaborative Bootstrapping. First, I will introduce the main features of English Reading Wizard. Next, I will introduce each of the methods. The translation mining method is based on a naïve Bayesian ensemble and the EM algorithm. Bilingual Bootstrapping uses the asymmetric translation relationship between words in the two languages in translation and can construct reliable classifiers for word translation disambiguation. Collaborative Bootstrapping contains the co-training algorithm as its special case, and it uses the strategy of uncertainty reduction in training of the two classifiers. Bio: Hang Li is a researcher at the Natural Language Computing Group of Microsoft Research in Beijing, China. He is also an adjunct professor at Xian Jiaotong University. Hang Li obtained a B.S. in Electrical Engineering from Kyoto University (Japan) in 1988 and an M.S. in Computer Science from Kyoto University in 1990. He earned his Ph.D. in Computer Science from the University of Tokyo in 1998. From 1990 to 2001, Hang Li worked at the Research Laboratories of NEC Corporation in Kawasaki, Japan. He joined Microsoft Research in 2001. His research interests include statistical learning, natural language processing, data mining, and information retrieval. Hang Li's web site: http://research.microsoft.com/users/hangli/
|
| 17 Nov 2003 | Dr. Kato and Dr. Fukumoto (NTCIR) |
An Overview of the QA Challenge + NTCIR -- The Way Ahead
Time: 10:30 am - 12:00 pm Location: 4th Floor Abstract: An Overview of Question Answering Challenge Jun'ichi Fukumoto and Tsuneaki Kato In this talk, we will present an overview of Question Answering Challenge (QAC), which is the question answering task of the NTCIR Workshop. QAC-1 (the first evaluation of QAC) was carried out at NTCIR Workshop 3 in October 2002, and QAC-2 will be at NTCIR Workshop 4 in December 2003. In the QAC, systems to be evaluated are expected to return exact answers consisting of a noun or noun compound denoting, for example, the names of persons, organizations, or various artifacts, or numerical expressions such as money, size, or date. Those basically range over the Named Entity (NE) elements of MUC and IREX but are not limited to them. QAC consists of three kinds of subtasks: Task 1, where the systems are allowed to return five ranked possible answers; Task 2, where the systems are required to return a complete list of answers; and Task 3, where the systems are required to answer series of questions that have anaphora and zero-anaphora. We will present the results of QAC-1, and our vision and prospects for QAC-2. NTCIR -- the Way Ahead Noriko Kando Dr. Noriko Kando is the leader of the NTCIR (Test Collections and Evaluation of IR, Text Summarization, Q&A, etc.) project, and an associate professor at the National Institute of Informatics (NII). She got her Ph.D. in 1995 from Keio University. Her research interests include evaluation of information retrieval systems, technologies to "Make Information Usable for Users", cross-lingual information retrieval, and analysis of text structure, genre, citations and links. She is a member of the editorial boards of International Journal on Information Processing and Management, ACM-Transaction on Asian Language Information Processing, etc. Jun'ichi Fukumoto and Tsuneaki Kato are task organizers of QAC. Dr. Jun'ichi Fukumoto is an associate professor at Ritsumeikan University. He got his Ph.D. in 1999 from University of Manchester Institute of Science and Technology. His research interests include Q&A, automatic summarization, and dialogue processing. Dr. Tsuneaki Kato is an associate professor at the University of Tokyo. He got his Dr. of Engineering in 1995 from Tokyo Institute of Technology. His research interests include multimodal dialogue processing, multimodal presentation generation, and domain-independent question answering. He is a member of the editorial committee of the transactions on information and systems of The Institute of Electronics, Information and Communication Engineers.
|
| 27 Oct 2003 | Christopher Manning (Stanford) |
Natural Language Parsing: Graphs, the A* Algorithm, and Modularity
Time: 10:00 am - 11:00 am Location: 11 Large Abstract: Probabilistic parsing methods have in recent years transformed our ability to robustly find correct parses for open domain sentences. Much of this work has been within a common architecture of heuristic search for good parses in lexicalized probabilistic context-free grammars, with many layers of back-off to avoid problems of sparse data. In this talk, I will outline some different ideas that we have been pursuing. I will connect stochastic parsing with finding shortest paths in hypergraphs, and show how this approach naturally provides a chart parser for arbitrary probabilistic context-free grammars (finding shortest paths in a hypergraph is easy; the central problem of parsing is that the hypergraph has to be constructed on the fly). From this viewpoint, a natural approach is to use the A* algorithm to cut down the work in finding the best parse. On unlexicalized grammars, this can reduce the parsing work done dramatically, by at least 97%. This approach is competitive with methods standardly used in statistical parsers, while ensuring optimality, unlike most heuristic approaches to best-first parsing. Finally, I will present a novel modular generative model in which semantic (lexical dependency) and syntactic structures are scored separately. This factored model is conceptually simple, linguistically interesting, admits exact inference with an extremely effective A* algorithm, and provides straightforward opportunities for separately improving the component models. In particular, I will mention some of the work we have done focusing on the PCFG component to produce a very high accuracy unlexicalized grammar. This is joint work with Dan Klein. About the Speaker: Christopher Manning is an Assistant Professor of Computer Science and Linguistics at Stanford University. He received his Ph.D. 
from Stanford University in 1995, and served on the faculty of the Computational Linguistics Program at Carnegie Mellon University (1994-1996) and the University of Sydney Linguistics Department (1996-1999) before returning to Stanford. His research interests include probabilistic models of language, natural language parsing, constraint-based linguistic theories, syntactic typology, information extraction and text mining, and computational lexicography. He is the author of three books, including Foundations of Statistical Natural Language Processing (MIT Press, 1999, with Hinrich Schuetze).
Chris' schedule is available in Postscript or
PDF format.
|
| 17 Oct 2003 | Hovy, Marcu, Knight, Byrd, Narayanan, Traum, Gordon |
Introduction to CL Research
Time: 3:00 pm - 4:30 pm Location: 11 Large Abstract: The annual Computational Linguistics Open House will be held at USC's Information Sciences Institute from 3:00-4:30pm in the 11th floor Conference Room. Researchers from ISI, including Eduard Hovy, Daniel Marcu, and Kevin Knight will present overviews of their latest research. We will also hear about the research activities of Dani Byrd of the Linguistics Department, Shri Narayanan's group in EE, and David Traum and Andrew Gordon of USC's Institute for Creative Technologies.
|
| 10 Oct 2003 | Philipp Koehn |
Advances in Statistical MT: Phrases, Noun Phrases and Beyond
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: (This is a practice run for a talk I will give a few times over the next weeks when interviewing for job positions.) I will review the state of the art in statistical machine translation (SMT), present my dissertation work, and sketch out the research challenges of syntactically structured statistical machine translation. The currently best methods in SMT build on the translation of phrases (any sequences of words) instead of single words. Phrase translation pairs are automatically learned from parallel corpora. While SMT systems generate translation output that often conveys a lot of the meaning of the original text, it is frequently ungrammatical and incoherent. The research challenge at this point is to introduce syntactic knowledge to the state of the art in order to improve translation quality. My approach breaks up the translation process along linguistic lines. I will present my thesis work on noun phrase translation and ideas about clause structure.
|
| 03 Oct 2003 | Anton Leuski |
A Year in Paradise
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I would like to talk about some of the things I did during the last year. I will discuss and demonstrate CuSTaRD, a cross-lingual information retrieval, organization, summarization, and visualization system that was built for the Surprise Language exercise. I will focus in more detail on iNeATS, the interactive multi-document summarization part of CuSTaRD. The other project I plan to present is eArchivarius, a system for accessing collections of electronic mail.
|
| 02 Oct 2003 | Ana-Maria Popescu |
TBA
Time: 4:00 pm - 5:00 pm Location: 11 Large Abstract: |
| 15 Sep 2003 | Beata Klebanov |
Analyzing Sentences into Facts: Simple is Beautiful
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: I present my summer project - writing rule-based software for simplifying texts. Task definition and motivations will be discussed, as well as human and automatic evaluation, the latter using a question answering system. This is joint work with Daniel Marcu and Kevin Knight.
|
| 12 Sep 2003 | Lara Taylor |
Discourse Coherence for Ordering Information
Time: 2:30 pm - 4:00 pm Location: 11 Large Abstract: In this talk, I look at how the notion of discourse coherence can be modeled computationally. I begin with the following idea: if you take a text and shuffle its sentences into a random order, that text will no longer make sense. In other words, the text will be "incoherent". Our task is to learn how to reassemble a shuffled text into an order that humans would consider to be coherent. I discuss practical and theoretical motivations for the task, evaluations of our model, increases in performance achieved over the summer, and directions for future research. This work was done in collaboration with Kevin Knight, Daniel Marcu, Jonathan Graehl and Nick Mote.
|
| 05 Sep 2003 | Nishit Rathod and Anish Nair |
Deciphering Hindi Scripts
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: A major hurdle in building automated information retrieval systems for Hindi text is the lack of a uniform encoding for text representation. Standards do exist, but no one seems interested. Every web content publisher seems to have their own encoding system, making information extraction a nightmare. We explore an unsupervised approach to convert any given "unknown" encoding to UTF-8, by treating it as a decipherment problem. We also study how a small amount of supervision can improve decoding accuracy.
|
| 03 Sep 2003 | Alex Fraser and Franz Och |
JHU MT Workshop
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We will present the results of the 2003 Johns Hopkins University Summer Workshop on "Syntax for Statistical Machine Translation". We will describe a large effort to extend a high-performing phrase-based MT system as baseline by adding new features representing syntactic knowledge that deal with specific problems of the underlying baseline. We investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions test if a certain constituent occurs in the source and the target language parse tree. More sophisticated features will be derived from an alignment model where whole sub-trees in source and target can be aligned node by node. We present results on the Chinese-English large data track of the recent TIDES MT evaluations. This is joint work with the other workshop team members: Daniel Gildea, Anoop Sarkar, Sanjeev Khudanpur, Kenji Yamada, Libin Shen, Shankar Kumar, David Smith, Viran Jain, Katherine Eng, Jin Zhen and Dragomir Radev. See http://www.clsp.jhu.edu/ws03/groups/translate/ for more.
|
| 29 Aug 2003 | Stefan Riezler |
Deepening Representations
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 27 Aug 2003 | Michel Galley and Mark Hopkins |
Syntax for Statistical MT
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 22 Aug 2003 | Satoshi Sekine |
Information Extraction, IR and QA
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 15 Aug 2003 | Beata Klebanov |
On Her Masters Research
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 01 Aug 2003 | Shou-de Lin |
Toward deciphering the 2-dimensional ancient Luwian script by discovering its writing order
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 29 Jul 2003 | Michael Brasser |
A Model of Word Movement for Machine Translation
Time: 3:00 pm - 4:00 pm Location: 11 Small Abstract: |
| 25 Jul 2003 | Jonathan Graehl and Kevin Knight |
Super-Carmel for Trees
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 18 Jul 2003 | Doug Oard |
A Maryland Yankee in King Eduard's Court: Some Remarks on a Year in Paradise
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 27 Jun 2003 | Michael Fleischman |
Offline Strategies for Online Question Answering: Answering Questions Before They Are Asked and Maximum Entropy Models for FrameNet Classification
Time: 3:00 pm - 4:00 pm Location: 10 Large Abstract: |
| 12 Jun 2003 | Dina Demner-Fushman |
Measuring the Effect of Dictionary Coverage on Cross-Language Retrieval
Time: 11:00 am - 12:00 pm Location: 11 Large Abstract: Bilingual term lists have proven to be a useful basis for dictionary-based Cross-Language Information Retrieval (CLIR), but there is ample anecdotal evidence that differences in vocabulary coverage can have a substantial impact on retrieval effectiveness. This issue has recently been explored using ablation studies in which progressively smaller term lists were synthesized using sampling techniques. The ablation techniques used in those studies have not, however, been validated using real term lists. In this talk I will report the results of what we believe is the first large coverage study to use naturally occurring term lists. Thirty-five bilingual term lists were obtained from a variety of sources, each with English as one of the two paired languages. From these, we created 35 English-to-English term lists by taking each term that was present in the English side of the list as its own translation. When used with an English information retrieval test collection, this allowed us to measure the reduction in retrieval effectiveness that could be attributed to deficiencies in the coverage of English terms. Eight types of untranslatable terms were identified in a collection of news stories, of which named entities were found to have the greatest impact on retrieval effectiveness. Differences in named entity coverage were found to produce large differences in retrieval effectiveness for term lists of similar sizes. Controlling for named entity effects yielded a clear relationship between retrieval effectiveness and the size of the translatable English vocabulary. The functional dependence that we observed is consistent with one previously applied ablation technique and inconsistent with another. Our results indicate that the outcome of a widely cited landmark study of query expansion effects for CLIR was likely affected by a flawed ablation model. 
We conclude our talk with a suggestion for further work on that topic, and a simple prescription for avoiding such problems in the future.
|
| 23 May 2003 | Liang Zhou |
A Web-Trained Extraction Summarization System and Headline Summarization at ISI
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: 1) A serious bottleneck in the development of trainable text summarization systems is the shortage of training data. Constructing such data is a very tedious task, especially because there are in general many different correct ways to summarize a text. Fortunately we can utilize the Internet as a source of suitable training data. In this paper, we present a summarization system that uses the web as the source of training data. The procedure involves structuring the articles downloaded from various websites, building adequate corpora of (summary, text) and (extract, text) pairs, training on positive and negative data, and automatically learning to perform the task of extraction-based summarization systems. 2) Headlines are useful for users who only need information on the main topics of a story. We present a headline summarization system that is built at ISI for this purpose and is a top performer for DUC2003's task 1, generating very short summaries (10 words or less).
|
| 20 May 2003 | Michel Galley |
Discourse Segmentation of Multi-Party Conversation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: |
| 16 May 2003 | Chin-Yew Lin |
Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, based on various statistical metrics, while direct application of the BLEU evaluation procedure does not always give good results.
|
| 09 May 2003 | Doug Oard |
Coping with Surprise: The Case of Cebuano
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: For ten days in March, nine research teams worked together to build Cebuano language resources and systems for a "dry run" of the TIDES Surprise Language experiment. Cebuano is spoken widely in the southern Philippines, but there had previously been little work on computational linguistics for that language. As we prepare for the actual Surprise Language experiment this June, we will use this talk to look back on what worked, what didn't, and what lessons there are to be learned from our experience in March. Come prepared to share the excitement, offer your ideas, and understand why we have tried to ask Ed to cancel all vacations during the month of June (just kidding...).
|
| 02 May 2003 | Hal Daumé III |
Acquiring Paraphrase Templates from Document/Abstract Pairs
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We present an approach to automatically extracting paraphrase templates from document/abstract pairs. This methodology relies on word-based alignments created by off-the-shelf software. Our paraphrases are evaluated by human evaluators for precision and automatically for applicability. We find that 77% of the extracted paraphrases are judged to be always correct and that the generalized templates of 60% are judged to be applicable most of the time and 87% are judged to be applicable sometimes.
|
| 25 Apr 2003 | Quamrul Tipu |
Statistical MT with Bilingual Morphology
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Traditional statistical MT systems mostly work at the word and phrase level. For different language pairs, the performance of such systems varies from some 15% to 35%. These systems suffer from problems such as sparse data, with huge vocabulary sizes leading to less reliable probability estimates. In our current research, we aim to come up with a better MT system by looking inside the words. In almost every language, a root (stem) can have many different forms (inflectional, derivational, etc.). If we can identify the roots, the size of the vocabulary will be quite small, and we can have better probability estimates, reducing the sparse data problem and potentially leading to higher accuracy. We are trying to come up with a model that induces morphology automatically from a bilingual corpus and achieves this improvement.
|
| 04 Apr 2003 | Donghui Feng |
Natural Language Understanding in MRE
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: In this talk, I will present my current work on language understanding in the Mission Rehearsal Exercise (MRE) project. One of the challenges in a dialogue system is to provide a robust understanding/parsing component. We applied both a Finite State Model and a Statistical Learning Model to parse the individual sentences of dialogue utterances. Their performance is evaluated and compared on a new blind set. We hope to combine them into a better solution for this specific application.
|
| 21 Mar 2003 | Gareth Jones |
An Investigation of the Application of Broad Coverage Automatic Pronoun Resolution in Information Retrieval
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Term weighting methods have been shown to give significant increases in information retrieval performance. Term weights are typically calculated using frequency counts across the whole retrieval collection, frequency of each term within individual documents and compensation for varying document length. The presence of pronominal references in documents effectively reduces the within-document term frequency of associated words, with a consequent effect on term weights and information retrieval behaviour. This presentation will describe an experimental investigation into the impact on information retrieval performance of broad coverage automatic pronoun resolution. Results using a standard information retrieval test collection indicate that calculating term weights using a pronoun-resolved version of the document test collection can improve both fixed cutoff and average retrieval precision.
|
| 14 Mar 2003 | Kareem Darwish |
Improving the Efficiency and Effectiveness of Structured Query Methods
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: One of the key challenges in retrieval is what to do when a query term needs to be replaced with more than one term. This problem arises in applications such as cross-language information retrieval and thesaurus expansion. One solution is to use structured query methods, which treat all the possible replacements as if they were one query term by computing a joint document frequency and a joint term frequency. This presentation will review prior work on structured query techniques and then introduce three new variants that aim to improve computational efficiency and to leverage estimates of replacement probabilities to improve retrieval effectiveness. The methods have now been tested in cross-language retrieval and OCR-degraded text retrieval applications in which replacement probability estimates could be obtained. In both applications, the new structured query methods showed statistically significant improvements in retrieval effectiveness over previously known structured query methods.
|
| 07 Mar 2003 | Scott Klemmer |
Books with Voices: Paper Transcripts as a Tangible Interface to Oral Histories
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Our contextual inquiry into the practices of oral historians unearthed a curious incongruity. While oral historians consider interview recordings a central historical artifact, these recordings sit unused after a written transcript is produced. We hypothesized that this is largely because books are more usable than recordings. Therefore, we created Books with Voices: bar-code augmented paper transcripts enabling fast, random access to digital video interviews on a PDA. We present quantitative results of an evaluation of this tangible interface with 13 participants. They found this lightweight, structured access to original recordings to offer substantial benefits with minimal overhead. Oral historians found a level of emotion in the video not available in the printed transcript. The video also helped readers clarify the text and observe nonverbal cues. |
| 28 Feb 2003 | Radu Soricut |
Sentence Level Discourse Parsing using Syntactic and Lexical Information
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: We introduce two probabilistic models that can be used to identify elementary discourse units and build sentence-level discourse parse trees. The models use syntactic and lexical features. A discourse parsing algorithm that implements these models derives discourse parse trees with an error reduction of 18.8% over a state-of-the-art decision-based discourse parser. A set of empirical evaluations shows that our discourse parsing model is sophisticated enough to yield discourse trees at an accuracy level that matches near-human levels of performance.
|
| 21 Feb 2003 | Nate Chambers |
Statistical Language Generation in a Dialogue System
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: The large corpora of written text that are available to the language community have largely been utilized for language understanding; they have been somewhat ignored in the context of language generation. Recent developments in stochastic generation have allowed such systems to shift the burden from hand-crafted databases (lexicons, grammars, ontologies) to the knowledge implicitly found in written text. However, when building a dialogue system, generation is largely interactive, very different from the written structure of most corpora. In this talk, I will discuss my recent work on applying a stochastic generator, HALogen, and its newswire language model to a dialogue system, TRIPS. I'll describe the difficulties in mapping the TRIPS semantic form into HALogen's representation, the critical differences between newswire and dialogue, and the possibility of using HALogen and a large newswire model as a domain-independent generator.
|
| 07 Feb 2003 | Jeongwon Cha |
Automatic Pattern Learning for Information Extraction using Web Data
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will give a status report on my work on information extraction during the last 10 months. The motivation of this work is to learn extraction patterns automatically using a seed template and a web search engine. My approach is to generate linguistic patterns and surface patterns and combine them to compensate for the respective weaknesses of the two kinds of patterns. On DUC01-test-disasters (67 documents) and DUC01-training-disasters (54 documents), I obtained f-measures of 0.34 and 0.26, respectively. In this talk, I will give a status report on the ReAD project (with Dr. Chin-Yew Lin).
|
| 31 Jan 2003 | Philipp Koehn |
Noun Phrase Translation
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: I will give a status report on my current thesis work on noun phrase translation. The motivation of this work is to break up the machine translation problem into smaller, more manageable units. The treatment of noun phrase translation as a subtask of machine translation is both linguistically and empirically motivated. My approach is to generate an n-best list of candidate translations with a statistical machine translation system and rerank the candidates with additional features. For about 90% of all noun phrases we can find an acceptable translation in the 100-best list, while an acceptable translation comes out on the very top for only about 60% of the noun phrases. I will discuss a variety of linguistic and empirical features that (may) help to move the acceptable translations higher in the list. I will also present results on modeling issues such as phrase-based translation and compound splitting. This talk is also intended as a fishing expedition for feature suggestions by the audience.
|
| 24 Jan 2003 | Doug Oard & Anton Leuski |
Access to Archival Collections of Electronic Mail
Time: 3:00 pm - 4:00 pm Location: 11 Large Abstract: Since its inception more than 30 years ago, electronic mail (email) has developed into a powerful communication medium with applications that extend well beyond simple asynchronous message exchange between individuals. Automated tools to support the use of email in individual, organizational and social contexts have received increasing attention in recent years. Among the tasks that are now supported are filtering (e.g., spam detection), aggregation (e.g., mailing list digests), workflow management (e.g., help desk routing), and reuse (e.g., retrospective search). We are interested in how today's email will be used in the future -- some will certainly be preserved (indeed, some MUST be preserved!), and those records will serve as powerful evidence of how we lived our lives and organized our societies. The challenges of managing many types of electronic record collections are receiving increasing attention, but we are not aware of any work yet on supporting access to electronic mail archives. That will be the focus of this talk. We will introduce the Open Archival Information Systems (OAIS) model, and then focus on two key processes: ingestion and access. Our focus in ingestion is on support for review and redaction, which we believe will be key enablers to acquisition and near-term access. For access, we will address both browsing based on provenance (original order) and user-guided reorganization based on search and visualization. Along the way, we will identify potentially productive opportunities to apply natural language processing technologies such as topic segmentation, link detection, and summarization. We will then describe two test collections, and demonstrate a system that we have developed to explore user-guided reorganization through visualization for one of those collections. We will conclude the talk by sketching out a research agenda. 
At that point, we will expect suggestions and comments from the audience. Knowing this audience, it is unlikely that we will need to wait that long :-).
|
