Title

Convex Approaches to Text Summarization

Author

Gawalt, Brian

Subject

Classification

Library

Center and Library of Islamic Studies in European Languages

Location

Province: Qom, City: Qom

Library contact: 32910706-025

NATIONAL BIBLIOGRAPHY NUMBER

Number

TL5px8t6xj

LANGUAGE OF THE ITEM

Language of Text, Soundtrack, etc.

English

TITLE AND STATEMENT OF RESPONSIBILITY

Title Proper

Convex Approaches to Text Summarization

General Material Designation

[Thesis]

First Statement of Responsibility

Gawalt, Brian

Subsequent Statement of Responsibility

El Ghaoui, Laurent

PUBLICATION, DISTRIBUTION, ETC.

Date of Publication, Distribution, etc.

2012

DISSERTATION (THESIS) NOTE

Body granting the degree

El Ghaoui, Laurent

Text preceding or following the note

2012

SUMMARY OR ABSTRACT

Text of Note

This dissertation presents techniques for the summarization and exploration of text documents. Many approaches to the analysis of news media are analogous to well-defined, well-studied problems in statistical machine learning. The problem of feature selection, for classification and dimensionality reduction tasks, is formulated to assist with these media analysis tasks. Taking advantage of L1 regularization, convex programs can solve these feature selection problems efficiently. This demonstrates the potential to conduct media analysis at a scale commensurate with the growing volume of data available to news consumers. First, an example of text mining over a vector space model is presented. Given two news articles on a related theme, a series of additional articles is pulled from a large pool of candidates to help link the two input items. The novel algorithm is based on finding the documents whose vector representations are nearest the convex combinations of the inputs. Comparisons with competing algorithms show performance matching a state-of-the-art method at lower computational complexity. The design of a relational database for typical text mining tasks is then discussed. The architecture trades off the organizational and data quality advantages of normalization against the performance gains from replicating entity attributes across tables. The vector space model of text is implemented explicitly as a three-column table. The predictive framework, connecting news analysis tasks to feature selection and classification problems, is then explored explicitly. The validity of this analogy is tested with a particular task: given a query term and a corpus of news articles, provide a short list of word tokens that distinguish how this word appears within the corpus. Example summary lists were produced by five algorithms and presented to volunteer readers. Evidence suggests that an implementation of an L1-regularized logistic regression model, trained over the documents with labels indicating the presence or absence of the query word, selected the word features that best summarize the query. To contend with tasks that do not lend themselves to this predictive framework, a sparse variant of latent semantic indexing is investigated. [cont.]
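
The document-linking idea mentioned in the abstract (finding candidates nearest the convex combinations of two input vectors) can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the abstract does not state the distance measure, the weighting scheme, or the ranking procedure, so the sketch assumes plain Euclidean distance to the line segment between the two inputs. The function names `distance_to_segment` and `linking_documents` and the toy vectors are hypothetical.

```python
from math import sqrt


def distance_to_segment(x, a, b):
    """Euclidean distance from x to the set {t*a + (1-t)*b : 0 <= t <= 1}.

    Assumption: Euclidean distance; the abstract does not name a metric.
    """
    # Points on the segment can be written as b + t*(a - b) with t in [0, 1].
    d = [ai - bi for ai, bi in zip(a, b)]
    dd = sum(di * di for di in d)
    if dd == 0.0:
        t = 0.0  # a and b coincide, so the segment is a single point
    else:
        # Unconstrained projection coefficient of x onto the line through a and b.
        t = sum((xi - bi) * di for xi, bi, di in zip(x, b, d)) / dd
        # Clamping t to [0, 1] keeps the combination convex.
        t = min(1.0, max(0.0, t))
    return sqrt(sum((xi - (bi + t * di)) ** 2 for xi, bi, di in zip(x, b, d)))


def linking_documents(a, b, candidates, k):
    """Return the ids of the k candidates closest to the segment between a and b."""
    ranked = sorted(
        candidates,
        key=lambda doc_id: distance_to_segment(candidates[doc_id], a, b),
    )
    return ranked[:k]


# Toy three-term vocabulary; the vectors are made up for illustration.
a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]
candidates = {
    "mix": [0.5, 0.5, 0.0],        # lies exactly between the two inputs
    "near_a": [0.9, 0.0, 0.1],     # close to input a
    "off_topic": [0.0, 0.0, 1.0],  # uses only the third term
}
top = linking_documents(a, b, candidates, 2)
```

In this toy case the candidate that sits on the segment ranks first and the off-topic candidate is excluded. Real term vectors would be sparse and high-dimensional, which this dense-list sketch does not address.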

PERSONAL NAME - PRIMARY RESPONSIBILITY

Wyman, Dana Elizabeth

PERSONAL NAME - SECONDARY RESPONSIBILITY

Gawalt, Brian

CORPORATE BODY NAME - SECONDARY RESPONSIBILITY

UC Berkeley

ELECTRONIC LOCATION AND ACCESS

Electronic name