San Diego State University
Seminar in linguistics: Analyzing Web Texts

Linguistics 795

 

Spring 2006
Tue 7:00–9:40
Room AH-2131

Web text, such as blogs, newsgroups, message boards, and email lists, can provide an easily collected and incredibly rich source of data on a nearly limitless range of topics. However, the sheer quantity of data makes comprehensive qualitative analysis impossible, and the nature of web texts presents a set of unique challenges for standard computational methods.

In this seminar, we will investigate web texts as a distinct text type (or types), looking at the linguistic and extra-linguistic properties that make them unique. We will also explore some of the data-intensive methods that can be used to extract useful information from large, noisy collections of web texts.

Topics that may be covered include:

  • using internet spiders to collect web texts
  • dealing with the multi-lingual web
  • processing para-linguistic cues like page formatting and multimedia
  • mining web pages for content (sentiment analysis)

The final content of the course will depend on the interests of the students who enroll.

Instructor

Rob Malouf
Office: BA 310A
Office Hours: Tues 4:30–5:30, Thurs 1:00–2:00, or by appointment
Email: rmalouf@mail.sdsu.edu
Phone: (619) 594-7111

Requirements

The goals of this course are for us to gain experience in:

  • exploring the state of the art of linguistically motivated techniques for analyzing web texts,
  • reading and evaluating the primary literature,
  • presenting and discussing research material with peers,
  • identifying open research questions,
  • and designing and carrying out our own experiments.

Throughout the term, participants (including auditors!) will present and discuss articles from the reading list, which covers a number of aspects of text and web mining.

In addition to leading and participating in discussions, students taking the class for a grade will also prepare a final project. Projects should somehow involve web texts and NLP, but need not be restricted to the methods we cover in class. Ideally, the final project should be something that could be submitted to one of the many computational linguistics conferences.

The final grade will be based on class participation and on a project that applies text mining technology to a useful and interesting problem:

Project proposal (<1 page)   Feb 28     10%
Annotated bibliography       March 21   10%
Data set                     April 4    10%
Final project                May 18     50%
Class participation                     20%

Working in groups (of 2 or 3) is strongly encouraged!

Schedule

Week 1 (1/24) Introduction
slides, handout

Danet, Brenda. 1998. "Computer-Mediated Communication." In Bouissac, P. (ed.), Encyclopedia of Semiotics. New York: Oxford University Press. [Rob]

Herring, Susan C. 2003. "Computer-Mediated Discourse." In D. Tannen, D. Schiffrin, and H. Hamilton (eds.), Handbook of Discourse Analysis. Oxford: Blackwell. Pages 612–634. [Rob]

Week 2 (1/31) Qualitative genre studies

Herring, Susan C. 2004. "Computer-Mediated Discourse Analysis: An approach to researching online behavior." In S.A. Barab, R. Kling, and J.H. Gray (eds.), Designing for Virtual Communities in the Service of Learning. New York: Cambridge University Press. Pages 338–376. [Bil]

Yates, JoAnne and Wanda J. Orlikowski. 1993. "Knee-jerk anti-LOOPism and other e-mail phenomena: Oral, written, and electronic patterns in computer-mediated communication." MIT Sloan School Working Paper #3578-93. [Erin]

Herring, Susan C., L. A. Scheidt, S. Bonus, and E. Wright. 2005. "Weblogs as a bridging genre." Information, Technology & People 18(2):142–171. [Jerry]

Week 3 (2/7) Quantitative genre studies

Collot, Milena and Nancy Belmore. 1996. "Electronic language: A new variety of English." In S. Herring (ed.), Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives. Amsterdam: John Benjamins. Pages 13–28. [Rebecca]

Yates, Simeon J. 1996. "Oral and written aspects of computer conferencing." In S. Herring (ed.), Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives. Amsterdam: John Benjamins. Pages 29–46. [Emily]

Week 4 (2/14) Virtual communities

Ellis, David, Rachel Oldridge, and Ana Vasconcelos. 2005. "Community and virtual community." Annual Review of Information Science and Technology 38(1):145–186. [Grace]

Burnett, Gary. 2000. "Information exchange in virtual communities: A typology." Information Research 5(4). [Lucien]

Burnett, Gary and Harry Buerkle. 2004. "Information exchange in virtual communities: A comparative study." Journal of Computer-Mediated Communication 9(2). [Paul]

Week 5 (2/21) Resource discovery

Blog search tools: blo.gs, Bloglines, Technorati, NITLE Blog Census, Blogosphere Ecosystem, Feedster, Daypop [Edison]

Govcom.org's Issue Crawler [Eric]

Menczer, Filippo, Gautam Pant, and Padmini Srinivasan. 2004. "Topical web crawlers: Evaluating adaptive algorithms." ACM Transactions on Internet Technology 4(4):378–419. [Bil]

Flake, Gary, Steve Lawrence, and C. Lee Giles. 2000. "Efficient identification of web communities." ACM Conference on Knowledge and Data Discovery (KDD 2000), pages 150–160. [Eugenie]

Week 6 (2/28) Website classification

Glover, Eric, Gary Flake, Steve Lawrence, William P. Birmingham, Andries Kruger, C. Lee Giles, and David M. Pennock. 2001. "Improving category specific web search by learning query modifications." Symposium on Applications and the Internet, SAINT 2001, San Diego. [Paul]

Ghani, Rayid, Rosie Jones, and Dunja Mladenic. 2005. "Building minority language corpora by learning to generate web search queries." Knowledge and Information Systems 7:56–83. [Heather]

Craven, M., D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 1998. "Learning to extract symbolic knowledge from the World Wide Web." In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98). [Erin]

Boese, Elizabeth and Adele E. Howe. 2005. "Effects of web document evolution on genre classification." In ACM Fourteenth Conference on Information and Knowledge Management (CIKM), Bremen, Germany. [Anna]

Week 7 (3/7) Structure analysis

Kushmerick, Nicholas. 2000. "Wrapper induction: Efficiency and expressiveness." Artificial Intelligence 118:15–68. [Eric]

Ikeda, Daisuke, Yasuhiro Yamada and Sachio Hirokawa. 2003. "Expressive power of tree and string based wrappers." In Workshop on Information Integration on the Web. [Lucien]

Yi, Lan, Bing Liu, and Xiaoli Li. 2003. "Eliminating noisy information in web pages for data mining." In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA. [Heather]

Yang, Yudong and Hong Jiang Zhang. 2001. "HTML page analysis based on visual cues." In Sixth International Conference on Document Analysis and Recognition (ICDAR'01). [Grace]

Spring Break (3/14)

Week 8 (3/21) Class cancelled

Week 9 (3/28) AAAI Spring Symposium

Week 10 (4/4) Sentiment analysis

AAAI Spring Symposium report [Rob, Erin]

Nelson, Mark. 1996. "Fast string searching with suffix trees." Dr. Dobb's Journal. [Rob]

Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. "Predicting the semantic orientation of adjectives." In Proceedings of the 35th Annual Meeting of the ACL. 174–181. [John]

Turney, Peter D. 2002. "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02). 417–424. [Edison]

Week 11 (4/11) Subjective language

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. "Thumbs up? Sentiment classification using machine learning techniques." In Proceedings of EMNLP 2002. 79–86. [Emily]

Pang, Bo and Lillian Lee. 2004. "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts." In Proceedings of the 42nd ACL. 271–278. [Jerry]

Mishne, Gilad and Natalie Glance. 2006. "Predicting movie sales from blogger sentiment." In AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW 2006). [Rob]

Riloff, Ellen and Janyce Wiebe. 2003. "Learning extraction patterns for subjective expressions." In Proceedings of EMNLP 2003. 105–112. [Eugenie]

Week 12 (4/18) Extensible Markup Language (XML)

Wilson, Theresa, Janyce Wiebe and Rebecca Hwa. 2004. "Just how mad are you? Finding strong and weak opinion clauses." In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-2004). [Edison]

Wilson, Theresa, Janyce Wiebe and Paul Hoffmann. 2005. "Recognizing contextual polarity in phrase-level sentiment analysis." In Proceedings of Human Language Technologies Conference/Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005). [Rebecca]

TBA [Bil, Rob]

Week 13 (4/25) Ontologies

Hearst, Marti A. 1992. "Automatic acquisition of hyponyms from large text corpora." In Proceedings of the Fourteenth International Conference on Computational Linguistics. [Edison]

Cimiano, Philipp and Steffen Staab. 2004. "Learning by Googling." SIGKDD Explorations 6(2):24–33. [Jerry]

Lin, Dekang. 1998. "Automatic retrieval and clustering of similar words." In Proceedings of COLING/ACL '98, pages 768–774, Montreal, Canada, August. [Paul]

Pantel, Patrick and Deepak Ravichandran. 2004. "Automatically labeling semantic classes." In Proceedings of HLT/NAACL 2004. [Lucien]

Week 14 (5/2) Folksonomies

McGuinness, Deborah L. 2002. "Ontologies come of age." In D. Fensel, et al. (eds.), Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press. [Eric]

Quintarelli, Emanuele. 2005. "Folksonomies: power to the people." Paper presented at the ISKO Italy-UniMIB meeting, Milan. [Erin]

Brooks, Christopher H. and Nancy Montanez. 2006. "Improved annotation of the blogosphere via autotagging and hierarchical clustering." In Proceedings of the 15th World Wide Web Conference (WWW06), Edinburgh, Scotland. [Anna]

Guy, Marieke and Emma Tonkin. 2006. "Folksonomies: Tidying up tags?" D-Lib Magazine 12(1). [Rebecca]

Week 15 (5/9) Projects

Heather, Edison, Anna, Jerry, Paul

Week 16 (5/16) Projects

Erin, Grace, Eugenie, Lucien, Eric, Rob

Last modified: Wed May 10 23:18:45 PDT 2006