Assignment 2

Introduction

In this part you will get familiar with the evaluation of information retrieval by calculating precision and recall for queries and by comparing different techniques used in the search methods. Each group is given an information retrieval task on a document collection gathered from the ACM digital library database. You are also given a parser for handling the document collection.

Task

You must compare at least two of the following ranking methods, e.g. VSM vs. BM25:

Note that all of these ranking methods are already implemented in Lucene. For the comparison you must also examine the effect of using morphological analysis and stop words with the ranking methods. Run experiments with at least six different combinations (a configuration sketch follows the list), e.g.:

  • VSM with Porter stemmer and stop words
  • VSM with Porter stemmer and no stop words
  • BM25 with Porter stemmer and stop words
  • VSM with some other stemmer and stop words
  • etc. 
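In Lucene these combinations boil down to choosing an Analyzer and a Similarity. Below is a minimal sketch, assuming Lucene 8.x; the Combo class and its field names are our own illustration, not part of Lucene. EnglishAnalyzer applies the Porter stemmer, and passing it an empty stop set disables stop word removal; ClassicSimilarity is Lucene's TF-IDF vector space model and BM25Similarity its BM25 implementation.

    // Sketch of the analyzer/similarity combinations, assuming Lucene 8.x.
    // The Combo class itself is just an illustration for this assignment.
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;            // uses the Porter stemmer
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.ClassicSimilarity;  // TF-IDF VSM
    import org.apache.lucene.search.similarities.Similarity;

    class Combo {
        final String name;
        final Analyzer analyzer;
        final Similarity similarity;

        Combo(String name, Analyzer analyzer, Similarity similarity) {
            this.name = name;
            this.analyzer = analyzer;
            this.similarity = similarity;
        }

        static Combo[] all() {
            return new Combo[] {
                new Combo("VSM + Porter + stop words",
                          new EnglishAnalyzer(), new ClassicSimilarity()),
                new Combo("VSM + Porter + no stop words",
                          new EnglishAnalyzer(CharArraySet.EMPTY_SET), new ClassicSimilarity()),
                new Combo("BM25 + Porter + stop words",
                          new EnglishAnalyzer(), new BM25Similarity()),
                // ... the remaining combinations analogously
            };
        }
    }

Note that the same Analyzer must be used both when indexing and when parsing the queries, and the same Similarity must be set on both the IndexWriterConfig and the IndexSearcher; otherwise the scores are not comparable.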

When you have a comparison scenario, you should implement a Java program which uses Lucene to perform the search on the given document collection with your search task (the program from part 1 might be of help here). Each comparison scenario has a corresponding number in the document collection file; for example, in comparison scenario 1, Motion control, all relevant documents are marked with search task number 1. Focus on the items that are marked with your search task number. If you want to do extra analysis, or you have trouble finding differences between the ranking methods, you can also run queries on the whole document collection.

Documents in the document collection are presented as <item>-elements, and the actual query that was used to find the document in the ACM database is presented in the <query>-element. The queries in the <query>-elements should work well for document ranking, but you must also come up with some queries of your own. Use at least 2-3 queries in your comparison so that you can average over your results. The relevance of each document to the given comparison scenario is also pre-evaluated and presented as a boolean in the <relevance>-element.

The program should contain a main method which indexes the abstracts of the document collection, performs the search, and prints the results to standard output. The path to the XML file containing the document collection is passed to the main method as a command-line argument. You can decide whether to use a file-based or memory-based search index. The queries for the search task can be hard-coded in the program.
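A minimal sketch of such a main method follows, again assuming Lucene 8.x and the Combo class from the earlier sketch. The Item class and parseItems() below are hypothetical stand-ins for whatever the provided parser actually returns; adapt them to its real API.

    // Sketch of the main program: index abstracts in memory, search, print results.
    // Assumes Lucene 8.x; Item and parseItems() are placeholders for the parser
    // provided with the assignment.
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class Assignment2 {

        // Placeholder for what the provided parser might return.
        static class Item { String abstractText; boolean relevant; }
        static List<Item> parseItems(String path) { /* call the provided parser */ return List.of(); }

        public static void main(String[] args) throws Exception {
            String xmlPath = args[0];          // path to the document collection XML
            Combo combo = Combo.all()[0];      // one analyzer/similarity combination

            // Index the abstracts into a memory-based index.
            Directory dir = new ByteBuffersDirectory();
            IndexWriterConfig cfg = new IndexWriterConfig(combo.analyzer);
            cfg.setSimilarity(combo.similarity);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                for (Item item : parseItems(xmlPath)) {
                    Document doc = new Document();
                    doc.add(new TextField("abstract", item.abstractText, Field.Store.YES));
                    doc.add(new StoredField("relevant", item.relevant ? "1" : "0"));
                    writer.addDocument(doc);
                }
            }

            // Search with a hard-coded query and print the ranked result list.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                searcher.setSimilarity(combo.similarity);  // must match the indexing similarity
                QueryParser parser = new QueryParser("abstract", combo.analyzer);
                ScoreDoc[] hits = searcher.search(parser.parse("motion control"), 50).scoreDocs;
                for (int rank = 0; rank < hits.length; rank++) {
                    Document doc = searcher.doc(hits[rank].doc);
                    System.out.printf("%d\t%.4f\trelevant=%s%n",
                            rank + 1, hits[rank].score, doc.get("relevant"));
                }
            }
        }
    }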

Based on the results you get from the program (e.g. the ranked search result lists), you are expected to produce averaged 11-point precision-recall curves for each of the compared combinations. For this, study the lecture 6 slides and the course book chapter on the evaluation of ranked retrieval results. Because there are many combinations, please draw multiple curves per figure.
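For reference, here is a minimal sketch of the 11-point interpolated precision computation for one ranked result list, following the standard definition from the course book: the interpolated precision at recall level r is the maximum precision at any recall level greater than or equal to r.

    // Sketch: 11-point interpolated precision for a single ranked result list.
    // relevant[i] tells whether the hit at rank i+1 is relevant; totalRelevant is
    // the number of relevant documents for the query in the whole collection.
    static double[] elevenPointCurve(boolean[] relevant, int totalRelevant) {
        double[] precisionAtRank = new double[relevant.length];
        double[] recallAtRank = new double[relevant.length];
        int hits = 0;
        for (int i = 0; i < relevant.length; i++) {
            if (relevant[i]) hits++;
            precisionAtRank[i] = hits / (double) (i + 1);
            recallAtRank[i] = hits / (double) totalRelevant;
        }
        double[] curve = new double[11];
        for (int level = 0; level <= 10; level++) {
            double r = level / 10.0;
            double best = 0.0;
            // Interpolated precision: maximum precision at any recall >= r.
            for (int i = 0; i < relevant.length; i++) {
                if (recallAtRank[i] >= r) best = Math.max(best, precisionAtRank[i]);
            }
            curve[level] = best;
        }
        return curve;
    }

Averaging is then done pointwise over your 2-3 queries: for each recall level, take the mean of the interpolated precision values across the queries, which yields one averaged curve per combination.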

Feel free to also use other statistical/quantitative methods to analyze the data as you see fit. You should also conduct some qualitative analysis, e.g.: can the search results be characterized somehow, did the document collection contain any anomalies, why did the chosen methods produce good/bad results, did the ranking work?


The course has a FAQ regarding the course assignments. You can post questions directly to the FAQ or send them to the course email, from which we will add them to the wiki.

Tools

Refer to the assignment 1 instructions for Lucene-related information.

Report

Finally, you must report your work in an 8-12 page document, using the Aalto LaTeX template. Use the publication type "report". The report can be written in English or Finnish and must include at least the following parts:

  • course code and course name, topic of the work, names and student IDs of the group members
  • introduction
  • description of the compared techniques in the search method
  • evaluation of the information retrieval with the compared techniques (precision recall curves, etc.)
  • conclusions / discussion
  • references
  • description of the contributions of each group member: did you have specific roles in the collaboration, which tasks did you do together, and which tasks were divided among the members to be done individually

Submission

When you are finished: the maximum number of uploaded files is set to one, so zip up your Java project and add the report as a PDF file to the zip archive. If your code uses external libraries other than the ones linked to in this or previous parts, you should attach them as well. Only one group member should submit the report.

Late submissions: minus 1 point per day. Force majeure occurrences: contact the course mailing list.

Grading

The grading is based on the quality of the report, the correctness of the results, the quality of the program code (e.g. the structure of your program, code comments), and the overall impression. It is not necessary to evaluate extra scenarios or use extra statistical/quantitative methods to get good or maximum points, but of course it can be easier to draw conclusions and write the discussion part of the report if you do some extra work. Focus your effort on producing correct results and documenting them well. The report should tell a consistent story of what you are trying to do, why, how you did it, what the result is, and what your conclusions are.