Healthcare Data Analytics Challenge


There are many patient/caregiver support forums where patients and caregivers can post questions about their health conditions. Some examples include cancer compass, ehealthforums, patientslikeme, etc. Many of these forums contain a significant number of repetitive questions: questions that are very similar in meaning (i.e., have high semantic similarity) but are worded differently. One possible reason for this repetition is that as the forums grow longer, patients and caregivers do not have the time or patience to read through previous questions before posing their own.

In this context, a system that can point a patient/caregiver posting a question on a support forum to semantically similar questions that were previously posted on the forum will be immensely helpful to patients and caregivers. This challenge aims to design foundational techniques towards building such a system.

Challenge Specification

Consider a corpus of questions Q = {q1, q2, q3, …, qn} from a patient support forum, where each qi represents a question from a patient forum on Type II Diabetes. For simplicity, all the questions in the corpus are assumed to pertain to the same disease/chronic health condition (e.g., diabetes). Furthermore, each question is assumed to have a unique ID. The order of questions in the corpus is assumed to be arbitrary. Let IQ = {iq1, iq2, …, iqm} be the set of incoming questions (iq1 being the first incoming question and so on).

The challenge is to design and implement a system that, for each incoming query iqj, identifies at most three of the most similar questions from the corpus Q. In other words, for each iqj, your system has to retrieve three or fewer queries from Q that are semantically most similar to iqj. For each iqj, the output from your system should be a set of three or fewer question IDs from Q. It is not necessary to rank the retrieved results (i.e., the systems will only be evaluated for containment and not for ranking of the results).
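As an illustration of the expected input and output, the following is a minimal baseline sketch, not a recommended or official approach. It assumes the corpus is available as a Python dictionary mapping question IDs to question text, and it measures similarity with TF-IDF weighted cosine similarity over words, which captures lexical overlap only and not deeper semantics. The function names (`build_index`, `top_similar`, etc.) and the sample questions are hypothetical and are not part of the challenge data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weigh(tokens, idf):
    """Turn a token list into a sparse TF-IDF vector; unknown terms are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def build_index(corpus):
    """corpus: dict mapping question ID -> question text (assumed layout).

    Returns (idf, vectors), where vectors maps each ID to a TF-IDF dict.
    """
    tokens = {qid: tokenize(text) for qid, text in corpus.items()}
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    n = len(corpus)
    # Smoothed IDF: terms that occur in every question still get a positive weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {qid: weigh(toks, idf) for qid, toks in tokens.items()}
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_similar(incoming, idf, vectors, k=3):
    """Return a set of at most k question IDs with nonzero similarity."""
    query = weigh(tokenize(incoming), idf)
    scored = [(cosine(query, vec), qid) for qid, vec in vectors.items()]
    scored = [(s, qid) for s, qid in scored if s > 0]
    # Highest similarity first; ties are broken by question ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {qid for _, qid in scored[:k]}


if __name__ == "__main__":
    # Invented sample questions, for illustration only.
    corpus = {
        "q1": "What should my blood sugar be after meals?",
        "q2": "Is metformin safe during pregnancy?",
        "q3": "Can I exercise with high blood sugar?",
    }
    idf, vectors = build_index(corpus)
    print(top_similar("Normal blood sugar level after eating meals", idf, vectors))
```

The result is an unordered set because the challenge evaluates containment rather than ranking. A competitive system would likely replace the word-overlap similarity with a measure that handles paraphrases, for example by drawing on ontologies or other external resources.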

You can use any external data in your approach (ontologies etc.) provided you reveal all the data sources used in your approach.

Test Data Sets and System Evaluation

Systems submitted for this challenge will be evaluated and compared with respect to their effectiveness in retrieving the most similar queries from the corpus. Although the systems will not be evaluated for speed, they are expected to complete execution in a reasonable time frame (~5 minutes for a data set with |Q| ≈ 100 and |IQ| ≈ 10).

The evaluation will use datasets curated by domain experts. The query corpus Q and the incoming query set IQ will be derived from real patient support forums on Type II Diabetes. For each query in IQ, domain experts will identify up to three most similar queries from Q. Systems will be evaluated based on the percentage of expert-identified results that are included in the system-identified result set. For instance, suppose the domain expert identified q6 and q37 as the most similar queries to the incoming query iq1, and a system identified q37, q52 and q74 as the most similar queries. This system will receive 50 points for iq1, because one of the two expert-identified queries (q37) appears in its results. The system with the maximum number of total points (over all queries in IQ) will win.
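The scoring rule described above can be sketched as follows. This is an illustrative reading of the rule, not the official evaluation script. The function names are hypothetical, and giving zero points for a query with no expert-identified results is an assumption, since the text does not cover that case.

```python
def query_score(expert_ids, system_ids):
    """Points for one incoming query: the percentage of expert-identified
    question IDs that also appear in the system's result set."""
    expert = set(expert_ids)
    if not expert:
        # Assumption: a query with no expert-identified results earns no points.
        return 0.0
    hits = expert & set(system_ids)
    return 100.0 * len(hits) / len(expert)


def total_score(expert_results, system_results):
    """Sum the per-query points over all incoming queries.

    Both arguments map an incoming query ID to a collection of question IDs.
    A query missing from system_results is treated as an empty result set.
    """
    return sum(
        query_score(expert, system_results.get(iq, ()))
        for iq, expert in expert_results.items()
    )


if __name__ == "__main__":
    # The example from the text: the expert chose q6 and q37,
    # and the system returned q37, q52 and q74.
    print(query_score({"q6", "q37"}, {"q37", "q52", "q74"}))  # 50.0
```

Because only containment counts, the order in which a system returns its three or fewer IDs does not affect the score.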


  1. Each participating group can have at most 5 people.
  2. Each person can be part of no more than 2 groups.
  3. Systems can be developed in one of the standard programming languages (C, C++, Java, Python). If you intend to use any other language, you will need to obtain prior permission from the ICHI 2015 challenge chair.
  4. You can use databases, ontologies, dictionaries, etc. in your system provided you reveal all such knowledge bases that you have used.
  5. We will make one dataset available on our website by June 10, 2015. You can use this data set for development, tuning and testing of your system. This dataset will be used in determining the finalists of the grand challenge (see below).
  6. You will need to submit the following as a part of the package: (1) A paper (two pages, IEEE conference proceedings format) outlining your approach; (2) Source code of your system; (3) Any databases, ontologies, etc. that are required for your system to function; (4) A readme file containing the names of the group members and clear instructions on how to compile, deploy and run your system; (5) The results of your system on the dataset that is made available on the website.
  7. Based on the results on the dataset that is made available, we will select up to six finalists. The finalists are expected to demo their system during a dedicated session in the conference and give a 5-minute presentation outlining their approach. The papers from the finalist groups will be included in the conference proceedings. At least one member of each group is required to register for the conference and be present at the conference.
  8. During the conference, the finalist systems will be evaluated on a data set that has not been shared with the participants before. Results from this data set will decide the final winners of the challenge.

Important Dates:

Grand Challenge Solution Submission Deadline: July 17, 2015 (originally July 10, 2015).

Finalist Decision Notification: August 10, 2015.

Camera Ready Papers Due from Finalists: August 21, 2015 (originally August 15, 2015).

Please submit electronically via the EasyChair system. (Please select the track for Data Analytics Challenge.)

Please contact the Healthcare Data Analytics Challenge Chair, Dr. Lakshmish Ramaswamy, if you have any questions or need additional information about the challenge.

ICHI 2015 © 2015