Hua Wu


Chair of Baidu Technical Committee

Email: wu_hua@baidu.com
Address: Baidu Technology Park Building No. 1, No. 10 Xibeiwang East Road, Haidian District, Beijing, 100093, China

I joined Baidu in 2010 and am now the technical leader of Baidu's NLP department and knowledge graph department. Before joining Baidu, I worked at Toshiba (China) R&D Center and Microsoft Research Asia (MSRA). I obtained my Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2001.

My research interests include dialogue systems, machine translation, natural language processing, and knowledge graphs.

News

  • We are hiring (both interns and employees)! Please send me an email with your resume if you are interested in working with us on NLP problems, including but not limited to Dialogue Systems, Machine Translation, Question Answering, Distributed Representation, Generation, and Knowledge Graph. Experience with machine learning (including but not limited to deep learning) for NLP is preferred.
  • We are organizing the Workshop on Simultaneous Translation (2021, 2020), which includes a shared task on Chinese-English and English-Spanish simultaneous translation.
  • Our PLATO-2 model ranked first in DSTC9 Tracks 1, 2 and 3.
  • We launched LUGE (Language Understanding and Generation Evaluation Benchmarks) for Chinese NLP, which aims to provide researchers with a variety of datasets and evaluations and to jointly promote progress in Chinese NLP technology. A recent introduction is available here (in Chinese). If you are interested in LUGE or in sharing datasets, please contact me.

Professional Activities

  • Program co-chair of AACL 2020, ACL 2014
  • Area chair or SPC for ACL, IJCAI and AAAI
  • Co-organizer of the first Workshop on Automatic Simultaneous Translation 2020
  • Co-organizer of the ICDAR Workshop of Document Image and Language 2021

Research

Open-Domain Dialogue Systems

The aim of an open-domain dialogue system is to make machines capable of chatting, answering questions, and completing tasks, and to give them the ability to learn rapidly and evolve continuously. Its core competencies are as follows:
  • Understanding: understand natural languages
  • Expression: express in fluent natural languages
  • Emotion: understand emotions and respond with appropriate emotions
  • Thinking: context-based calculation, reasoning and decision making
  • Learning: capable of learning and evolution
Building such a system is not easy. Several fundamental problems must be solved: dialogue-oriented knowledge representation, knowledge-grounded policy learning, and knowledge-grounded response generation. To approach this target, we have conducted research in the following directions:
  • Large-scale pre-trained response generation model
    Based on the available large-scale open-domain conversation data, we pre-trained a response generation model, PLATO-2, via curriculum learning. We have released our English models and source code on GitHub. PLATO-2 ranked first in the DSTC 9 Track 1, Track 2, and Track 3 shared tasks.
  • Knowledge-grounded policy learning and response generation
    We leverage graphs to guide policy learning. Different kinds of graphs are used, including knowledge graphs, conversation graphs constructed from query logs, and event graphs constructed from stories. Several papers on this work were published at AAAI 2020, ACL 2020, and IJCAI 2020.
  • Datasets for knowledge-grounded dialogue system
    DuConv: This corpus is designed to facilitate research on building a human-like conversational agent by endowing it with the ability to proactively lead the conversation. In DuConv, one participant acts as the conversation leader and the other acts as the follower. The leader is provided with a knowledge graph and asked to change the discussion topics sequentially, following the given conversation goal, while keeping the dialogue as natural and engaging as possible. DuConv poses a very challenging task, as the model needs both to understand the dialogue and to plan over the given knowledge graph. This dataset contains about 270k utterances and 30k dialogues.

    DuRecDial: This corpus is designed to facilitate conversational recommendation over multi-type dialogs, where the bots can proactively and naturally lead a conversation from a non-recommendation dialog (e.g., QA) to a recommendation dialog, considering user’s interests and feedback. DuRecDial contains about 10k dialogs, 156k utterances. In each dialog, the recommender proactively leads a multi-type dialog to approach recommendation targets and then makes multiple recommendations with rich interaction behavior. This dataset allows us to systematically investigate different parts of the overall problem, e.g., how to naturally lead a dialog, how to interact with users for recommendation.

Machine Translation

Since 2010, we have been working on an online machine translation product named Baidu Translate, which translates among 203 languages. In 2011, we launched our statistical machine translation service. In May 2015, we launched the world’s first neural machine translation service. Besides text translation, Baidu Translate supports speech-to-speech translation, simultaneous translation, and OCR/image translation.
  • Simultaneous Translation
    We co-organized the first Workshop on Automatic Simultaneous Translation 2020, where we released the first Chinese-English simultaneous translation dataset, which contains about 70 hours of Chinese speech audio, human transcripts, ASR results and English translations. To trade off translation quality against translation efficiency, we proposed several methods, including wait-k and an adaptive segmentation method based on meaningful units.
  • Multilingual Translation
    Most language pairs, such as Chinese-Spanish, Chinese-Japanese, and Chinese-Thai, suffer from data sparseness. Besides pivot language approaches, we proposed a one-to-many translation method in 2015, which shares the source language encoder and uses an individual decoder for each target language.
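
As a rough illustration of the wait-k idea mentioned under Simultaneous Translation, the sketch below generates the order of READ and WRITE actions: the system first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted. The function name `wait_k_schedule` and its parameters are hypothetical and chosen only for this example; it models the action schedule alone, not the translation model or the production system.

```python
from typing import List, Tuple


def wait_k_schedule(source_len: int, target_len: int, k: int) -> List[Tuple[str, int]]:
    """Return the READ/WRITE action sequence of a wait-k style policy.

    Each action is (name, index), where index is the 1-based position of the
    source token just read or the target token just written.
    (Hypothetical helper for illustration only.)
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    actions: List[Tuple[str, int]] = []
    read = 0     # number of source tokens consumed so far
    written = 0  # number of target tokens emitted so far

    while written < target_len:
        # Before emitting the next target token, the policy wants
        # (written + k) source tokens, capped at the source length.
        needed = min(written + k, source_len)
        while read < needed:
            read += 1
            actions.append(("READ", read))
        written += 1
        actions.append(("WRITE", written))

    return actions


if __name__ == "__main__":
    # Compact view of the schedule for a 4-token source, 4-token target, k = 2.
    print(" ".join(name[0] for name, _ in wait_k_schedule(4, 4, 2)))
    # prints: R R W R W R W W
```

A smaller k starts writing earlier and so lowers latency, while a larger k gives the model more source context before each output token.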

Pre-trained Model: ERNIE

    Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0, which incrementally builds pre-training tasks and then learns pre-trained models on these constructed tasks via continual multi-task learning. Based on this framework, we construct several tasks and train the ERNIE 2.0 model to capture lexical, syntactic and semantic aspects of information in the training data. Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks, including English tasks on the GLUE benchmark and several similar tasks in Chinese. The source code and pre-trained models have been released.

Question Answering and Machine Reading Comprehension

    We developed Question Answering and Machine Reading Comprehension methods, which are used in the Baidu search engine. Recently, we proposed RocketQA, an optimized training approach to dense passage retrieval for open-domain question answering. RocketQA ranked first on the leaderboard of the MS MARCO Passage Ranking task. We released a Chinese dataset named DuReader_robust for evaluating the robustness of machine reading comprehension models, and we hosted a shared task based on DuReader_robust [Data&Code, Leaderboard].
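
For readers unfamiliar with dense passage retrieval, the following minimal sketch shows only the generic scoring step: the query and each passage are represented as vectors, and passages are ranked by their dot product with the query. The vectors here are toy values assumed to come from some encoder, and `rank_passages` is a hypothetical helper. This is not the RocketQA training procedure, which concerns how the encoders are trained.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    top_n: int,
) -> List[Tuple[int, float]]:
    """Return the top_n (passage_index, score) pairs, highest score first.

    Hypothetical helper: the embeddings are assumed to be precomputed.
    """
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    query = [1.0, 0.0]                               # toy query embedding
    passages = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]  # toy passage embeddings
    print(rank_passages(query, passages, top_n=2))
    # prints: [(1, 2.0), (2, 1.0)]
```

In practice the passage vectors are computed offline and searched with an approximate nearest-neighbor index rather than the exhaustive loop used here.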

Papers [Google Scholar]

Last updated Dec 20 2020 (This template was originally designed by Mu Li.)