大数据分析中的算法 (2016年春季)

基本信息: (教学大纲)
课程代码：00136720 (本科生），00100863 （本研合）
课程内容：侧重数据分析中的数值代数和最优化算法
教师信息：文再文，wenzw@pku.edu.cn
助教信息：户将， jianghu@pku.edu.cn
课程微信群（下面二维码2月29日失效）

上课地点：三教301
上课时间：
- 1~16周每周周二10~12节, 18:40pm-21:30pm
- 答疑时间：单独和老师联系或者每次课后21:30pm-22:00pm
- 北京大学校历
其它信息：
参考材料：
数据分析相关的网站：
课程信息（大致安排）
- Acknowledgement: quite a few materials are taken from slides or lecture notes online.
  Please email me if the usage of any part is not appropriate and they will deleted immediately.
- 第1周，2月23日，课程简介: lecture notes
  线性规划，二次锥规划，半定规划简介: lecture notes
  read: “Convex optimization”, Stephen Boyd and Lieven Vandenberghe, chapters 2 and 3
  思考题：有哪些问题和应用可以化成线性规划，二次锥规划，半定规划？
- 第2周，3月1日，凸优化对偶理论: lecture notes
  线性规划，二次锥规划，半定规划例子: lecture notes
  模型语言： CVX, YALMIP
  LP, SOCP, SDP典型算法软件: SDPT3, MOSEK, CPLEX, GUROBI
  Prof. Boyd lecture notes on Disciplined convex programming and CVX
  “Convex optimization”, Stephen Boyd and Lieven Vandenberghe, chapters 4 and 5
- 第3周，3月8日，线性规划单纯形算法，内点算法: lecture notes
  “Numerical Optimization”, Jorge Nocedal and Stephen Wright, Springer, chapters 13 and 14
- 第4周，3月15日，压缩感知和稀疏优化基本理论: lecture notes
  references: An Introduction to Compressive Sensing
  Compressive Sensing Resources
- 第5周，3月22日，稀疏优化的算法 lecture notes
  “Proximal Algorithms”, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 1(3):123-231, 2014.
  Amir Beck, Marc Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems
  “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers”, S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Foundations and Trends in Machine Learning, 3(1):1–122, 2011
- 第6周，3月29日，推荐系统与低秩矩阵恢复的算法 lecture notes
  chapter 9 in “Mining Massive Data Sets”, Stanford University
  “Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization”, Benjamin Recht, Maryam Fazel, Pablo A. Parrilo
- 第7周，4月5日，图和网络流问题: 最短路径问题 lecture notes
  “Network Flows: Theory, Algorithms, and Applications”, Ahuja, Magnanti, and Orlin
- 第8周，4月12日，图和网络流问题: 最大流问题，组合优化与线性规划 lecture notes
  “Network Flows: Theory, Algorithms, and Applications”, Ahuja, Magnanti, and Orlin
- 第9周，4月19日，Submodular 优化与数据挖掘 lecture notes
  Learning with Submodular Functions: A Convex Optimization Perspective, Francis Bach
  Learning with Submodular Functions, Francis Bach
  EE596B - Submodular Functions, Optimization, and Applications to Machine Learning, Prof. Jeff A. Bilmes
- 第10周，4月26日，高维数据降维 lecture notes
  SVD: read section 2.5 in “Matrix Computations”, Gene H. Golub and Charles F. Van Loan, The Johns Hopkins University Press
  A Global Geometric Framework for Nonlinear Dimensionality Reduction, Joshua B. Tenenbaum, Vin de Silva, John C. Langford
  Unsupervised Learning of Image Manifolds by Semidefinite Programming, KILIAN Q. WEINBERGER AND LAWRENCE K. SAUL
  A Duality View of Spectral Methods for Dimensionality Reduction, Lin Xiao, Jun Sun, Stephen Boyd
  Dimensionality Reduction: A Comparative Review, Laurens van der Maaten, Eric Postma, Jaap van den Herik
- 第11周，5月3日，support vector machine lecture notes
  LIBLINEAR: A Library for Large Linear Classification
  LIBSVM: A library for support vector machines
  Working set selection using the second order information for training SVM R.-E. Fan, P.-H. Chen, and C.-J. Lin.
  An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, CRISTIANINI, Nello, and John SHAWE-TAYLOR, Cambridge, UK: Cambridge University Press
- 第12周，5月10日，链接分析： page rank lecture notes
  chapter 5, link analysis, Mining of Massive Datasets
- 第13周，5月17日，相位恢复以及低温电子显微镜 lecture notes
  E. J. Candes, Y. Eldar, T. Strohmer and V. Voroninski. Phase retrieval via matrix completion. SIAM J. on Imaging Sciences 6(1), 199–225.
  E. J. Candès, X. Li and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Transactions on Information Theory 61(4), 1985–2007.
  A. Singer, Y. Shkolnisky, Three-Dimensional Structure Determination from Common Lines in Cryo-EM by Eigenvectors and Semidefinite Programming, SIAM Journal on Imaging Sciences, 4 (2), pp. 543-572 (2011).
  L. Wang, A. Singer, Z. Wen, ‘‘Orientation Determination from Cryo-EM images using Least Unsquared Deviations”, SIAM Journal on Imaging Sciences, 6(4), pp. 2450–2483 (2013).
- 第14周，5月24日，随机优化算法, 特征值分解，奇异值分解的随机算法 lecture notes
  Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication, SIAM J. Comput., 36(1), 132–157
  Petros Drineas, Ravi Kannan, and Michael W. Mahoney, Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix, SIAM J. Comput., 36(1), 158–183
  N. Halko, P. G. Martinsson, and J. A. Tropp, Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions,SIAM Rev., 53(2), 217–288.
- 第15周，5月31日，并行计算、分布式计算 lecture notes
  Some backgrounds on parallel computing can found at Prof. James Demmel's course
  “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers”, S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Foundations and Trends in Machine Learning, 3(1):1–122, 2011
  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
- 第16周，6月7日，课程项目报告
- 6月11日，晚上 6:30-9:30，课程项目报告
  6月11日地点为理科1号楼1114教室
  分组情况及时间安排请查看此链接 presentation_schedule.xlsx
- 6月21日晚12点（不接受晚交报告），提交课堂报告文件和书面报告
  (包括latex源文件，程序等等打包发email给助教 (pkuopt@163.com, wendouble@gmail.com)。

作业
- 作业一: Convex Optimization 习题： 4.11, 4.12, 4.25, 4.27, Convex Optimization, Additional exercises，习题： 3.11, 3.12, 交作业时间： 3月8日课堂上交（上课前）。
- 作业二: 交作业时间： 3月22日课堂上交（上课前） Convex Optimization 习题： 5.11, 5.17, 5.19, “Numerical Optimization”, Jorge Nocedal and Stephen Wright, Springer，习题： 13.9, 14.6,
- 作业三: homework3.pdf 交作业时间： 4月5日（上课前）。
- 作业四: homework4.pdf 交作业时间： 5月10日（上课前）。
- 作业五: homework5.pdf 交作业时间： 5月17日（上课前）。

课程项目
- 课程项目一: project1.pdf 交作业时间： 4月12日北京时间晚上24点前，不接受迟交作业。
  demo: sparse_l1_example.m
- 课程项目二: 请见下面课程项目二细节
成绩评定
- 大作业，包括习题和程序： 35%
- 课程项目： 65%
- 作业要求：可以同学间讨论或者找助教答疑，但不准抄袭。计算题要求写出必要的推算步骤，证明题要写出关键推理和论证。

课程项目报告文件

课程项目二细节

分组
- 自由组合一般2至3人一组，可以一个人
- 每一个组分配一位指导老师
- 每位指导老师带2到3组学生，大三学生优先选组
选题
- 大致分为文献综述，模型项目，算法项目，理论问题
- 题目可以从 “课程项目文献” 里选择
- 题目也可以自由选择。如果该项目在其它课程里或其他老师组里同时进行，请明确说明。如果该课题是以往工作的延续，请明确说明。
- 文献综述：挑选一篇感兴趣的文献，写文献综述
- 模型项目: 挑选你感兴趣的应用问题，探索用最优化算法或数值代数算法解决。
- 算法项目: 挑选一个或者一类问题，提出新的算法或者已有算法的变形
- 理论问题：分析一些模型或算法的理论性质

指导老师

评分准则
- 评分由三部分组成：中期报告，课程课堂报告，书面报告
- 课程课堂报告：准备15分钟报告 + 5分钟提问。每个组员均需讲差不多长度的时间。成绩评定： 50%
- 书面报告：每一组提交一份ppt （课堂报告文件）和一份读书报告 (包括latex源文件 )。如果文章中有数值算例，请写程序复制数值实验结果。成绩评定：占该项目的 50%

重要日期和提交内容：
- 3月15日前，选择指导老师, 具体课题在中期报告提交前还可以修改,
- 分组情况请查看此文件
- 4月19日晚12点（不接受晚交报告）, 中期报告请于前发email给助教 (pkuopt@163.com, wendouble@gmail.com)
  中期报告不超过3页纸（单面）。需要描述你们组的选题，已经完成的工作，你们将要做哪些事情
  提交的文件名为 “proj2rep1-name1-name2.pdf”. 比如如果成员为Pengyu Qian 和 Junzi Zhang，则文件名为 proj2rep1-PengyuQian-JunziZhang.pdf
- 6月11日，课堂报告.
  分组情况及时间安排请查看此链接 presentation_schedule.xlsx
- 6月21日晚12点（不接受晚交报告），课堂报告文件和书面报告 (包括latex源文件，程序等等打包发email给助教 (pkuopt@163.com, wendouble@gmail.com)。
  提交的文件请全部打包，文件名为 “proj2-name1-name2.zip”. 比如如果成员为Pengyu Qian 和 Junzi Zhang，则文件名为 proj2-PengyuQian-JunziZhang.zip

大数据分析中的算法 (2016年春季)

课程项目二细节

课程项目文献