{"title": "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 11674, "page_last": 11685, "abstract": "Stochastic gradient descent with a large initial learning rate is widely used for training modern neural net architectures. Although a small initial learning rate allows for faster training and better test performance initially, the large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two layer network trained with large initial learning rate and annealing provably generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes low-noise, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorizable by a model with small initial learning rate, but ignored by the model with large learning rate until after annealing. Our experiments show that this causes the small learning rate model's accuracy on unmodified images to suffer, as it relies too much on the patch early on.", "full_text": "TowardsExplainingtheRegularizationEffectofInitialLargeLearningRateinTrainingNeuralNetworksYuanzhiLiMachineLearningDepartmentCarnegieMellonUniversityyuanzhil@andrew.cmu.eduColinWeiComputerScienceDepartmentStanfordUniversitycolinwei@stanford.eduTengyuMaComputerScienceDepartmentStanfordUniversitytengyuma@stanford.eduAbstractStochasticgradientdescentwithalargeinitiallearningrateiswidelyusedfortrainingmodernneuralnetarchitectures.Althoughasmallinitiallearningrateallowsforfastertrainingandbettertestperformanceinitially,thelargelearningrateachievesbettergeneralizationsoonafterthelearningrateisannealed.Towardsexplainingthisphenomenon,wedeviseasettinginwhichwecanprovethatatwolayernetworktrainedwithlargeinitiallearningrateandannealingprovablygeneralizesbetterthanthesamenetworktrainedwithasmalllearningratefromthestart.Thekeyinsightinouranalysisisthattheorderoflearningdifferenttypesofpatternsiscrucial:becausethesmalllearningratemodel\ufb01rstmemorizeseasy-to-generalize,hard-to-\ufb01tpatterns,itgeneralizesworseonhard-to-generalize,easier-to-\ufb01tpatternsthanitslargelearningratecounterpart.Thisconcepttranslatestoalarger-scalesetting:wedemonstratethatonecanaddasmallpatchtoCIFAR-10imagesthatisimmediatelymemorizablebyamodelwithsmallinitiallearningrate,butignoredbythemodelwithlargelearningrateuntilafterannealing.Ourexperimentsshowthatthiscausesthesmalllearningratemodel\u2019saccuracyonunmodi\ufb01edimagestosuffer,asitreliestoomuchonthepatchearlyon.1IntroductionItisacommonlyacceptedfactthatalargeinitiallearningrateisrequiredtosuccessfullytrainadeepnetworkeventhoughitslowsdownoptimizationofthetrainloss.Modernstate-of-the-artarchitecturestypicallystartwithalargelearningrateandannealitatapointwhenthemodel\u2019s\ufb01ttothetrainingdataplateaus[25,32,17,42].Meanwhile,modelstrainedusingonlysmalllearningrateshavebeenfoundtogeneralizepoorlydespiteenjoyingfasteroptimizationofthetrainingloss.Anumberofpapershaveproposedexplanationsforthisphenomenon,suchassharpnessofthelocalminima[22,20,24],the
the time it takes to move from initialization [18, 40], and the scale of SGD noise [38]. However, we still have a limited understanding of a surprising and striking part of the large learning rate phenomenon: from looking at the section of the accuracy curve before annealing, it would appear that a small learning rate model should outperform the large learning rate model in both training and test error. Concretely, in Fig. 1, the model trained with small learning rate outperforms the large learning rate until epoch 60, when the learning rate is first annealed. Only after annealing does the large learning rate visibly outperform the small learning rate in terms of generalization.

Figure 1: CIFAR-10 accuracy vs. epoch for WideResNet with weight decay, no data augmentation, and initial learning rate of 0.1 vs. 0.01. Gray represents the annealing time. Left: Train. Right: Validation.

In this paper, we propose to theoretically explain this phenomenon via the concept of learning order of the model, i.e., the rates at which it learns different types of examples. This is not a typical concept in the generalization literature: learning order is a training-time property of the model, but most analyses only consider post-training properties such as the classifier's complexity [8] or the algorithm's output stability [9]. We will construct a simple distribution for which the learning order of a two-layer network trained under large and small initial learning rates determines its generalization.

Informally, consider a distribution over training examples consisting of two types of patterns ("pattern" refers to a grouping of features). The first type consists of a set of easy-to-generalize (i.e., discrete) patterns of low cardinality that is difficult to fit using a low-complexity classifier, but easily learnable via complex classifiers such as neural networks. The second type of pattern will be learnable by a low-complexity classifier, but is inherently noisy, so it is difficult for the classifier to generalize. In our case, the second type of pattern requires more samples to correctly learn than the first type. Suppose we have the following split of examples in our dataset (a concrete code sketch of this split appears below, after the informal theorems):

20% containing only easy-to-generalize and hard-to-fit patterns
20% containing only hard-to-generalize and easy-to-fit patterns
60% containing both pattern types    (1.1)

The following informal theorems characterize the learning order and generalization of the large and small initial learning rate models. They are a dramatic simplification of our Theorems 3.4 and 3.5 meant only to highlight the intuitions behind our results.

Theorem 1.1 (Informal, large initial LR + anneal). There is a dataset with size N of the form (1.1) such that with a large initial learning rate and noisy gradient updates, a two layer network will: 1) initially only learn hard-to-generalize, easy-to-fit patterns from the 0.8N examples containing such patterns; 2) learn easy-to-generalize, hard-to-fit patterns only after the learning rate is annealed. Thus, the model learns hard-to-generalize, easily fit patterns with an effective sample size of 0.8N and still learns all easy-to-generalize, hard-to-fit patterns correctly with 0.2N samples.

Theorem 1.2 (Informal, small initial LR). In the same setting as above, with small initial learning rate the network will: 1) quickly learn all easy-to-generalize, hard-to-fit patterns; 2) ignore hard-to-generalize, easily fit patterns from the 0.6N examples containing both pattern types, and only learn them from the 0.2N examples containing only hard-to-generalize patterns. Thus, the model learns hard-to-generalize, easily fit patterns with a smaller effective sample size of 0.2N and will perform relatively worse on these patterns at test time.

Together, these two theorems can justify the phenomenon observed in Figure 1 as follows: in a real-world network, the large learning rate model first learns hard-to-generalize, easier-to-fit patterns and is unable to memorize easy-to-generalize, hard-to-fit patterns, leading to a plateau in accuracy. Once the learning rate is annealed, it is able to fit these patterns, explaining the sudden spike in both train and test accuracy. On the other hand, because of the low amount of SGD noise present in easy-to-generalize, hard-to-fit patterns, the small learning rate model quickly overfits to them before fully learning the hard-to-generalize patterns, resulting in poor test error on the latter type of pattern.
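As a concrete illustration, the split in (1.1) can be assembled in a few lines of numpy. This is a minimal sketch for intuition only: sample_easy_to_fit and sample_hard_to_fit are hypothetical placeholders for the two pattern distributions (formalized as P and Q in Section 2), and a zero block marks an absent pattern as in the formal data model.

```python
import numpy as np

def make_dataset(N, d, sample_easy_to_fit, sample_hard_to_fit, rng):
    """Assemble N examples with the 20% / 20% / 60% split from (1.1).

    sample_easy_to_fit(y) and sample_hard_to_fit(y) are hypothetical samplers
    for the two pattern types; a zero block marks an absent pattern.
    """
    X, Y = [], []
    for _ in range(N):
        y = rng.choice([-1, 1])                # uniform label
        u = rng.random()
        x1, x2 = np.zeros(d), np.zeros(d)      # x1: easy-to-fit, x2: hard-to-fit
        if u < 0.2:                            # easy-to-generalize, hard-to-fit only
            x2 = sample_hard_to_fit(y)
        elif u < 0.4:                          # hard-to-generalize, easy-to-fit only
            x1 = sample_easy_to_fit(y)
        else:                                  # both pattern types (60%)
            x1, x2 = sample_easy_to_fit(y), sample_hard_to_fit(y)
        X.append(np.concatenate([x1, x2]))
        Y.append(y)
    return np.stack(X), np.array(Y)
```

Theorems 1.1 and 1.2 are then statements about which of the two blocks the trained network fits first.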
Both intuitively and in our analysis, the non-convexity of neural nets is crucial for the learning-order effect to occur. Strongly convex problems have a unique minimum, so what happens during training does not affect the final result. On the other hand, we show the non-convexity causes the learning order to highly influence the characteristics of the solutions found by the algorithm.

In Section F.1, we propose a mitigation strategy inspired by our analysis. In the same setting as Theorems 1.1 and 1.2, we consider training a model with small initial learning rate while adding noise before the activations which gets reduced by some constant factor at some particular epoch in training. We show that this algorithm provides the same theoretical guarantees as the large initial learning rate, and we empirically demonstrate the effectiveness of this strategy in Section 6. In Section 6 we also empirically validate Theorems 1.1 and 1.2 by adding an artificial memorizable patch to CIFAR-10 images, in a manner inspired by (1.1).

1.1 Related Work

The question of training with larger batch sizes is closely tied with learning rate, and many papers have empirically studied large batch / small LR phenomena [22, 18, 35, 34, 11, 41, 16, 38], particularly focusing on vision tasks using SGD as the optimizer.¹ Keskar et al. [22] argue that training with a large batch size or small learning rate results in sharp local minima. Hoffer et al. [18] propose training the network for longer and with larger learning rate as a way to train with a larger batch size. Wen et al. [38] propose adding Fisher noise to simulate the regularization effect of small batch size.

¹While these papers are framed as a study of large-batch training, a number of them explicitly acknowledge the connection between large batch size and small learning rate.

Adaptive gradient methods are a popular method for deep learning [14, 43, 37, 23, 29] that adaptively choose different step sizes for different parameters. One motivation for these methods is reducing the need to tune learning rates [43, 29]. However, these methods have been observed to hurt generalization performance [21, 10], and modern architectures often achieve the best results via SGD and hand-tuned learning rates [17, 42]. Wilson et al. [39] construct a toy example for which ADAM [23] generalizes provably worse than SGD. Additionally, there are several alternative learning rate schedules proposed for SGD, such as warm restarts [28] and cyclical learning rates [33]. Ge et al. [15] analyze the exponentially decaying learning rate and show that its final iterate achieves optimal error in stochastic optimization settings, but they only analyze convex settings.

There are also several recent works on implicit regularization of gradient descent that establish convergence to some idealized solution under particular choices of learning rate [27, 36, 1, 7, 26]. In contrast to our analysis, the generalization guarantees from these works would depend only on the complexity of the final output and not on the order of learning.

Other recent papers have also studied the order in which deep networks learn certain types of examples. Mangalam and Prabhu [30] and Nakkiran et al. [31] experimentally demonstrate that deep networks may first fit examples learnable by "simpler" classifiers. For our construction, we prove that the neural net with large learning rate follows this behavior, initially learning a classifier on linearly separable examples and learning the remaining examples after annealing. However, the phenomenon that we analyze is also more nuanced: with a small learning rate, we prove that the model first learns a complex classifier on low-noise examples which are not linearly separable.

Finally, our proof techniques and intuitions are related to recent literature on global convergence of gradient descent for over-parametrized networks [6, 12, 13, 1, 5, 7, 4, 26, 2]. These works show that gradient descent learns a fixed kernel related to the initialization under sufficient over-parameterization. In our analysis, the underlying kernel is changing over time. The amount of noise due to SGD governs the space of possible learned kernels, and as a result, regularizes the order of learning.
2 Setup and Notations

Data distribution. We formally introduce our data distribution, which contains examples supported on two types of components: a P component meant to model hard-to-generalize, easier-to-fit patterns, and a Q component meant to model easy-to-generalize, hard-to-fit patterns (see the discussion in our introduction). Formally, we assume that the label y has a uniform distribution over {-1, 1}, and the data x is generated as follows:

Conditioned on the label y,    (2.1)
with probability p₀: x₁ ~ P_y and x₂ = 0    (2.2)
with probability q₀: x₁ = 0 and x₂ ~ Q_y    (2.3)
with probability 1 - p₀ - q₀: x₁ ~ P_y and x₂ ~ Q_y    (2.4)

where P₋₁, P₁ are assumed to be two half Gaussian distributions with a margin γ₀ between them:

x₁ ~ P₁ ⇔ x₁ = γ₀w⋆ + z | ⟨w⋆, z⟩ ≥ 0, where z ~ N(0, I_{d×d}/d)
x₁ ~ P₋₁ ⇔ x₁ = -γ₀w⋆ + z | ⟨w⋆, z⟩ ≤ 0, where z ~ N(0, I_{d×d}/d)

Therefore, we see that when x₁ is present, the linear classifier sign(⟨w⋆, x₁⟩) can classify the example correctly with a margin of γ₀. To simplify the notation, we assume that γ₀ = 1/√d and w⋆ ∈ R^d has a unit ℓ₂ norm. Intuitively, P is linearly separable, thus learnable by low complexity (e.g. linear) classifiers. However, because of the dimensionality, P has high noise and requires a relatively large sample complexity to learn.

The distributions Q₋₁ and Q₁ are supported only on three distinct directions z - ζ, z and z + ζ with some random scaling α, and are thus low-noise and memorizable. Concretely, z - ζ and z + ζ have negative labels and z has positive labels:

x₂ ~ Q₁ ⇔ x₂ = αz with α ~ [0, 1] uniformly
x₂ ~ Q₋₁ ⇔ x₂ = α(z + bζ) with α ~ [0, 1], b ~ {-1, 1} uniformly    (2.5)

Here for simplicity, we take z to be a unit vector in R^d. We assume ζ ∈ R^d has norm ‖ζ‖₂ = r and ⟨z, ζ⟩ = 0. We will assume r ≪ 1 so that z + ζ, z, z - ζ are fairly close to each other. We depict z - ζ, z, z + ζ in Figure 2. We choose this type of Q to be the easy-to-generalize, hard-to-fit pattern. Note that z is not linearly separable from z + ζ, z - ζ, so non-linearity is necessary to learn Q. On the other hand, it is also easy for high-complexity models such as neural networks to memorize Q with relatively small sample complexity.

Figure 2: A visualization of the vectors z, z - ζ, and z + ζ used to define the distribution Q in 2 dimensions. z ± ζ will have label -1 and z has label +1. Note that the norm of ζ is much smaller than the norm of z.

Memorizing Q with a two-layer net. It is easy for a two-layer relu network to memorize the labels of x₂ using two neurons with weights w, v such that ⟨w, z⟩ < 0, ⟨w, z - ζ⟩ > 0 and ⟨v, z⟩ < 0, ⟨v, z + ζ⟩ > 0. In particular, we can verify that -[⟨w, x₂⟩]₊ - [⟨v, x₂⟩]₊ will output a negative value for x₂ ∈ {z - ζ, z + ζ} and a zero value for x₂ = z. Thus, choosing a small enough ρ > 0, the classifier -[⟨w, x₂⟩]₊ - [⟨v, x₂⟩]₊ + ρ gives the correct sign for the label y.

We assume that we have a training dataset with N examples {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)} drawn i.i.d. from the distribution described above. We use p and q to denote the empirical fractions of data points that are drawn from equations (2.2) and (2.3).
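In code, the component distributions translate directly. The following numpy sketch of samplers for P_y and Q_y is illustrative rather than the authors' implementation; it uses the stated choice γ₀ = 1/√d and a sign flip to realize the half-Gaussian conditioning, which is exact by the symmetry of N(0, I/d).

```python
import numpy as np

def sample_P(y, w_star, d, rng):
    """x1 ~ P_y: a half Gaussian at margin gamma_0 = 1/sqrt(d) from w_star."""
    z = rng.normal(0.0, 1.0 / np.sqrt(d), size=d)  # z ~ N(0, I/d)
    if np.dot(w_star, z) * y < 0:                  # enforce y<w*, z> >= 0
        z = -z                                     # exact by Gaussian symmetry
    return y * w_star / np.sqrt(d) + z             # gamma_0 = 1/sqrt(d)

def sample_Q(y, z_dir, zeta, rng):
    """x2 ~ Q_y: supported on the three directions z, z +/- zeta, randomly scaled."""
    alpha = rng.uniform(0.0, 1.0)
    if y == 1:
        return alpha * z_dir                       # label +1 <-> direction z
    b = rng.choice([-1.0, 1.0])
    return alpha * (z_dir + b * zeta)              # label -1 <-> z +/- zeta
```

Here w_star, z_dir, and zeta stand for w⋆, z, and ζ (with ‖ζ‖₂ = r and ζ ⊥ z); combined with the split probabilities p₀ and q₀, these samplers generate the training set.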
Two-layer neural network model. We will use a two-layer neural network with relu activation to learn the data distribution described above. The first layer weights are denoted by U ∈ R^{m×2d} and the second layer weights are denoted by u ∈ R^m. With relu activation, the output of the neural network is u^⊤(1(Ux) ⊙ Ux), where ⊙ denotes the element-wise product of two vectors and 1(z) is the binary vector that contains 1(z_i ≥ 0) as entries. It turns out that we will often be concerned with the object that disentangles the two occurrences of U in the formula u^⊤(1(Ux) ⊙ Ux). We define the following notation to facilitate the reference to such an object. Let

N_A(u, U; x) ≜ u^⊤(1(Ax) ⊙ Ux)    (2.6)

That is, N_A(u, U; x) denotes the function where we compute the activation pattern 1(Ax) by the matrix A instead of U. When u is clear from the context, with slight abuse of notation, we write N_A(U; x) ≜ u^⊤(1(Ax) ⊙ Ux). In this notation, our model is defined as f(u, U; x) = N_U(u, U; x).

We consider several different structures regarding the weight matrices U. The simplest version, which we consider in the main body of this paper, is that U can be decomposed into two blocks, U = [W; V], where W only operates on the first d coordinates (that is, the last d columns of W are zero), and V only operates on the last d coordinates (those coordinates of x₂). Note that W operates on the P component of examples, and V operates on the Q component of examples. In this case, the model can be decomposed into

f(u, U; x) = N_U(u, U; x) = N_W(w, W; x) + N_V(v, V; x) = N_W(w, W; x₁) + N_V(v, V; x₂)

Here we slightly abuse the notation to use W to denote both a matrix of 2d columns with the last d columns being zero, or a matrix of d columns. We also extend our theorem to other U such as a two layer convolution network in Section F.

Training objective. Let ℓ(f; (x, y)) be the loss of the example (x, y) under model f. Throughout the paper we use the logistic loss ℓ(f; (x, y)) = -log(1/(1 + e^{-yf(x)})). We use the standard training loss function L̂ defined as L̂(u, U) = (1/N) Σ_{i∈[N]} ℓ(f(u, U; ·); (x⁽ⁱ⁾, y⁽ⁱ⁾)), and let L̂_S(u, U) denote the average over some subset S of examples instead of the entire dataset. We consider a regularized training objective L̂_λ(u, U) = L̂(u, U) + (λ/2)‖U‖_F². For the simplicity of derivation, the second layer weight vector u is randomly initialized and fixed throughout this paper. Thus, with slight abuse of notation, the training objective can be written as L̂_λ(U) = L̂(u, U) + (λ/2)‖U‖_F².
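Written out, the model and the decoupled object (2.6) are one line each. The following minimal numpy sketch mirrors the definitions above (with the logistic loss written in a numerically stable form); it is an illustration of the definitions, not the training code.

```python
import numpy as np

def N_A(u, U, A, x):
    """N_A(u, U; x) = u^T (1(Ax) * Ux): activation pattern from A, weights from U."""
    gate = (A @ x >= 0).astype(x.dtype)  # 1(Ax), the binary activation pattern
    return u @ (gate * (U @ x))

def forward(u, U, x):
    """f(u, U; x) = N_U(u, U; x); with A = U this is the usual relu network."""
    return N_A(u, U, U, x)

def logistic_loss(f_x, y):
    """ℓ(f; (x, y)) = -log(1 / (1 + exp(-y f(x)))), computed stably."""
    return np.logaddexp(0.0, -y * f_x)
```

With the block structure U = [W; V], forward(u, U, x) splits into N_W(w, W; x₁) + N_V(v, V; x₂), since each row's gate and pre-activation depend only on its own block of coordinates.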
Notations. Here we collect additional notations that will be useful throughout our proofs. The symbol ⊕ will refer to the symmetric difference of two sets or two binary vectors. The symbol \ refers to the set difference. Let us define M₁ to be the set of all i ∈ [N] such that x₁⁽ⁱ⁾ ≠ 0, and let M̄₁ = [N] \ M₁. Let M₂ be the set of all i ∈ [N] such that x₂⁽ⁱ⁾ ≠ 0, and let M̄₂ = [N] \ M₂. We define q = |M̄₁|/N and p = |M̄₂|/N to be the empirical fractions of data containing patterns only from Q and P, respectively. We will sometimes use Ê to denote an empirical expectation over the training samples. For a vector or matrix v, we use supp(v) to denote the set of indices of the non-zero entries of v. For U ∈ R^{m×d} and R ⊂ [m], let U_R be the restriction of U to the subset of rows indexed by R. We use [U]_i to denote the i-th row of U as a row vector in R^{1×d}. Let the symbol ⊙ denote the element-wise product between two vectors or matrices. The notation I_{n×n} will denote the n×n identity matrix, and 1 the all-1's vector whose dimension will be clear from context. We define "with high probability" to mean with probability at least 1 - e^{-C log²(d)} for a sufficiently large constant C. Õ, Ω̃ will be used to hide polylog factors of d.

3 Main Results

The training algorithm that we consider is stochastic gradient descent with spherical Gaussian noise. We remark that we analyze this algorithm as a simplification of the minibatch SGD noise encountered when training real-world networks. There are a number of works theoretically characterizing this particular noise distribution [19, 18, 38], and we leave analysis of this setting to future work. We initialize U₀ to have i.i.d. entries from a Gaussian distribution with variance τ₀², and at each iteration of gradient descent we add spherical Gaussian noise with coordinate-wise variance τ_ξ² to the gradient updates. That is, the learning algorithm for the model is

U₀ ~ N(0, τ₀² I_{m×m} ⊗ I_{d×d})
U_{t+1} = U_t - γ_t (∇_U L̂_λ(u, U_t) + ξ_t) = (1 - γ_t λ) U_t - γ_t (∇_U L̂(u, U_t) + ξ_t)    (3.1)
where ξ_t ~ N(0, τ_ξ² I_{m×m} ⊗ I_{d×d})    (3.2)

and where γ_t denotes the learning rate at time t. We will analyze two algorithms (a schematic implementation follows the descriptions):

Algorithm 1 (L-S): The learning rate is η₁ for t₀ iterations until the training loss drops below the threshold ε₁ + q log 2. Then we anneal the learning rate to γ_t = η₂ (which is assumed to be much smaller than η₁) and run until the training loss drops to ε₂.

Algorithm 2 (S): We use a fixed learning rate of η₂ and stop at training loss ε′₂ ≤ ε₂.
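The following sketch summarizes update (3.1) together with the two learning-rate schedules. It assumes hypothetical oracles grad_loss(U) for ∇_U L̂(u, U) and train_loss(U) for L̂(u, U); the hyperparameter handling is schematic rather than a faithful reproduction of the analysis.

```python
import numpy as np

def noisy_sgd(U, grad_loss, train_loss, lam, tau_xi,
              eta1, eta2, anneal_at, stop_at, rng, max_iters=10**6):
    """Update (3.1): U <- (1 - γλ) U - γ (∇L̂(U) + ξ), with ξ ~ N(0, τ_ξ² I).

    Algorithm 1 (L-S): eta1 > eta2; the rate is annealed once the training
    loss falls below anneal_at (= ε₁ + q log 2), and the run stops at
    stop_at (= ε₂). Algorithm 2 (S): pass eta1 = eta2 (annealing becomes a
    no-op) and stop_at = ε′₂.
    """
    gamma, annealed = eta1, False
    for _ in range(max_iters):
        loss = train_loss(U)
        if loss <= stop_at:
            break
        if not annealed and loss <= anneal_at:
            gamma, annealed = eta2, True            # anneal the learning rate
        xi = rng.normal(0.0, tau_xi, size=U.shape)  # spherical Gaussian noise
        U = (1.0 - gamma * lam) * U - gamma * (grad_loss(U) + xi)
    return U
```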
For the convenience of the analysis, we make the following assumption that we choose τ₀ in a way such that the contribution of the noise in the system stabilizes at the initialization.²

Assumption 3.1. After fixing λ and τ_ξ, we choose the initialization τ₀ and large learning rate η₁ so that

(1 - η₁λ)² τ₀² + η₁² τ_ξ² = τ₀²    (3.3)

²Let τ′₀ be the solution to (3.3) holding τ_ξ, η₁, λ fixed. If the standard deviation of the initialization is chosen to be smaller than τ′₀, then the standard deviation of the noise will grow to τ′₀. Otherwise, if the initialization is chosen to be larger, the contribution of the noise will decrease to the level of τ′₀ due to regularization. In typical analyses of SGD with spherical noise, often as long as either the noise or the learning rate is small enough, the proof goes through. However, here we will make explicit use of the large learning rate or the large noise to show better generalization performance.

As a technical assumption for our proofs, we will also require η₁ ≲ ε₁. We also require sufficient over-parametrization.

Assumption 3.2 (Over-parameterization). We assume throughout the paper that τ₀ = 1/poly(d/ε) and m ≥ poly(d/(ετ₀)), where poly is a sufficiently large constant degree polynomial. We note that we can choose τ₀ arbitrarily small, so long as it is fixed before we choose m. As we will see soon, the precise relation between N, d implies that the level of over-parameterization is polynomial in N, which fits with the conditions assumed in prior works, such as [26, 13].

Assumption 3.3. Throughout this paper, we assume the following dependencies between the parameters. We assume that N, d → ∞ with the relationship N/d = 1/κ², where κ ∈ (0, 1) is a small value.³ We set r = d^{-3/4}, p₀ = κ²/2, and q₀ = Θ(1). The regularizer will be chosen to be λ = d^{-5/4}. All of these choices of hyper-parameters can be relaxed, but for simplicity of exposition we only work in this setting. We note that under our assumptions, for sufficiently large N, p ≈ p₀ and q ≈ q₀ up to constant multiplicative factors. Thus we will mostly work with p and q (the empirical fractions) in the rest of the paper. We also note that our parameter choice satisfies (rd)^{-1}, dλ, λ/r ≤ κ^{O(1)} and λ ≤ r²/(κ²q³p²), which are a few conditions that we frequently use in the technical part of the paper.

³Or, in a non-asymptotic language, we assume that N, d are sufficiently large compared to κ: N, d ≫ poly(1/κ).

Now we present our main theorems regarding the generalization of models trained with the L-S and S algorithms. The final generalization error of the model trained with the L-S algorithm will end up a factor O(κ) = O(p^{1/2}) smaller than the generalization error of the model trained with the S algorithm.

Theorem 3.4 (Analysis of Algorithm L-S). Under Assumptions 3.1, 3.2, and 3.3, there exists a universal constant 0 < c < 1/16 such that Algorithm 1 (L-S) with annealing at loss ε₁ + q log 2 for ε₁ ∈ (d^{-c}, κ²p²q³) and stopping criterion ε₂ = √(ε₁/q) satisfies the following:
1. It anneals the learning rate within Õ(d/(η₁ε₁)) iterations.
2. It stops after at most t = Õ(d/(η₁ε₁) + 1/(η₂rε₁³)) iterations. With probability at least 0.99, the solution U_t has test (classification) error and test loss at most O(pκ log(1/ε₁)).

Roughly, the learning order and generalization of the L-S model is as follows: before annealing the learning rate, the model only learns an effective classifier for P on the ≈ (1-q)N samples in M₁, as the large learning rate creates too much noise to effectively learn Q (Lemma 4.1 and Lemma 4.2). After the learning rate is annealed, the model memorizes Q and correctly classifies examples with only a Q component during test time (formally shown in Lemmas 4.3 and 4.4). For examples with only a P component, the generalization error is (ignoring log factors and other technicalities) p√(d/N) = O(pκ) via standard Rademacher complexity. The full analysis of the L-S algorithm is clarified in Section 4.

Theorem 3.5 (Lower bound for Algorithm S). Let ε₂ be chosen as in Theorem 3.4. Under Assumptions 3.1, 3.2 and 3.3, there exists a universal constant c > 0 such that w.h.p., Algorithm 2 with any η₂ ≤ η₁d^{-c} and any stopping criterion ε′₂ ∈ (d^{-c}, ε₂] achieves training loss ε′₂ in at most Õ(d/(η₂ε′₂)) iterations, and both the test error and the test loss of the obtained solution are at least Ω(p).

We explain this lower bound as follows: the S algorithm will quickly memorize the Q component, which is low noise, and ignore the P component for the ≈ (1-p-q)N examples with both P and Q components (shown in Lemma 5.2). Thus, it only learns P on ≈ pN examples. It obtains a small margin on these examples and therefore misclassifies a constant fraction of P-only examples at test time. This results in the lower bound of Ω(p). We formalize the analysis in Section 5.

Decoupling the Iterates. It will be fruitful for our analysis to separately consider the gradient signal and Gaussian noise components of the weight matrix U_t. We will decompose the weight matrix U_t as follows: U_t = Ū_t + Ũ_t. In this formula, Ū_t denotes the signal from all the gradient updates accumulated over time, and Ũ_t refers to the noise accumulated over time:

Ū_t = -Σ_{s=1}^{t} γ_{s-1} (Π_{i=s}^{t-1} (1 - γ_i λ)) ∇L̂(U_{s-1})
Ũ_t = (Π_{i=0}^{t-1} (1 - γ_i λ)) U₀ - Σ_{s=1}^{t} γ_{s-1} (Π_{i=s}^{t-1} (1 - γ_i λ)) ξ_{s-1}    (3.4)

Note that when the learning rate γ_t is always η, the formula simplifies to Ū_t = -Σ_{s=1}^{t} η(1-ηλ)^{t-s} ∇L̂(U_{s-1}) and Ũ_t = (1-ηλ)^t U₀ - Σ_{s=1}^{t} η(1-ηλ)^{t-s} ξ_{s-1}. The decoupling and our particular choice of initialization satisfy that the noise updates in the system stabilize at initialization, so the marginal distribution of Ũ_t is always the same as the initialization. Another nice aspect of the signal-noise decomposition is as follows: we use tools from [6] to show that if the signal term Ū is small, then using only the noise component Ũ to compute the activations roughly preserves the output of the network. This facilitates our analysis of the network dynamics. See Section A.1 for full details.
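Unrolling the update (3.1) gives exactly (3.4), since the signal and noise parts follow the same linear recursion as the full iterate. The short sketch below (again with a placeholder gradient oracle) tracks Ū_t and Ũ_t alongside U_t and asserts that they sum to the iterate, a cheap sanity check when experimenting with this decomposition.

```python
import numpy as np

def decoupled_sgd(U0, grad_loss, gammas, lam, tau_xi, rng):
    """Track U_t with its signal part Ū_t and noise part Ũ_t from (3.4).

    All three follow the same affine recursion with multiplier (1 - γλ),
    so Ū_t + Ũ_t = U_t holds exactly at every step.
    """
    U, U_bar, U_tilde = U0.copy(), np.zeros_like(U0), U0.copy()
    for gamma in gammas:
        g = grad_loss(U)                            # ∇L̂(U_t), evaluated at the full iterate
        xi = rng.normal(0.0, tau_xi, size=U.shape)  # ξ_t
        U       = (1 - gamma * lam) * U       - gamma * (g + xi)
        U_bar   = (1 - gamma * lam) * U_bar   - gamma * g   # accumulated gradient signal
        U_tilde = (1 - gamma * lam) * U_tilde - gamma * xi  # initialization plus noise
        assert np.allclose(U, U_bar + U_tilde)
    return U, U_bar, U_tilde
```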
Decomposition of Network Outputs. For convenience, we will explicitly decompose the model prediction at each time into two components, each of which operates on one pattern: we have N_{U_t}(u, U_t; x) = g_t(x) + r_t(x), where

g_t(x) = g_t(x₂) ≜ N_{V_t}(v, V_t; x) = N_{V_t}(v, V_t; x₂)    (3.5)
r_t(x) = r_t(x₁) ≜ N_{W_t}(w, W_t; x) = N_{W_t}(w, W_t; x₁)    (3.6)

In other words, the network g_t acts on the Q component of examples, and the network r_t acts on the P component of examples.

4 Characterization of Algorithm 1 (L-S)

We characterize the behavior of algorithm L-S with large initial learning rate. We provide proof sketches in Section B.1 with full proofs in Section D.

Phase I: initial learning rate η₁. The following lemma bounds the rate of convergence to the point where the loss gets annealed. It also bounds the total gradient signal accumulated by this point.

Lemma 4.1. In the setting of Theorem 3.4, at some time step t₀ ≤ Õ(d/(η₁ε₁)), the training loss L̂(U_{t₀}) becomes smaller than q log 2 + ε₁. Moreover, we have ‖Ū_{t₀}‖_F² = O(d log²(1/ε₁)).

Our proof of Lemma 4.1 views the SGD dynamics as optimization with respect to the neural tangent kernel induced by the activation patterns, where the kernel is rapidly changing due to the noise terms ξ. This is in contrast to the standard NTK regime, where the activation patterns are assumed to be stable [13, 26]. Our analysis extends the NTK techniques to deal with a sequence of changing kernels which share a common optimal classifier (see Section B.1 and Theorem B.2 for additional details).

The next lemma says that with large initial learning rate, the function g_t does not learn anything meaningful for the Q component before the 1/(η₁λ)-th time step. Note that by our choice of parameters 1/λ ≫ d and Lemma 4.1, we anneal at the time step Õ(d/(η₁ε₁)) ≤ 1/(η₁λ). Therefore, the function has not learned anything meaningful about the memorizable pattern on distribution Q before we anneal.

Lemma 4.2. In the setting of Theorem 3.4, w.h.p., for every t ≤ 1/(η₁λ),

|g_t(z + ζ) + g_t(z - ζ) - 2g_t(z)| ≤ Õ(r²/λ) = Õ(d^{-1/4})    (4.1)

Phase II: after annealing the learning rate to η₂. After iteration t₀, we decrease the learning rate to η₂. The following lemma bounds how fast the loss converges after annealing.

Lemma 4.3. In the setting of Theorem 3.4, there exists t = Õ(1/(ε₁³η₂r)) such that after t₀ + t iterations, we have that L̂(U_{t₀+t}) = O(√(ε₁/q)). Moreover, ‖Ū_{t₀+t} - Ū_{t₀}‖_F² ≤ Õ(1/(ε₁²r)) ≤ O(d).

The following lemma bounds the training loss on the example subsets M₁, M̄₁.

Lemma 4.4. In the setting of Lemma 4.3, using the same t = Õ(1/(ε₁³η₂r)), the average training losses on the subsets M₁ and M̄₁ are both good in the sense that

L̂_{M₁}(r_{t₀+t}) = O(√(ε₁/q)) and L̂_{M̄₁}(g_{t₀+t}) = O(√(ε₁/q³))    (4.2)

Intuitively, low training loss of g_{t₀+t} on M̄₁ immediately implies good generalization on examples containing patterns from Q. Meanwhile, the classifier for P, r_{t₀+t}, has low loss on (1-q)N examples. Then the test error bound follows from standard Rademacher complexity tools applied to these (1-q)N examples.
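The left-hand side of (4.1) is a discrete second difference of g_t across the three support directions of Q: it vanishes for any function that is linear on the segment from z - ζ to z + ζ, and fitting Q forces it to be large, since z must be separated from z ± ζ. Lemma 4.2 therefore says that Q is not memorized during the large-learning-rate phase. A small illustrative probe of this quantity (names hypothetical, applicable to any network g) might look as follows.

```python
def memorization_probe(g, z, zeta):
    """Second difference |g(z+ζ) + g(z-ζ) - 2 g(z)| from Lemma 4.2 / (4.1).

    Near zero: g is still (locally) linear across the three Q directions,
    so Q is not yet memorized. Large: g separates z from z ± ζ.
    """
    return abs(g(z + zeta) + g(z - zeta) - 2.0 * g(z))
```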
5 Characterization of Algorithm 2 (S)

We present our small learning rate lemmas, with proof sketches in Section B.2 and full proofs in Section E.

Training loss convergence. The lemma below shows that the algorithm will converge to small training error too quickly: in particular, the norm of W_t is not large enough to produce a large margin solution for those x such that x₂ = 0.

Lemma 5.1. In the setting of Theorem 3.5, there exists a time t₀ = Õ(1/(η₂ε′₂³r)) such that L̂_{M₂}(U_{t₀}) ≤ ε′₂. Moreover, there exists t = Õ(1/(η₂ε′₂³r) + Np/(η₂ε′₂)) such that L̂(U_t) ≤ ε′₂ after t iterations, and we have that ‖Ū_t‖_F² ≤ Õ(1/(ε′₂²r) + Np).

Lower bound on the generalization error. The following important lemma states that our classifier for P does not learn much from the examples in M₂. Intuitively, under a small learning rate, the classifier will already learn so quickly from the Q component of these examples that it will not learn from the P component of examples in M₁ ∩ M₂. We make this precise by showing that the magnitude of the gradients on M₂ is small.

Lemma 5.2. In the setting of Theorem 3.5, let

W_t^{(2)} = (η₂/N) Σ_{s≤t} (1 - η₂λ)^{t-s} Σ_{i∈M₂} ∇_W L̂_{{i}}(U_s)    (5.1)

be the (accumulated) gradient of the weight W, restricted to the subset M₂. Then, for every t = O(d/(η₂ε′₂)), we have ‖W_t^{(2)}‖_F ≤ Õ(d^{15/32}/ε′₂²). For notational simplicity, we will define ε₃ = d^{-1/32}/ε′₂². Then ‖W_t^{(2)}‖_F ≤ Õ(√d · ε₃).

The above lemma implies that W does not learn much from examples in M₂, and therefore must overfit to the pN examples in M̄₂. As pN ≤ d/2 by our choice of parameters, we will not have enough samples to learn the d-dimensional distribution P. The following lemma formalizes the intuition that the margin will be poor on samples from P.

Lemma 5.3. There exists α ∈ R^d with α ∈ span{x₁⁽ⁱ⁾}_{i∈M̄₂} and ‖α‖₂ = Ω̃(√(Np)) such that w.h.p. over a randomly chosen x₁, we have that

r_t(x₁) - r_t(-x₁) = 2⟨α, x₁⟩ ± Õ(ε₃)    (5.2)

As the margin is poor, the predictions will be heavily influenced by noise. We use this intuition to prove the classification lower bound for Theorem 3.5.

6 Experiments

Our theory suggests that adding noise to the network could be an effective strategy to regularize a small learning rate in practice. We test this empirically by adding small Gaussian noise during training before every activation layer in a WideResNet16 [42] architecture, as our analysis highlights pre-activation noise as a key regularization mechanism of SGD. The noise level is annealed over time. We demonstrate on CIFAR-10 images without data augmentation that this regularization can indeed counteract the negative effects of small learning rate, as we report a 4.72% increase in validation accuracy when adding noise to a small learning rate. Full details are in Section H.1.
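For concreteness, the noise injection can be implemented as a small module placed before each activation, for instance wrapping each relu as nn.Sequential(PreActivationNoise(), nn.ReLU()). This is an illustrative PyTorch sketch rather than the code behind the reported numbers; the default scale and anneal factor are placeholders.

```python
import torch
import torch.nn as nn

class PreActivationNoise(nn.Module):
    """Adds N(0, sigma^2) noise to pre-activations during training.

    Insert immediately before each nonlinearity; call anneal() at the
    chosen epoch to shrink the noise, mirroring learning-rate annealing.
    """
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def anneal(self, factor=10.0):
        # Reduce the noise by a constant factor at the chosen epoch.
        self.sigma /= factor

    def forward(self, x):
        if self.training and self.sigma > 0:
            return x + self.sigma * torch.randn_like(x)
        return x  # no noise at evaluation time
```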
We will also empirically demonstrate that the choice of large vs. small initial learning rate can indeed invert the learning order of different example types. We add a memorizable 7×7 pixel patch to a subset of CIFAR-10 images following the scenario presented in (1.1), such that around 20% of images have no patch, 16% of images contain only a patch, and 64% contain both CIFAR-10 data and patch. We generate the patches so that they are not easily separable, as in our constructed Q, but they are low in variation and therefore easy to memorize. Precise details on producing the data, including a visualization of the patch, are in Section H.2.

We train on the modified dataset using WideResNet16 with 3 methods: large learning rate with annealing at the 30th epoch, small initial learning rate, and small learning rate with noise annealed at the 30th epoch.

Figure 3: Accuracy vs. epoch on patch-augmented CIFAR-10. The gray line indicates annealing of activation noise and learning rate. Left: Clean validation set. Right: Images containing only the patch.

Figure 3 depicts the validation accuracy vs. epoch on clean (no patch) and patch-only images. From the plots, it is apparent that the small learning rate picks up the signal in the patch very quickly, whereas the other two methods only memorize the patch after annealing. From the validation accuracy on clean images, we can deduce that the small learning rate method is indeed learning the CIFAR images using a small fraction of all the available data, as the validation accuracy of a small LR model when training on the full dataset is around 83%, but the validation accuracy on clean data after training with the patch is 70%. We provide additional arguments in Section H.2.

7 Conclusion

In this work, we show that the order in which a neural net learns to fit different types of patterns plays a crucial role in generalization. To demonstrate this, we construct a distribution on which models trained with large learning rates generalize provably better than those trained with small learning rates due to learning order. Our analysis reveals that more SGD noise, or larger learning rate, biases the model towards learning "generalizing" kernels rather than "memorizing" kernels. We confirm on artificially modified CIFAR-10 data that the scale of the learning rate can indeed influence learning order and generalization. Inspired by these findings, we propose a mitigation strategy that injects noise before the activations and works both theoretically for our construction and empirically. The design of better algorithms for regularizing learning order is an exciting question for future work.

Acknowledgements

CW acknowledges support from an NSF Graduate Research Fellowship.

References

[1] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? CoRR, abs/1902.01028, 2019. URL http://arxiv.org/abs/1902.01028.
[2] Zeyuan Allen-Zhu and Yuanzhi Li. What can ResNet learn efficiently, going beyond kernels? CoRR, abs/1905.10337, 2019. URL http://arxiv.org/abs/1905.10337.
[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, November 2018.
[4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065, 2018.
[5] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065, 2018.
[6] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, November 2018.
[7] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR, abs/1901.08584, 2019. URL http://arxiv.org/abs/1901.08584.
[8] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
[9] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.
[10] Jinghui Chen and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018.
[11] Xiaowu Dai and Yuhua Zhu. Towards theoretical understanding of large batch training in stochastic gradient descent. arXiv preprint arXiv:1812.00542, 2018.
[12] Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabás Póczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: don't be afraid of spurious local minima. In International Conference on Machine Learning (ICML). http://arxiv.org/abs/1712.00779, 2018.
[13] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. ArXiv e-prints, 2018.
[14] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
[15] Rong Ge, Sham M. Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure. arXiv e-prints, art. arXiv:1904.12838, Apr 2019.
[16] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[18] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731-1741, 2017.
[19] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
[20] Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. DNN's sharpest directions along the SGD trajectory. arXiv preprint arXiv:1807.05031, 2018.
[21] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
[22] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? CoRR, abs/1802.06175, 2018. URL http://arxiv.org/abs/1802.06175.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[26] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157-8166, 2018.
[27] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. CoRR, abs/1712.09203, 2017. URL http://arxiv.org/abs/1712.09203.
[28] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[29] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
[30] Karttikeya Mangalam and Vinay Prabhu. Do deep neural networks learn shallow learnable examples first? June 2019.
[31] Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. arXiv e-prints, art. arXiv:1905.11604, May 2019.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Leslie N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464-472. IEEE, 2017.
[34] Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
[35] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[36] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822-2878, 2018.
[37] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp, COURSERA: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
[38] Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay between optimization and generalization of stochastic gradient descent with covariance noise. arXiv preprint arXiv:1902.08234, 2019.
[39] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148-4158, 2017.
[40] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770, 2018.
[41] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
[42] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[43] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.