쿠...sal: [Computer] Configuring the Hadoop daemons

Hadoop settings / configuration /

Two kinds of config files

Read-only default configuration
- core-default.xml : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
- hdfs-default.xml : https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
- yarn-default.xml : https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
- mapred-default.xml : https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Site-specific configuration
- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/yarn-site.xml
- etc/hadoop/mapred-site.xml
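
The relationship between the two kinds of files can be sketched as follows: a property set in a *-site.xml file takes precedence over the same property in the matching *-default.xml. This is only a minimal Python illustration, not Hadoop's own loader. The helper names `parse_properties` and `effective_config` and the two sample documents are assumptions made for this example; only the `<configuration>` / `<property>` / `<name>` / `<value>` layout comes from the Hadoop config files themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample contents; the real files hold many more properties.
DEFAULT_XML = """
<configuration>
  <property><name>fs.defaultFS</name><value>file:///</value></property>
  <property><name>io.file.buffer.size</name><value>4096</value></property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://host:port/</value></property>
</configuration>
"""


def parse_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site-specific values override them."""
    merged = parse_properties(default_xml)
    merged.update(parse_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
for name, value in sorted(config.items()):
    print(f"{name} = {value}")
```

Here `fs.defaultFS` comes from the site file, while `io.file.buffer.size` keeps its default because the site file does not mention it.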

configurations

Ref. 1 gives the default value and the meaning of each setting.

etc/hadoop/core-site.xml

fs.defaultFS
io.file.buffer.size

etc/hadoop/hdfs-site.xml

Name Node
- dfs.namenode.name.dir
- dfs.hosts / dfs.hosts.exclude
- dfs.blocksize
- dfs.namenode.handler.count
Data Node
- dfs.datanode.data.dir

etc/hadoop/yarn-site.xml

Resource Manager, Node Manager
- yarn.acl.enable
- yarn.admin.acl
- yarn.log-aggregation-enable
Resource Manager
- yarn.resourcemanager.address
- yarn.resourcemanager.scheduler.address
- yarn.resourcemanager.resource-tracker.address
- yarn.resourcemanager.admin.address
- yarn.resourcemanager.webapp.address
- yarn.resourcemanager.hostname
- yarn.resourcemanager.scheduler.class
- yarn.scheduler.minimum-allocation-mb
- yarn.scheduler.maximum-allocation-mb
- yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path
Node Manager
- yarn.nodemanager.resource.memory-mb
- yarn.nodemanager.vmem-pmem-ratio
- yarn.nodemanager.local-dirs
- yarn.nodemanager.log-dirs
- yarn.nodemanager.log.retain-seconds
- yarn.nodemanager.remote-app-log-dir
- yarn.nodemanager.remote-app-log-dir-suffix
- yarn.nodemanager.aux-services
History Server
- yarn.log-aggregation.retain-seconds
- yarn.log-aggregation.retain-check-interval-seconds
Settings used for Node Manager health monitoring
- yarn.nodemanager.health-checker.script.path
- yarn.nodemanager.health-checker.script.opts
- yarn.nodemanager.health-checker.interval-ms
- yarn.nodemanager.health-checker.script.timeout-ms

etc/hadoop/mapred-site.xml

MapReduce Applications
- mapreduce.framework.name
- mapreduce.map.memory.mb
- mapreduce.map.java.opts
- mapreduce.reduce.memory.mb
- mapreduce.reduce.java.opts
- mapreduce.task.io.sort.mb
- mapreduce.task.io.sort.factor
- mapreduce.reduce.shuffle.parallelcopies
MapReduce JobHistory Server
- mapreduce.jobhistory.address
- mapreduce.jobhistory.webapp.address
- mapreduce.jobhistory.intermediate-done-dir
- mapreduce.jobhistory.done-dir

table

core-site.xml

| Parameter | Value | Notes |
| --- | --- | --- |
| fs.defaultFS | NameNode URI | hdfs://host:port/ |
| io.file.buffer.size | 131072 | Size of the read/write buffer used in SequenceFiles. |
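
To make the two core-site.xml settings concrete, here is a small Python sketch that renders them in the `<configuration>` layout Hadoop reads. The function name `render_site_xml` is hypothetical, and `hdfs://host:port/` is the placeholder URI from the table, not a working address.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Build a <configuration> document from a {name: value} mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


core_site = render_site_xml({
    "fs.defaultFS": "hdfs://host:port/",  # placeholder: replace host and port
    "io.file.buffer.size": 131072,        # value from the table above
})
print(core_site)
```

The output is a single-line XML string; it would have to be saved as etc/hadoop/core-site.xml, with a real host and port, before a daemon could use it.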

hdfs-site.xml: Name Node

| Parameter | Value | Notes |
| --- | --- | --- |
| dfs.namenode.name.dir | Path on the local filesystem where the NameNode persistently stores the namespace and transaction logs. | If this is a comma-delimited list of directories, the name table is replicated in all of them for redundancy. |
| dfs.hosts / dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable DataNodes. |
| dfs.blocksize | 268435456 | HDFS block size of 256 MB for large file systems. |
| dfs.namenode.handler.count | 100 | More NameNode server threads to handle RPCs from a large number of DataNodes. |
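
The dfs.blocksize value is given in bytes, which is easy to misread. The arithmetic below is a quick Python check that 268435456 bytes is 256 MB, plus a rough count of how many blocks a file of a given size would occupy. The helper `block_count` is a made-up name for this example and ignores replication.

```python
BLOCK_SIZE = 268435456  # dfs.blocksize from the table, in bytes


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file, rounding the last block up."""
    return (file_size_bytes + block_size - 1) // block_size


block_size_mb = BLOCK_SIZE // (1024 * 1024)
blocks_for_1gib = block_count(1024 ** 3)
print(block_size_mb, blocks_for_1gib)
```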

hdfs-site.xml: Data Node

| Parameter | Value | Notes |
| --- | --- | --- |
| dfs.datanode.data.dir | Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, data is stored in all named directories, typically on different devices. |

yarn-site.xml: Resource Manager and Node Manager

| Parameter | Value | Notes |
| --- | --- | --- |
| yarn.acl.enable | true / false | Whether to enable Access Control Lists (ACLs). Defaults to false. |
| yarn.admin.acl | Admin ACL | ACL that sets the admins of the cluster. An ACL has the form: comma-separated users, a space, then comma-separated groups. Defaults to the special value *, which means anyone. The special value of a single space means no one has access. |
| yarn.log-aggregation-enable | false | Enables or disables log aggregation. |
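
The yarn.admin.acl string format (comma-separated users, a space, then comma-separated groups) can be illustrated with a short parser. This is a Python sketch of the format as described in the table, not YARN's own ACL code; `parse_acl`, `is_admin`, and the user and group names are all invented for the example.

```python
def parse_acl(acl):
    """Split an ACL string into (users, groups); None means anyone is allowed."""
    if acl.strip() == "*":
        return None
    users_part, _, groups_part = acl.partition(" ")
    users = [u for u in users_part.split(",") if u]
    groups = [g for g in groups_part.strip().split(",") if g]
    return users, groups


def is_admin(acl, user, user_groups=()):
    """Check a user (and the groups they belong to) against an ACL string."""
    parsed = parse_acl(acl)
    if parsed is None:
        return True
    users, groups = parsed
    return user in users or any(g in groups for g in user_groups)


print(parse_acl("alice,bob hadoop-admins"))
print(is_admin(" ", "alice"))  # a single space: no one has access
```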

yarn-site.xml: Resource Manager

| Parameter | Value | Notes |
| --- | --- | --- |
| yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
| yarn.resourcemanager.hostname | ResourceManager host. | host. A single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Each ResourceManager component then uses its default port. |
| yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | Use this to change the scheduler: CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. Use a fully qualified class name, e.g., org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler. |
| yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MB. |
| yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MB. |
| yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |
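
How yarn.resourcemanager.hostname stands in for the individual address settings can be sketched like this. The port numbers are the defaults listed in yarn-default.xml; the function `resolve_rm_addresses` and the hostnames are hypothetical, and the code only mimics the override rule described in the table rather than reproducing YARN's own resolution logic.

```python
# Default ports as listed in yarn-default.xml.
DEFAULT_PORTS = {
    "yarn.resourcemanager.address": 8032,
    "yarn.resourcemanager.scheduler.address": 8030,
    "yarn.resourcemanager.resource-tracker.address": 8031,
    "yarn.resourcemanager.admin.address": 8033,
    "yarn.resourcemanager.webapp.address": 8088,
}


def resolve_rm_addresses(hostname, overrides=None):
    """Derive every *address setting from one hostname.

    An explicitly set address wins over the hostname-derived one.
    """
    overrides = overrides or {}
    return {
        key: overrides.get(key, f"{hostname}:{port}")
        for key, port in DEFAULT_PORTS.items()
    }


addresses = resolve_rm_addresses(
    "rm.example.com",  # hypothetical ResourceManager host
    overrides={"yarn.resourcemanager.webapp.address": "ui.example.com:9000"},
)
for key, value in addresses.items():
    print(f"{key} = {value}")
```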

yarn-site.xml: Node Manager

| Parameter | Value | Notes |
| --- | --- | --- |
| yarn.nodemanager.resource.memory-mb | Available physical memory, in MB, for the given NodeManager. | Defines the total resources on the NodeManager that are made available to running containers. |
| yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which the virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio, i.e., how many times the physical memory may be used as virtual memory. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
| yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk I/O. |
| yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk I/O. |
| yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log aggregation is disabled. |
| yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory to which application logs are moved when the application completes. Appropriate permissions must be set. Only applicable if log aggregation is enabled. |
| yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs are aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log aggregation is enabled. |
| yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for MapReduce applications. |
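
The way the remote log directory, the user name, and the suffix combine into the aggregated log location can be written out as a one-line path join. This Python sketch only follows the pattern stated for yarn.nodemanager.remote-app-log-dir-suffix; `aggregated_log_dir` and the user `alice` are invented names, and any per-application subdirectories below this path are not modeled.

```python
import posixpath


def aggregated_log_dir(remote_app_log_dir, user, suffix):
    """Join remote log dir, user, and suffix the way the table describes."""
    return posixpath.join(remote_app_log_dir, user, suffix)


# "/logs" and "logs" are the values shown in the table; "alice" is made up.
log_dir = aggregated_log_dir("/logs", "alice", "logs")
print(log_dir)
```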

yarn-site.xml: History Server

| Parameter | Value | Notes |
| --- | --- | --- |
| yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them. -1 disables deletion. Be careful: set this too small and you will spam the NameNode. |
| yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value, the value is computed as one-tenth of the aggregated log retention time. Be careful: set this too small and you will spam the NameNode. |
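
The interaction between the two retention settings is easier to see as code. The function below is a Python paraphrase of the rule in the table (a check interval of 0 or less falls back to one-tenth of the retention time), not the actual YARN implementation. The name `effective_check_interval` is hypothetical, and returning `None` when retention is disabled is a choice made for this sketch.

```python
def effective_check_interval(retain_seconds, check_interval_seconds):
    """Return the interval, in seconds, between retention checks.

    None means retention is disabled (-1), so there is nothing to check.
    """
    if retain_seconds < 0:
        return None
    if check_interval_seconds <= 0:
        return retain_seconds // 10  # one-tenth of the retention time
    return check_interval_seconds


one_week = 7 * 24 * 60 * 60
print(effective_check_interval(one_week, -1))
```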

mapred-site.xml: MapReduce Applications

| Parameter | Value | Notes |
| --- | --- | --- |
| mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
| mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
| mapreduce.map.java.opts | -Xmx1024M | Larger heap size for child JVMs of maps. |
| mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
| mapreduce.reduce.java.opts | -Xmx2560M | Larger heap size for child JVMs of reduces. |
| mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency. |
| mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
| mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps. |
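
The memory values in this table come in pairs: a container limit (`*.memory.mb`) and a JVM heap (`-Xmx` in `*.java.opts`) that has to fit inside it. The Python sketch below checks that relationship for the values shown, and also computes a virtual memory ceiling from yarn.nodemanager.vmem-pmem-ratio. The helper names are hypothetical, the ratio of 2 is just an example input, and the regular expression handles only the plain `-Xmx<number><unit>` form.

```python
import re


def parse_xmx_mb(java_opts):
    """Extract the -Xmx heap size in MB from a JVM options string."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError(f"no -Xmx setting found in {java_opts!r}")
    amount, unit = int(match.group(1)), match.group(2).lower()
    factor = {"k": 1 / 1024, "m": 1, "g": 1024}[unit]
    return amount * factor


def heap_headroom_mb(container_mb, java_opts):
    """Memory left in the container for non-heap use; negative means too big."""
    return container_mb - parse_xmx_mb(java_opts)


def vmem_limit_mb(container_mb, vmem_pmem_ratio):
    """Virtual memory ceiling implied by the physical limit and the ratio."""
    return container_mb * vmem_pmem_ratio


map_headroom = heap_headroom_mb(1536, "-Xmx1024M")     # map values from the table
reduce_headroom = heap_headroom_mb(3072, "-Xmx2560M")  # reduce values from the table
print(map_headroom, reduce_headroom, vmem_limit_mb(1536, 2))
```

With the table's values, both the map and the reduce settings leave 512 MB between the heap and the container limit.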

mapred-site.xml: MapReduce JobHistory Server

| Parameter | Value | Notes |
| --- | --- | --- |
| mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
| mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
| mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
| mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |

References

1. Apache Hadoop 2.10.1 – Hadoop Cluster Setup
