Hadoop configuration
Configuring the Hadoop daemons
Two kinds of config files
- Read-only default configuration
  - core-default.xml : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
  - hdfs-default.xml : https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
  - yarn-default.xml : https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
  - mapred-default.xml : https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
- Site-specific configuration
  - etc/hadoop/core-site.xml
  - etc/hadoop/hdfs-site.xml
  - etc/hadoop/yarn-site.xml
  - etc/hadoop/mapred-site.xml
Configurations
See ref. 1 for the default value and the meaning of each setting.
etc/hadoop/core-site.xml
- fs.defaultFS
- io.file.buffer.size
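A minimal sketch of what etc/hadoop/core-site.xml might look like with the two properties above; the NameNode host and port (nn.example.com:8020) are placeholders, and 131072 is the buffer size shown in the table below.

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/core-site.xml : site-specific overrides of core-default.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hypothetical NameNode URI; replace host:port with your own -->
    <value>hdfs://nn.example.com:8020/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <!-- 128 KB read/write buffer used in SequenceFiles -->
    <value>131072</value>
  </property>
</configuration>
```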
etc/hadoop/hdfs-site.xml
- NameNode
  - dfs.namenode.name.dir
  - dfs.hosts / dfs.hosts.exclude
  - dfs.blocksize
  - dfs.namenode.handler.count
- DataNode
  - dfs.datanode.data.dir
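A sketch of etc/hadoop/hdfs-site.xml covering the NameNode and DataNode properties above; the local directory paths are placeholder assumptions, while the block size and handler count come from the table below.

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/hdfs-site.xml : NameNode and DataNode settings -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- placeholder paths; a comma-delimited list replicates the name table in every directory -->
    <value>/data/1/hdfs/name,/data/2/hdfs/name</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <!-- 256 MB blocks for large file-systems -->
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <!-- more server threads to handle RPCs from many DataNodes -->
    <value>100</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <!-- placeholder paths; blocks are spread across all listed directories -->
    <value>/data/1/hdfs/data,/data/2/hdfs/data</value>
  </property>
</configuration>
```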
etc/hadoop/yarn-site.xml (example below)
- ResourceManager and NodeManager
  - yarn.acl.enable
  - yarn.admin.acl
  - yarn.log-aggregation-enable
- ResourceManager
  - yarn.resourcemanager.address
  - yarn.resourcemanager.scheduler.address
  - yarn.resourcemanager.resource-tracker.address
  - yarn.resourcemanager.admin.address
  - yarn.resourcemanager.webapp.address
  - yarn.resourcemanager.hostname
  - yarn.resourcemanager.scheduler.class
  - yarn.scheduler.minimum-allocation-mb
  - yarn.scheduler.maximum-allocation-mb
  - yarn.resourcemanager.nodes.include-path
- NodeManager
  - yarn.nodemanager.resource.memory-mb
  - yarn.nodemanager.vmem-pmem-ratio
  - yarn.nodemanager.local-dirs
  - yarn.nodemanager.log-dirs
  - yarn.nodemanager.log.retain-seconds
  - yarn.nodemanager.remote-app-log-dir
  - yarn.nodemanager.remote-app-log-dir-suffix
  - yarn.nodemanager.aux-services
- History Server
  - yarn.log-aggregation.retain-seconds
  - yarn.log-aggregation.retain-check-interval-seconds
- Settings used for NodeManager health monitoring
  - yarn.nodemanager.health-checker.script.path
  - yarn.nodemanager.health-checker.script.opts
  - yarn.nodemanager.health-checker.interval-ms
  - yarn.nodemanager.health-checker.script.timeout-ms
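A partial etc/hadoop/yarn-site.xml sketch touching each group above. The RM hostname (rm.example.com), the 8192 MB memory figure, and the health-check script path are illustrative assumptions; mapreduce_shuffle and /logs are the values from the table below.

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/yarn-site.xml : ResourceManager / NodeManager settings -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <!-- hypothetical RM host; the yarn.resourcemanager.*.address values then use default ports -->
    <value>rm.example.com</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <!-- HDFS directory that completed application logs are moved to -->
    <value>/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <!-- example only: 8 GB of physical memory offered to containers on this node -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <!-- shuffle service required by MapReduce applications -->
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.health-checker.script.path</name>
    <!-- hypothetical path to an admin-provided health-check script -->
    <value>/etc/hadoop/health_check.sh</value>
  </property>
</configuration>
```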
etc/hadoop/mapred-site.xml (example below)
- MapReduce applications
  - mapreduce.framework.name
  - mapreduce.map.memory.mb
  - mapreduce.map.java.opts
  - mapreduce.reduce.memory.mb
  - mapreduce.reduce.java.opts
  - mapreduce.task.io.sort.mb
  - mapreduce.task.io.sort.factor
  - mapreduce.reduce.shuffle.parallelcopies
- MapReduce JobHistory Server
  - mapreduce.jobhistory.address
  - mapreduce.jobhistory.webapp.address
  - mapreduce.jobhistory.intermediate-done-dir
  - mapreduce.jobhistory.done-dir
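A sketch of etc/hadoop/mapred-site.xml using the memory sizes from the table below; the JobHistory host (jhs.example.com) is a placeholder, with 10020 being the default port mentioned in the table.

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/mapred-site.xml : run MapReduce on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <!-- keep the child JVM heap below the map container size above -->
    <value>-Xmx1024M</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560M</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <!-- hypothetical JobHistory Server host; 10020 is the default port -->
    <value>jhs.example.com:10020</value>
  </property>
</configuration>
```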
Parameter reference table
Parameter | Description | Value | Notes |
---|---|---|---|
fs.defaultFS | The NameNode URI. | NameNode URI | hdfs://host:port/ |
io.file.buffer.size | Size of the read/write buffer used for SequenceFiles. | 131072 | Size of read/write buffer used in SequenceFiles. |
dfs.namenode.name.dir | Path on the local filesystem where the NameNode persistently stores the namespace and transaction logs. | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
dfs.hosts / dfs.hosts.exclude | List of permitted/excluded DataNodes. | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable DataNodes. |
dfs.blocksize | The HDFS block size. | 268435456 | HDFS blocksize of 256MB for large file-systems. |
dfs.namenode.handler.count | Number of additional NameNode server threads to handle RPCs from a large number of DataNodes. | 100 | More NameNode server threads to handle RPCs from large number of DataNodes. |
dfs.datanode.data.dir | Paths on the DataNode's local filesystem where it stores the blocks it holds. | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
yarn.acl.enable | Whether to use Access Control Lists (ACLs); defaults to false. | true / false | Enable ACLs? Defaults to false. |
yarn.admin.acl | ACL that sets the admins of the cluster. | Admin ACL | ACL to set admins on the cluster. ACLs are of the form "comma-separated users" space "comma-separated groups". Defaults to the special value of * which means anyone. The special value of just a space means no one has access. |
yarn.log-aggregation-enable | Whether to enable log aggregation. | false | Configuration to enable or disable log aggregation. |
yarn.resourcemanager.address | ResourceManager address that clients use to submit jobs. | ResourceManager host:port for clients to submit jobs. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.scheduler.address | ResourceManager address that ApplicationMasters use to talk to the Scheduler to obtain resources. | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.resource-tracker.address | ResourceManager address for NodeManagers. | ResourceManager host:port for NodeManagers. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.admin.address | ResourceManager address for administrative commands. | ResourceManager host:port for administrative commands. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.webapp.address | ResourceManager web UI address. | ResourceManager web-ui host:port. | host:port If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.hostname | Single hostname; all yarn.resourcemanager*address settings are derived from it, using each component's default port. | ResourceManager host. | host Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. |
yarn.resourcemanager.scheduler.class | Used to change the scheduler in use. | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. Use a fully qualified class name, e.g., org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler. |
yarn.scheduler.minimum-allocation-mb | Minimum amount of memory the ResourceManager allocates to each container request. | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs |
yarn.scheduler.maximum-allocation-mb | Maximum amount of memory the ResourceManager allocates to each container request. | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |
yarn.nodemanager.resource.memory-mb | Physical memory available to the NodeManager. | Resource i.e. available physical memory, in MB, for given NodeManager. | Defines total available resources on the NodeManager to be made available to running containers. |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which each task's virtual memory usage may exceed its physical memory limit (i.e. how many times the physical memory may be used as virtual memory). | Maximum ratio by which virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
yarn.nodemanager.local-dirs | Paths on the local filesystem where intermediate data is written; multiple paths help spread disk I/O. | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log-dirs | Paths on the local filesystem where logs are written; multiple paths help spread disk I/O. | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log.retain-seconds | How long the NodeManager keeps log files; only applicable if log aggregation is disabled. | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | HDFS directory to which application logs are moved on application completion; appropriate permissions must be set. Only used if log aggregation is enabled. | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.remote-app-log-dir-suffix | Suffix appended to the remote log dir; logs are stored under ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only used if log aggregation is enabled. | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | Shuffle service that needs to be set for MapReduce applications. | mapreduce_shuffle | Shuffle service that needs to be set for MapReduce applications. |
yarn.log-aggregation.retain-seconds | How long to keep aggregated logs before deleting them. | -1 | How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node. |
yarn.log-aggregation.retain-check-interval-seconds | How often to check whether aggregated logs should still be retained; 0 or a negative value means one-tenth of the aggregated log retention time. | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. Be careful, set this too small and you will spam the name node. |
mapreduce.framework.name | Execution framework. | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | Larger resource limit for maps. | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | Larger heap size for the child JVMs of maps. | -Xmx1024M | Larger heap-size for child JVMs of maps. |
mapreduce.reduce.memory.mb | Larger resource limit for reduces. | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | Larger heap size for the child JVMs of reduces. | -Xmx2560M | Larger heap-size for child JVMs of reduces. |
mapreduce.task.io.sort.mb | Higher memory limit while sorting data, for efficiency. | 512 | Higher memory-limit while sorting data for efficiency. |
mapreduce.task.io.sort.factor | Number of streams merged at once while sorting files. | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | Number of parallel copies run by reduces to fetch outputs from a very large number of maps. | 50 | Higher number of parallel copies run by reduces to fetch outputs from very large number of maps. |
mapreduce.jobhistory.address | MapReduce JobHistory Server address. | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server web UI address. | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | Directory where history files are written by MapReduce jobs. | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | Directory where history files are managed by the MR JobHistory Server. | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
See Also
- Default ports used by Hadoop: Default Ports Used by Hadoop Services (HDFS, MapReduce, YARN)